DPDK patches and discussions
* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
@ 2016-08-01 15:08 Kieran Mansley
  2016-08-03 15:19 ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Kieran Mansley @ 2016-08-01 15:08 UTC (permalink / raw)
  To: dev; +Cc: adrien.mazarguil

Apologies for coming a little late to this thread about the proposed new
API for filtering etc.

I've reviewed it based on Solarflare's needs and hardware capabilities,
and think the proposal is likely to be a big improvement on the current
system.

I worry slightly that the goal of having applications that are not aware
of the hardware they are running on will be difficult to meet.  My guess
is that the different hardware platforms will have so little overlap in
the functionality they support that to get best performance the
applications will still be heavily tailored to the subsets of the API
that the hardware they are using provides.  The discussion of filter
priorities is a good example of this: to get best performance the
application will want to use the hardware's filtering capabilities to do
all the heavy lifting, but the abilities of different NICs to support
particular priorities and combinations of filters will mean what works
very well for one NIC may well return "I can't do that" for another, and
vice versa.

One suggestion for extending the API further would be to allow it to
also describe transmit filters as well as receive filters.

There are also some filters that can prove very useful to our customers
and that, while they could be achieved through the careful insertion of
multiple filters with the right order and priorities, could be made more
application-friendly by giving them a more meaningful alias. For example:
  - multicast-mismatch (all multicast traffic that doesn't match another
filter and would otherwise be discarded)
  - unicast-mismatch (all unicast traffic that doesn't match another
filter and would otherwise be discarded)
  - all-multicast (all multicast traffic)
  - all-unicast (all unicast traffic)

Finally, I wonder if any thought has been given to dealing with
situations where there is a conflict between different virtual or
physical functions.  E.g. attempting to insert a MAC filter on one VF
that would steal traffic destined to a different VF.  Will it be up to
each driver to enforce these sorts of policies or will there be a
general vendor-neutral framework to deal with this?

I should reiterate that I think this will be a big improvement, so thank
you for proposing it.

Kieran


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-01 15:08 [dpdk-dev] [RFC] Generic flow director/filtering/classification API Kieran Mansley
@ 2016-08-03 15:19 ` Adrien Mazarguil
  0 siblings, 0 replies; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-03 15:19 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: dev

Hi Kieran,

On Mon, Aug 01, 2016 at 04:08:51PM +0100, Kieran Mansley wrote:
> Apologies for coming a little late to this thread about the proposed new
> API for filtering etc.
> 
> I've reviewed it based on Solarflare's needs and hardware capabilities,
> and think the proposal is likely to be a big improvement on the current
> system.
> 
> I worry slightly that the goal of having applications that are not aware
> of the hardware they are running on will be difficult to meet.  My guess
> is that the different hardware platforms will have so little overlap in
> the functionality they support that to get best performance the
> applications will still be heavily tailored to the subsets of the API
> that the hardware they are using provides.  The discussion of filter
> priorities is a good example of this: to get best performance the
> application will want to use the hardware's filtering capabilities to do
> all the heavy lifting, but the abilities of different NICs to support
> particular priorities and combinations of filters will mean what works
> very well for one NIC may well return "I can't do that" for another, and
> vice versa.

I also think most applications will end up using mostly generic rules, while
applications tailored for specific devices will use more features. In my
mind this is like how applications handle SSE/AVX/AltiVec/etc.
optimizations: they need to be aware that such features exist and have both
specific and more generic code paths. The query interface should help with that.
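
A minimal sketch of the idea, assuming validate/create entry points along
the lines of what the RFC proposes (exact prototypes are still in flux, so
the signatures below are illustrative):

  #include <rte_flow.h>

  /* Try a device-specific pattern first and fall back to a generic one,
   * much like SSE/AVX runtime dispatch. */
  static struct rte_flow *
  create_best_rule(uint16_t port_id, const struct rte_flow_attr *attr,
                   const struct rte_flow_item specific[],
                   const struct rte_flow_item generic[],
                   const struct rte_flow_action actions[])
  {
          struct rte_flow_error err;

          if (rte_flow_validate(port_id, attr, specific, actions, &err) == 0)
                  return rte_flow_create(port_id, attr, specific, actions, &err);
          return rte_flow_create(port_id, attr, generic, actions, &err);
  }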

> One suggestion for extending the API further would be to allow it to
> also describe transmit filters as well as receive filters.

Yes, TX is probably the next step. I think it will be part of the same API,
using patterns/actions similarly, only affecting the TX path instead. But
let's focus on the RX side for now.
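
For illustration only (nothing is specified for TX yet), one could imagine
a direction attribute keeping the rule form unchanged; the structure below
is hypothetical:

  /* Hypothetical: a direction bit selecting the TX path while the
   * pattern/actions form stays exactly the same. */
  struct rte_flow_attr attr = { .ingress = 0, .egress = 1 };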

> There are also some filters that can prove very useful to our customers
> and that, while they could be achieved through the careful insertion of
> multiple filters with the right order and priorities, could be made more
> application-friendly by giving them a more meaningful alias. For example:
>  - multicast-mismatch (all multicast traffic that doesn't match another
> filter and would otherwise be discarded)
>  - unicast-mismatch (all unicast traffic that doesn't match another
> filter and would otherwise be discarded)
>  - all-multicast (all multicast traffic)
>  - all-unicast (all unicast traffic)

Why not; those may be added as new pattern items if the community feels they
are necessary. But right now I do not think these are difficult to specify.
Of course one should dedicate priority levels far apart to avoid collisions
with more specific rules, but you still need priorities to determine which
of "all-multicast" or "unicast-mismatch" should match first.

> Finally, I wonder if any thought has been given to dealing with
> situations where there is a conflict between different virtual or
> physical functions.  E.g. attempting to insert a MAC filter on one VF
> that would steal traffic destined to a different VF.  Will it be up to
> each driver to enforce these sorts of policies or will there be a
> general vendor-neutral framework to deal with this?

PFs and VFs are a complex topic, eh? Consider that it is not even guaranteed
for a PF to be able to see VF-addressed traffic, as is currently the case
for mlx4 and mlx5 (AFAIK). It will be up to each PMD, but they must all
follow the same logic.

A flow rule with a VF pattern item should not be allowed on a PF device if
the PF is either unable to receive VF-addressed traffic, or if doing so
would prevent traffic from being received by a VF when the flow rule
specifies that it is supposed to pass through (either implicitly or through a
VF action).

Simply matching the MAC address of a VF from a PF (without specifying the VF
pattern item) should be allowed though. It may not work, as packets may not
be received at all, but if it does, the application should take care of the
consequences, as the VF may not receive packets anymore.

Creating or updating the MAC address of a VF after adding a conflicting
flow rule on a PF should either not be allowed or remain undefined.

All of this is not described in the specification yet because PF/VF patterns
and actions are not fully defined at the moment, there is still some
uncertainty about them.
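
Still, to give an idea of the intent, here is a sketch of a rule matching
VF-addressed traffic and explicitly passing it through to that VF (the VF
item/action contents are assumptions, precisely because they are not
finalized):

  /* Sketch only: match traffic addressed to VF 3 and make sure it is
   * still delivered to that VF.  Structure contents are not final. */
  struct rte_flow_item_vf vf_spec = { .id = 3 };
  struct rte_flow_item pattern[] = {
          { .type = RTE_FLOW_ITEM_TYPE_VF, .spec = &vf_spec },
          { .type = RTE_FLOW_ITEM_TYPE_END },
  };
  struct rte_flow_action actions[] = {
          { .type = RTE_FLOW_ACTION_TYPE_VF,
            .conf = &(struct rte_flow_action_vf){ .id = 3 } },
          { .type = RTE_FLOW_ACTION_TYPE_END },
  };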

> I should reiterate that I think this will be a big improvement, so thank
> you for proposing it.

Thanks!

-- 
Adrien Mazarguil
6WIND


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-26 10:07         ` Rahul Lakkireddy
  2016-08-03 16:44           ` Adrien Mazarguil
@ 2016-08-19 21:13           ` John Daley (johndale)
  1 sibling, 0 replies; 54+ messages in thread
From: John Daley (johndale) @ 2016-08-19 21:13 UTC (permalink / raw)
  To: Rahul Lakkireddy, John Fastabend, Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Wenzhuo Lu, Jan Medala, Jing Chen,
	Konstantin Ananyev, Matej Vido, Alejandro Lucero, Sony Chacko,
	Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

Hi, this is an old thread, but I'll reply to this instead of the RFC v2 since there is more context here.
Thanks for pushing the new api forward Adrien.
-john daley

> > >>> - Match range of Physical Functions (PFs) on the NIC in a single rule
> > >>>   via masks. For ex: match all traffic coming on several PFs.
> > >>
> > >> The PF and VF pattern items assume there is a single PF associated
> > >> with a DPDK port. VFs are identified with an ID. I basically took
> > >> the same definitions as the existing filter types, perhaps this is
> > >> not enough for Chelsio adapters.
> > >>
> > >> Do you expose more than one PF for a DPDK port?

The Cisco VIC can support multiple PFs per Ethernet port.  These are called virtual NICs (VNICs). It would be nice to be able to redirect matched Rx packets to another queue on another VNIC.

> > >>
> > >> Anyway, I'd suggest the same approach as above, automatic
> > >> aggregation of rules for performance reasons, otherwise new or
> > >> updated PF/VF pattern items, in which case it would be great if you
> > >> could provide ideal structure definitions for this use case.
> > >>
> > >
> > > In Chelsio hardware, all the ports of a device are exposed via a
> > > single PF (PF4). There could be many VFs attached to a PF.  Physical NIC
> > > functions are operational on PF4, while VFs can be attached to PFs 0-3.
> > > So, Chelsio hardware doesn't follow a one-to-one PF-to-port
> > > mapping assumption.
> > >
> > > There already seems to be a PF meta-item, but it doesn't seem to
> > > accept any "spec" and "mask" field.  Similarly, the VF meta-item
> > > doesn't seem to accept a "mask" field.  We could probably enable
> > > these fields in the PF and VF meta-items to allow configuration.
> >
I would like to see an ID property added to the PF action meta-item, where perhaps a BDF can be specified. This would potentially allow matched Rx packets to be redirected to another VNIC and could be paired with the QUEUE action meta-item to redirect to a specific queue on a VNIC. The PF ID property set to 0 would have the current specified behavior of redirecting to the current PF. Is something like this possible?
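
Something like the following, purely hypothetical sketch (no such
rte_flow_action_pf configuration structure or "id" field exists in the
RFC):

  /* Hypothetical: a PF action configuration carrying a VNIC/PF id,
   * where id == 0 keeps the current behavior of redirecting to this
   * PF. */
  struct rte_flow_action_pf { uint32_t id; };

  struct rte_flow_action actions[] = {
          { .type = RTE_FLOW_ACTION_TYPE_PF,
            .conf = &(struct rte_flow_action_pf){ .id = 2 } }, /* other VNIC */
          { .type = RTE_FLOW_ACTION_TYPE_QUEUE,
            .conf = &(struct rte_flow_action_queue){ .index = 4 } },
          { .type = RTE_FLOW_ACTION_TYPE_END },
  };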


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-10 13:37                   ` Adrien Mazarguil
@ 2016-08-10 16:46                     ` John Fastabend
  0 siblings, 0 replies; 54+ messages in thread
From: John Fastabend @ 2016-08-10 16:46 UTC (permalink / raw)
  To: Rahul Lakkireddy, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley,
	Jing Chen, Konstantin Ananyev, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

On 16-08-10 06:37 AM, Adrien Mazarguil wrote:
> On Tue, Aug 09, 2016 at 02:47:44PM -0700, John Fastabend wrote:
>> On 16-08-04 06:24 AM, Adrien Mazarguil wrote:
>>> On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
> [...]
>>>> The problem is that keeping priorities in order and/or possibly breaking
>>>> rules apart (e.g. you have an L2 table and an L3 table) becomes very
>>>> complex to manage at driver level. I think it's easier for the
>>>> application, which has some context, to do this. The application
>>>> "knows": if it's a router, for example, it will likely be able to
>>>> pack rules better than a PMD will.
>>>
>>> I don't think most applications know they are L2 or L3 routers. They may not
>>> know more than the pattern provided to the PMD, which may indeed end at a L2
>>> or L3 protocol. If the application simply chooses a table based on this
>>> information, then the PMD could have easily done the same.
>>>
>>
>> But when we start thinking about encap/decap then it's natural to start
>> using this interface to implement various forwarding dataplanes. And one
>> common way to organize a switch is into a TEP, router, switch
>> (mac/vlan), ACL tables, etc. In fact we see this topology starting to
>> show up in the NICs now.
>>
>> Further each table may be "managed" by a different entity. In which
>> case the software will want to manage the physical and virtual networks
>> separately.
>>
>> It doesn't make sense to me to require a software aggregator object to
>> marshal the rules into a flat table, only for a PMD to split them apart
>> again.
> 
> OK, my point was mostly about handling basic cases easily and making sure
> applications do not have to bother with petty HW details when they do not
> want to, yet still get maximum performance by having the PMD make the most
> appropriate choices automatically.
> 
> You've convinced me that in many cases PMDs won't be able to optimize
> efficiently and that conscious applications will know better. The API has to
> provide the ability to do so. I think it's fine as long as it is not
> mandatory.
> 

Great. I also agree that making the table feature _not_ mandatory for many
use cases will be helpful. I'm just making sure we get all the use cases I
know of covered.

>>> I understand the issue is what happens when applications really want to
>>> define e.g. L2/L3/L2 rules in this specific order (or any ordering that
>>> cannot be satisfied by HW due to table constraints).
>>>
>>> By exposing tables, in such a case applications should move all rules from
>>> L2 to a L3 table themselves (assuming this is even supported) to guarantee
>>> ordering between rules, or fail to add them. This is basically what the PMD
>>> could have done, possibly in a more efficient manner in my opinion.
>>
>> I disagree with the more efficient comment :)
>>
>> If the software layer is working on L2/TEP/ACL/router layers, merging
>> them just to pull them back apart is not going to be more efficient.
> 
> Moving flow rules around cannot be efficient by definition; however, I think
> that attempting to describe table capabilities may be as complicated as
> describing HW bit-masking features. Applications may get it wrong as a
> result, while a PMD would not make any mistake.
> 
> Your use case is valid though: if the application already groups rules, then
> sharing this information with the PMD would make sense from a performance
> standpoint.
> 
>>> Let's assume two opposite scenarios for this discussion:
>>>
>>> - App #1 is a command-line interface directly mapped to flow rules, which
>>>   basically gets slow random input from users depending on how they want to
>>>   configure their traffic. All rules differ considerably (L2, L3, L4, some
>>>   with incomplete bit-masks, etc). All in all, few but complex rules with
>>>   specific priorities.
>>>
>>
>> Agree with this, and in this case the application should be behind any
>> physical/virtual network and not giving rules like encap/decap/etc. This
>> application either sits on the physical function and "owns" the hardware
>> resource or sits behind a virtual switch.
>>
>>
>>> - App #2 is something like OVS, creating and deleting a large number of very
>>>   specific (without incomplete bit-masks) and mostly identical
>>>   single-priority rules automatically and very frequently.
>>>
>>
>> Maybe for OVS but not all virtual switches are built with flat tables
>> at the bottom like this. Nor is it necessarily optimal.
>>
>> Another application (the one I'm concerned about :) would be built as
>> a pipeline, something like
>>
>> 	ACL -> TEP -> ACL -> VEB -> ACL
>>
>> If I have hardware that supports a TEP hardware block, an ACL hardware
>> block, and a VEB block, for example, I don't want to merge my control
>> plane into a single table. The merging in this case is just pure
>> overhead/complexity for no gain.
> 
> It could be done by dedicating priority ranges for each item in the
> pipeline but then it would be clunky. OK then, let's discuss the best
> approach to implement this.
> 
> [...]
>>>> It's not about mask vs no mask. The devices with multiple tables that I
>>>> have don't have these mask limitations. It's about how to optimally pack
>>>> the rules and who implements that logic. I think it's best done in the
>>>> application, where I have the context.
>>>>
>>>> Is there a way to omit the table field if the PMD is expected to do
>>>> a best effort and add the table field if the user wants explicit
>>>> control over table mgmt. This would support both models. I at least
>>>> would like to have explicit control over rule population in my pipeline
>>>> for use cases where I'm building a pipeline on top of the hardware.
>>>
>>> Yes that's a possibility. Perhaps the table ID to use could be specified as
>>> a meta pattern item? We'd still need methods to report how many tables exist
>>> and perhaps some way to report their limitations; these could come later
>>> through a separate set of functions.
>>
>> Sure I think a meta pattern item would be fine or put it in the API call
>> directly, something like
>>
>>   rte_flow_create(port_id, pattern, actions);
>>   rte_flow_create_table(port_id, table_id, pattern, actions);
> 
> I suggest using a common method for both cases; either seems fine to me, as
> long as a default table value can be provided (zero) when applications do
> not care.
> 

Works for me; just use zero as the default when the application has no
preference and expects the PMD to do the table mapping.

> Now about table management, I think there is no need to expose table
> capabilities (in case they differ) but instead to provide guidelines as
> part of the specification to encourage application writers to group
> similar rules in tables. As previously discussed, flow rule priorities
> would be specific to the table they are assigned to.

This seems sufficient to me.

> 
> Like flow rules, table priorities could be handled through their index with
> index 0 having the highest priority. Like flow rule priorities, table
> indices wouldn't have to be contiguous.
> 
> If this works for you, how about renaming "tables" to "groups"?
> 

Works for me. And actually I like renaming them "groups", as this seems
more neutral with respect to how the hardware actually implements a group. For
example, I've worked on hardware with multiple Tunnel Endpoint engines
but we exposed it as a single "group" to simplify the user interface.

.John


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-10 11:02                 ` Adrien Mazarguil
@ 2016-08-10 16:35                   ` John Fastabend
  0 siblings, 0 replies; 54+ messages in thread
From: John Fastabend @ 2016-08-10 16:35 UTC (permalink / raw)
  To: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

On 16-08-10 04:02 AM, Adrien Mazarguil wrote:
> On Tue, Aug 09, 2016 at 02:24:26PM -0700, John Fastabend wrote:
> [...]
>>> Just an idea, could some kind of meta pattern items specifying time
>>> constraints for a rule address this issue? Say, how long (cycles/ms) the PMD
>>> may take to query/apply/delete the rule. If it cannot be guaranteed, the
>>> rule cannot be created. Applications could maintain statistic counters about
>>> failed rules to determine if performance issues are caused by the inability
>>> to create them.
>>
>> It seems a bit heavy to me to have each PMD driver implementing
>> something like this. But it would be interesting to explore, probably
>> after the basic support is implemented.
> 
> OK, let's keep this for later.
> 
> [...]
>>> Such a pattern could be generated from a separate function before feeding it
>>> to rte_flow_create(), or translated by the PMD afterwards assuming a
>>> separate meta item such as RAW_END exists to signal the end of a RAW layer.
>>> Of course processing this would be more expensive.
>>>
>>
>> Or the supported parse graph could be fetched from the hardware with the
>> values for each protocol so that the programming interface is the same.
>> The well known protocols could keep the 'enum values' in the header
>> rte_flow_item_type enum so that users would not be required to walk
>> the parse graph, but for new or experimental protocols we could query
>> the parse graph and get the programming pattern matching id for them.
>>
>> The normal flow would be unchanged but we don't get stuck upgrading
>> everything to add our own protocol. So the flow would be,
>>
>>  rte_get_parse_graph(graph);
>>  flow_item_proto = is_my_proto_supported(graph);
>>
>>  pattern = build_flow_match(flow_item_proto, value, mask);
>>  action = build_action();
>>  rte_flow_create(my_port, pattern, action);
>>
>> The only change to the API proposed to support this would be to allow
>> unsupported RTE_FLOW_ values to be pushed to the hardware and define
>> a range of values that are reserved for use by the parse graph discovery.
>>
>> This would not have to be any more expensive.
> 
> Makes sense. Unless made entirely out of RAW items, however, the ensuing
> pattern would not be portable across DPDK ports, instances, and versions if
> dumped in binary form for later use.
> 

Right.

> Since those would have to be recognized by PMDs and applications regardless of
> the API version, I suggest making generated item types negative (enums are
> signed, let's use that).

That works; then the normal positive enums maintain the list of
known/accepted protocols.

> 
> DPDK would have to maintain a list of allocated values to avoid collisions
> between PMDs. A method should be provided to release them.

The 'middle layer' could have a non-public API for PMDs to get new
values; call it get_flow_type_item_id() or something.
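
A trivial, hypothetical sketch of what such a helper could look like
(nothing like it exists; the name and the omitted locking/release logic
are hand-waved):

  /* Hand out negative item type values from a shared registry so
   * PMD-discovered protocols cannot collide; a release method and
   * locking are omitted for brevity. */
  static int next_dynamic_type = -1;

  int
  get_flow_type_item_id(void)
  {
          return next_dynamic_type--;
  }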

> 
> [...]
>> Hmm, for performance reasons building an entire graph up using RAW items
>> seems to be a bit heavy. Another alternative to the above parse graph
>> notion would be to allow users to add RAW node definitions at init time
>> and have the PMD give an ID back for those. Then the new node could be
>> used just like any other RTE_FLOW_ITEM_TYPE in a pattern.
>>
>> Something like,
>>
>> 	ret_flow_item_type_foo = rte_create_raw_node(foo_raw_pattern)
>> 	ret_flow_item_type_bar = rte_create_raw_node(bar_raw_pattern)
>>
>> then allow ret_flow_item_type_{foo|bar} to be used in subsequent
>> pattern matching items. And if the hardware cannot support this, return
>> an error from the initial rte_create_raw_node() API call.
>>
>> Do either of those proposals sound like reasonable extensions?
> 
> Both seem acceptable in my opinion as they fit in the described API. However
> I think it would be better for this function to spit out a pattern made of
> any number of items instead of a single new item type. That way existing
> fixed items can be reused as well, the entire pattern may even become
> portable as a result, and it could be considered a method to optimize a
> RAW pattern.
> 
> The RAW approach has the advantage of not requiring much additional code in
> the public API besides a couple of function declarations. A proper
> full-blown graph would require a lot more, as described in your original
> reply. Not sure which is better.
> 
> Either way they won't be part of the initial specification but it looks like
> they can be added later without affecting the basics.
> 

Right, it's not needed in the initial spec as long as we have a path to get
there, and it looks like we have two usable possibilities, so that works
for me.


>>> [...]
>>>> The two open items from me are do we need to support adding new variable
>>>> length headers? And how do we handle multiple tables I'll take that up
>>>> in the other thread.
>>>
>>> I think variable length headers may eventually be supported through pattern
>>> tricks or a separate conversion layer.
>>>
>>
>> A parse graph notion would support this naturally without pattern
>> tricks, hence my suggestions above.
> 
> All right, I agree a method to let applications precisely define what they
> want to match can be useful, now that I understand what you mean by
> "dynamically".
> 
>> Also in the current scheme how would I match an IPv6 option, a specific
>> NSH option, or an MPLS tag?
> 
> Ideally through specific pattern items defined for this purpose, which is
> how I thought the API would evolve. Of course it wouldn't be fully dynamic
> and you'd have to wait for a DPDK release that implements them.
> 

The only trouble is, if you don't know exactly where the option is in the
list of options (which you won't in general), it's a bit hard to get right
with the existing spec as best I can tell, because RAW patterns
would require you to know where the option is in the list, and an ANY
pattern wouldn't guarantee a match is in the correct header with stacked
headers. At least if I'm reading the spec correctly, it seems to be
an issue.
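
For reference, the closest I can get with the spec as written seems to be
the RAW item's search mode, a sketch using the fields from 4.1.4.2, which
still cannot tell which of several stacked headers the match landed in:

  /* Search for an option pattern within a window instead of at a fixed
   * offset; the match may still land in the wrong header when headers
   * are stacked. */
  static const uint8_t opt[2] = { 0x2b, 0x06 }; /* illustrative bytes */
  struct rte_flow_item_raw raw_spec = {
          .relative = 1,  /* offset counted from the previous item */
          .search = 1,    /* scan a window rather than a fixed offset */
          .offset = 0,
          .limit = 64,    /* size of the search window in bytes */
          .length = sizeof(opt),
          .pattern = opt,
  };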

.John


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-09 21:47                 ` John Fastabend
@ 2016-08-10 13:37                   ` Adrien Mazarguil
  2016-08-10 16:46                     ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-10 13:37 UTC (permalink / raw)
  To: John Fastabend
  Cc: Rahul Lakkireddy, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley,
	Jing Chen, Konstantin Ananyev, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

On Tue, Aug 09, 2016 at 02:47:44PM -0700, John Fastabend wrote:
> On 16-08-04 06:24 AM, Adrien Mazarguil wrote:
> > On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
[...]
> >> The problem is that keeping priorities in order and/or possibly breaking
> >> rules apart (e.g. you have an L2 table and an L3 table) becomes very
> >> complex to manage at driver level. I think it's easier for the
> >> application, which has some context, to do this. The application
> >> "knows": if it's a router, for example, it will likely be able to
> >> pack rules better than a PMD will.
> > 
> > I don't think most applications know they are L2 or L3 routers. They may not
> > know more than the pattern provided to the PMD, which may indeed end at a L2
> > or L3 protocol. If the application simply chooses a table based on this
> > information, then the PMD could have easily done the same.
> > 
> 
> But when we start thinking about encap/decap then it's natural to start
> using this interface to implement various forwarding dataplanes. And one
> common way to organize a switch is into a TEP, router, switch
> (mac/vlan), ACL tables, etc. In fact we see this topology starting to
> show up in the NICs now.
> 
> Further each table may be "managed" by a different entity. In which
> case the software will want to manage the physical and virtual networks
> separately.
> 
> It doesn't make sense to me to require a software aggregator object to
> marshal the rules into a flat table, only for a PMD to split them apart
> again.

OK, my point was mostly about handling basic cases easily and making sure
applications do not have to bother with petty HW details when they do not
want to, yet still get maximum performance by having the PMD make the most
appropriate choices automatically.

You've convinced me that in many cases PMDs won't be able to optimize
efficiently and that conscious applications will know better. The API has to
provide the ability to do so. I think it's fine as long as it is not
mandatory.

> > I understand the issue is what happens when applications really want to
> > define e.g. L2/L3/L2 rules in this specific order (or any ordering that
> > cannot be satisfied by HW due to table constraints).
> > 
> > By exposing tables, in such a case applications should move all rules from
> > L2 to a L3 table themselves (assuming this is even supported) to guarantee
> > ordering between rules, or fail to add them. This is basically what the PMD
> > could have done, possibly in a more efficient manner in my opinion.
> 
> I disagree with the more efficient comment :)
> 
> If the software layer is working on L2/TEP/ACL/router layers, merging
> them just to pull them back apart is not going to be more efficient.

Moving flow rules around cannot be efficient by definition; however, I think
that attempting to describe table capabilities may be as complicated as
describing HW bit-masking features. Applications may get it wrong as a
result, while a PMD would not make any mistake.

Your use case is valid though: if the application already groups rules, then
sharing this information with the PMD would make sense from a performance
standpoint.

> > Let's assume two opposite scenarios for this discussion:
> > 
> > - App #1 is a command-line interface directly mapped to flow rules, which
> >   basically gets slow random input from users depending on how they want to
> >   configure their traffic. All rules differ considerably (L2, L3, L4, some
> >   with incomplete bit-masks, etc). All in all, few but complex rules with
> >   specific priorities.
> > 
> 
> Agree with this, and in this case the application should be behind any
> physical/virtual network and not giving rules like encap/decap/etc. This
> application either sits on the physical function and "owns" the hardware
> resource or sits behind a virtual switch.
> 
> 
> > - App #2 is something like OVS, creating and deleting a large number of very
> >   specific (without incomplete bit-masks) and mostly identical
> >   single-priority rules automatically and very frequently.
> > 
> 
> Maybe for OVS but not all virtual switches are built with flat tables
> at the bottom like this. Nor is it necessarily optimal.
> 
> Another application (the one I'm concerned about :) would be built as
> a pipeline, something like
> 
> 	ACL -> TEP -> ACL -> VEB -> ACL
> 
> If I have hardware that supports a TEP hardware block, an ACL hardware
> block, and a VEB block, for example, I don't want to merge my control
> plane into a single table. The merging in this case is just pure
> overhead/complexity for no gain.

It could be done by dedicating priority ranges for each item in the
pipeline but then it would be clunky. OK then, let's discuss the best
approach to implement this.

[...]
> >> It's not about mask vs no mask. The devices with multiple tables that I
> >> have don't have these mask limitations. It's about how to optimally pack
> >> the rules and who implements that logic. I think it's best done in the
> >> application, where I have the context.
> >>
> >> Is there a way to omit the table field if the PMD is expected to do
> >> a best effort and add the table field if the user wants explicit
> >> control over table mgmt. This would support both models. I at least
> >> would like to have explicit control over rule population in my pipeline
> >> for use cases where I'm building a pipeline on top of the hardware.
> > 
> > Yes that's a possibility. Perhaps the table ID to use could be specified as
> > a meta pattern item? We'd still need methods to report how many tables exist
> > and perhaps some way to report their limitations; these could come later
> > through a separate set of functions.
> 
> Sure I think a meta pattern item would be fine or put it in the API call
> directly, something like
> 
>   rte_flow_create(port_id, pattern, actions);
>   rte_flow_create_table(port_id, table_id, pattern, actions);

I suggest using a common method for both cases; either seems fine to me, as
long as a default table value can be provided (zero) when applications do
not care.

Now about table management, I think there is no need to expose table
capabilities (in case they differ) but instead to provide guidelines as
part of the specification to encourage application writers to group
similar rules in tables. As previously discussed, flow rule priorities
would be specific to the table they are assigned to.

Like flow rules, table priorities could be handled through their index with
index 0 having the highest priority. Like flow rule priorities, table
indices wouldn't have to be contiguous.

If this works for you, how about renaming "tables" to "groups"?
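
For example, from the application side (a sketch, assuming a "group" field
ends up in the flow rule attributes; names are not final):

  /* An ACL-type rule in group 0 and a VEB-type rule in group 2, each
   * with its own per-group priority; group 0 is matched first. */
  struct rte_flow_attr acl_attr = { .group = 0, .priority = 0, .ingress = 1 };
  struct rte_flow_attr veb_attr = { .group = 2, .priority = 5, .ingress = 1 };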

-- 
Adrien Mazarguil
6WIND


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-09 21:24               ` John Fastabend
@ 2016-08-10 11:02                 ` Adrien Mazarguil
  2016-08-10 16:35                   ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-10 11:02 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

On Tue, Aug 09, 2016 at 02:24:26PM -0700, John Fastabend wrote:
[...]
> > Just an idea, could some kind of meta pattern items specifying time
> > constraints for a rule address this issue? Say, how long (cycles/ms) the PMD
> > may take to query/apply/delete the rule. If it cannot be guaranteed, the
> > rule cannot be created. Applications could maintain statistic counters about
> > failed rules to determine if performance issues are caused by the inability
> > to create them.
> 
> It seems a bit heavy to me to have each PMD driver implementing
> something like this. But it would be interesting to explore, probably
> after the basic support is implemented.

OK, let's keep this for later.

[...]
> > Such a pattern could be generated from a separate function before feeding it
> > to rte_flow_create(), or translated by the PMD afterwards assuming a
> > separate meta item such as RAW_END exists to signal the end of a RAW layer.
> > Of course processing this would be more expensive.
> > 
> 
> Or the supported parse graph could be fetched from the hardware with the
> values for each protocol so that the programming interface is the same.
> The well known protocols could keep the 'enum values' in the header
> rte_flow_item_type enum so that users would not be required to walk
> the parse graph, but for new or experimental protocols we could query
> the parse graph and get the programming pattern matching id for them.
> 
> The normal flow would be unchanged but we don't get stuck upgrading
> everything to add our own protocol. So the flow would be,
> 
>  rte_get_parse_graph(graph);
>  flow_item_proto = is_my_proto_supported(graph);
> 
>  pattern = build_flow_match(flow_item_proto, value, mask);
>  action = build_action();
>  rte_flow_create(my_port, pattern, action);
> 
> The only change to the API proposed to support this would be to allow
> unsupported RTE_FLOW_ values to be pushed to the hardware and define
> a range of values that are reserved for use by the parse graph discovery.
> 
> This would not have to be any more expensive.

Makes sense. Unless made entirely out of RAW items, however, the ensuing
pattern would not be portable across DPDK ports, instances, and versions if
dumped in binary form for later use.

Since those would have to be recognized by PMDs and applications regardless of
the API version, I suggest making generated item types negative (enums are
signed, let's use that).

DPDK would have to maintain a list of allocated values to avoid collisions
between PMDs. A method should be provided to release them.

[...]
> Hmm, for performance reasons building an entire graph up using RAW items
> seems to be a bit heavy. Another alternative to the above parse graph
> notion would be to allow users to add RAW node definitions at init time
> and have the PMD give an ID back for those. Then the new node could be
> used just like any other RTE_FLOW_ITEM_TYPE in a pattern.
> 
> Something like,
> 
> 	ret_flow_item_type_foo = rte_create_raw_node(foo_raw_pattern)
> 	ret_flow_item_type_bar = rte_create_raw_node(bar_raw_pattern)
> 
> then allow ret_flow_item_type_{foo|bar} to be used in subsequent
> pattern matching items. And if the hardware cannot support this, return
> an error from the initial rte_create_raw_node() API call.
> 
> Do either of those proposals sound like reasonable extensions?

Both seem acceptable in my opinion as they fit in the described API. However
I think it would be better for this function to spit out a pattern made of
any number of items instead of a single new item type. That way existing
fixed items can be reused as well, the entire pattern may even become
portable as a result, and it could be considered a method to optimize a
RAW pattern.

The RAW approach has the advantage of not requiring much additional code in
the public API besides a couple of function declarations. A proper
full-blown graph would require a lot more, as described in your original
reply. Not sure which is better.

Either way they won't be part of the initial specification but it looks like
they can be added later without affecting the basics.

> > [...]
> >> The two open items from me are do we need to support adding new variable
> >> length headers? And how do we handle multiple tables I'll take that up
> >> in the other thread.
> > 
> > I think variable length headers may eventually be supported through pattern
> > tricks or a separate conversion layer.
> > 
> 
> A parse graph notion would support this naturally without pattern
> tricks, hence my suggestions above.

All right, I agree a method to let applications precisely define what they
want to match can be useful, now that I understand what you mean by
"dynamically".

> Also in the current scheme how would I match an IPv6 option, a specific
> NSH option, or an MPLS tag?

Ideally through specific pattern items defined for this purpose, which is
how I thought the API would evolve. Of course it wouldn't be fully dynamic
and you'd have to wait for a DPDK release that implements them.

-- 
Adrien Mazarguil
6WIND


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-04 13:24               ` Adrien Mazarguil
@ 2016-08-09 21:47                 ` John Fastabend
  2016-08-10 13:37                   ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-08-09 21:47 UTC (permalink / raw)
  To: Rahul Lakkireddy, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley,
	Jing Chen, Konstantin Ananyev, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

On 16-08-04 06:24 AM, Adrien Mazarguil wrote:
> On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
>> [...]
>>
>>>>>>>> The proposal looks very good.  It satisfies most of the features
>>>>>>>> supported by Chelsio NICs.  We are looking for suggestions on exposing
>>>>>>>> more additional features supported by Chelsio NICs via this API.
>>>>>>>>
>>>>>>>> Chelsio NICs have two regions in which filters can be placed -
>>>>>>>> Maskfull and Maskless regions.  As their names imply, maskfull region
>>>>>>>> can accept masks to match a range of values; whereas, the maskless region
>>>>>>>> doesn't accept any masks and hence performs more strict exact-matches.
>>>>>>>> Filters without masks can also be placed in maskfull region.  By
>>>>>>>> default, maskless region have higher priority over the maskfull region.
>>>>>>>> However, the priority between the two regions is configurable.
>>>>>>>
>>>>>>> I understand this configuration affects the entire device. Just to be clear,
>>>>>>> assuming some filters are already configured, are they affected by a change
>>>>>>> of region priority later?
>>>>>>>
>>>>>>
>>>>>> Both the regions exist at the same time in the device.  Each filter can
>>>>>> either belong to maskfull or the maskless region.
>>>>>>
>>>>>> The priority is configured at time of filter creation for every
>>>>>> individual filter and cannot be changed while the filter is still
>>>>>> active. If priority needs to be changed for a particular filter then,
>>>>>> it needs to be deleted first and re-created.
>>>>>
>>>>> Could you model this as two tables and add a table_id to the API? This
>>>>> way user space could populate the table it chooses. We would have to add
>>>>> some capabilities attributes to "learn" if tables support masks or not
>>>>> though.
>>>>>
>>>>
>>>> This approach sounds interesting.
>>>
>>> Now I understand the idea behind these tables, however from an application
>>> point of view I still think it's better if the PMD could take care of flow
>>> rules optimizations automatically. Think about it, PMDs have exactly a
>>> single kind of device they know perfectly well to manage, while applications
>>> want the best possible performance out of any device in the most generic
>>> fashion.
>>
>> The problem is that keeping priorities in order and/or possibly breaking
>> rules apart (e.g. you have an L2 table and an L3 table) becomes very
>> complex to manage at driver level. I think it's easier for the
>> application, which has some context, to do this. The application
>> "knows": if it's a router, for example, it will likely be able to
>> pack rules better than a PMD will.
> 
> I don't think most applications know they are L2 or L3 routers. They may not
> know more than the pattern provided to the PMD, which may indeed end at a L2
> or L3 protocol. If the application simply chooses a table based on this
> information, then the PMD could have easily done the same.
> 

But when we start thinking about encap/decap then it's natural to start
using this interface to implement various forwarding dataplanes. And one
common way to organize a switch is into a TEP, router, switch
(mac/vlan), ACL tables, etc. In fact we see this topology starting to
show up in the NICs now.

Further each table may be "managed" by a different entity. In which
case the software will want to manage the physical and virtual networks
separately.

It doesn't make sense to me to require a software aggregator object to
marshal the rules into a flat table, only for a PMD to split them apart
again.

> I understand the issue is what happens when applications really want to
> define e.g. L2/L3/L2 rules in this specific order (or any ordering that
> cannot be satisfied by HW due to table constraints).
> 
> By exposing tables, in such a case applications should move all rules from
> L2 to a L3 table themselves (assuming this is even supported) to guarantee
> ordering between rules, or fail to add them. This is basically what the PMD
> could have done, possibly in a more efficient manner in my opinion.

I disagree with the more efficient comment :)

If the software layer is working on L2/TEP/ACL/router layers, merging
them just to pull them back apart is not going to be more efficient.

> 
> Let's assume two opposite scenarios for this discussion:
> 
> - App #1 is a command-line interface directly mapped to flow rules, which
>   basically gets slow random input from users depending on how they want to
>   configure their traffic. All rules differ considerably (L2, L3, L4, some
>   with incomplete bit-masks, etc). All in all, few but complex rules with
>   specific priorities.
> 

Agree with this, and in this case the application should be behind any
physical/virtual network and not giving rules like encap/decap/etc. This
application either sits on the physical function and "owns" the hardware
resource or sits behind a virtual switch.


> - App #2 is something like OVS, creating and deleting a large number of very
>   specific (without incomplete bit-masks) and mostly identical
>   single-priority rules automatically and very frequently.
> 

Maybe for OVS but not all virtual switches are built with flat tables
at the bottom like this. Nor is it necessarily optimal.

Another application (the one I'm concerned about :) would be built as
a pipeline, something like

	ACL -> TEP -> ACL -> VEB -> ACL

If I have hardware that supports a TEP hardware block, an ACL hardware
block, and a VEB block, for example, I don't want to merge my control
plane into a single table. The merging in this case is just pure
overhead/complexity for no gain.

> Actual applications will certainly be a mix of both.
> 
> For app #1, users would have to be aware of these tables and base their
> filtering decisions according to them. Reporting table capabilities and making
> sure priorities between tables are well configured will be their
> responsibility. Obviously applications may take care of these details for
> them, but the end result will be the same. At some point, some combination
> won't be possible. Getting there was only more complicated from
> users/applications point of view.
> 
> For app #2 if the first rule can be created then subsequent rules shouldn't
> be a problem until their number reaches device limits. Selecting the proper
> table to use for these can easily be done by the PMD.
> 

But it requires rewriting my pipeline software to be useful, and this I
want to avoid. Using my TEP example again, I'll need something in
software to catch every VEB/ACL rule and append the rest of the rule,
creating wide rules. For my use cases it's not a very user-friendly API.

>>>>> I don't see how the PMD can sort this out in any meaningful way and it
>>>>> has to be exposed to the application that has the intelligence to 'know'
>>>>> priorities between masks and non-masks filters. I'm sure you could come
>>>>> up with something but it would be less than ideal in many cases I would
>>>>> guess and we can't have the driver getting priorities wrong or we may
>>>>> not get the correct behavior.
>>>
>>> It may be solved by having the PMD maintain a SW state to quickly know which
>>> rules are currently created and in what state the device is so basically the
>>> application doesn't have to perform this work.
>>>
>>> This API allows applications to express basic needs such as "redirect
>>> packets matching this pattern to that queue". It must not deal with HW
>>> details and limitations in my opinion. If a request cannot be satisfied,
>>> then the rule cannot be created. No help from the application must be
>>> expected by PMDs, otherwise it opens the door to the same issues as the
>>> legacy filtering APIs.
>>
>> This depends on the application and what/how it wants to manage the
>> device. If the application manages a pipeline with some set of tables,
>> then mapping this down to a single table, which then the PMD has to
>> unwind back to a multi-table topology, to me seems like a waste.
> 
> Of course, it's only that I am not sure applications will behave differently if they
> are aware of HW tables. I fear it will make things more complicated for
> them and they will just stick with the most capable table all the time, but
> I agree it should be easier for PMDs.
> 

On the other hand, if the API doesn't match my software pipeline, the
complexity/overhead of merging it just to tear it apart again may
prohibit use of the API in these cases.

>>> [...]
>>>>>> Unfortunately, our maskfull region is extremely small too compared to
>>>>>> maskless region.
>>>>>>
>>>>>
>>>>> To me this means a userspace application would want to pack it
>>>>> carefully to get the full benefit. So you need some mechanism to specify
>>>>> the "region" hence the above table proposal.
>>>>>
>>>>
>>>> Right. Makes sense.
>>>
>>> I do not agree, applications should not be aware of it. Note this case can
>>> be handled differently, so that rules do not have to be moved back and forth
>>> between both tables. If the first created rule requires a maskfull entry,
>>> then all subsequent rules will be entered into that table. Otherwise no
>>> maskfull entry can be created as long as there is one maskless entry. When
>>> either table is full, no more rules may be added. Would that work for you?
>>>
>>
>> It's not about mask vs no mask. The devices with multiple tables that I
>> have don't have these mask limitations. It's about how to optimally pack
>> the rules and who implements that logic. I think it's best done in the
>> application, where I have the context.
>>
>> Is there a way to omit the table field if the PMD is expected to do
>> a best effort and add the table field if the user wants explicit
>> control over table mgmt. This would support both models. I at least
>> would like to have explicit control over rule population in my pipeline
>> for use cases where I'm building a pipeline on top of the hardware.
> 
> Yes that's a possibility. Perhaps the table ID to use could be specified as
> a meta pattern item? We'd still need methods to report how many tables exist
> and perhaps some way to report their limitations; these could come later
> through a separate set of functions.

Sure I think a meta pattern item would be fine or put it in the API call
directly, something like

  rte_flow_create(port_id, pattern, actions);
  rte_flow_create_table(port_id, table_id, pattern, actions);


> 
> [...]
>>>>> For this adding a meta-data item seems simplest to me. And if you want
>>>>> to make the default to be only a single port that would maybe make it
>>>>> easier for existing apps to port from flow director. Then if an
>>>>> application cares it can create a list of ports if needed.
>>>>>
>>>>
>>>> Agreed.
>>>
>>> However, although I'm not opposed to adding dedicated meta items, remember
>>> applications will not automatically benefit from the increased performance
>>> if only a single PMD implements this feature; their maintainers will probably not
>>> bother with it.
>>>
>>
>> Unless, as we noted in the other thread, the application is closely bound to
>> its hardware for capability reasons. In this case it would make sense
>> to implement.
> 
> Sure.
> 
> [...]
> 


* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-04 13:05             ` Adrien Mazarguil
@ 2016-08-09 21:24               ` John Fastabend
  2016-08-10 11:02                 ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-08-09 21:24 UTC (permalink / raw)
  To: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

[...]

>> I'm not sure I understand 'bit granularity' here. I would say we have
>> devices now that have rather strange restrictions due to hardware
>> implementation. Going forward we should get better hardware and a lot
>> of this will go away in my view. Yes this is a long term view and
>> doesn't help the current state. The overall point you are making is that
>> the sum of all these strange/odd bits in the hardware implementation
>> means capabilities queries are very difficult to guarantee on existing
>> hardware, and I think you've convinced me. Thanks ;)
> 
> Precisely. By "bit granularity" I meant that while it is fairly easy to
> report whether bit-masking is supported on protocol fields such as MAC
> addresses at all, devices may have restrictions on the possible bit-masks,
> like they may only have an effect at byte level (0xff), may not allow
> specific bits (broadcast), or there may even be a fixed set of bit-masks to
> choose from.

Yep lots of strange hardware implementation voodoo here.

> 
> [...]
>>> I understand, however I think this approach may be too low-level to express
>>> all the possible combinations. This graph would have to include possible
>>> actions for each possible pattern, all while considering that some actions
>>> are not possible with some patterns and that there are exclusive actions.
>>>
>>
>> Really? You have hardware that has dependencies between the parser and
>> the supported actions? Ugh...
> 
> Not that I know of actually, even though we cannot rule out this
> possibility.
> 
> Here are the possible cases I have in mind with existing HW:
> 
> - Too many actions specified for a single rule, even though each of them is
>   otherwise supported.

Yep most hardware will have this restriction.

> 
> - Performing several encap/decap actions. None are defined in the initial
>   specification but these are already planned.
> 

Great this is certainly needed.

> - Assuming there is a single table from the application point of view
>   (separate discussion for the other thread), some actions may only be
>   possible with the right pattern item or meta item. Asking HW to perform
>   tunnel decap may only be safe if the pattern specifically matches that
>   protocol.
> 

Yep, continue in the other thread.

>> If the hardware has separate tables then we shouldn't try to have the
>> PMD flatten those into a single table because we will have no way of
>> knowing how to do that. (I'll respond to the other thread on this in
>> an attempt to not get to scattered).
> 
> OK, will reply there as well.
> 
>>> Also while memory consumption is not really an issue, such a graph may be
>>> huge. It could take a while for the PMD to update it when adding a rule
>>> impacting capabilities.
>>
>> Ugh... I wouldn't suggest updating the capabilities at runtime like
>> this. But I see your point: if the graph has to _guarantee_ correctness,
>> how does it represent a limited number of masks and other strange HW?
>> It's unfortunate the hardware isn't more regular.
>>
>> You have convinced me that guaranteed correctness via capabilities
>> is going to be difficult for many types of devices, although not all.
> 
> I'll just add that these capabilities also depend on side effects of
> configuration performed outside the scope of this API. The way queues are
> (re)initialized or offloads configured may affect them. RSS configuration is
> the most obvious example.
> 

OK.

[...]

>>
>> My concern is this non-determinism will create performance issues in
>> the network because when a flow may or may not be offloaded this can
>> have a rather significant impact on its performance. This can make
>> debugging network wide performance miserable when at time X I get
>> performance X and then for whatever reason something degrades to
>> software and at time Y I get some performance Y << X. I suspect that
>> in general applications will bind tightly with hardware they know
>> works.
> 
> You are right, performance determinism is not taken into account at all, at
> least not yet. It should not be an issue at the beginning as long as the
> API has the ability evolve later for applications that need it.
> 
> Just an idea, could some kind of meta pattern items specifying time
> constraints for a rule address this issue? Say, how long (cycles/ms) the PMD
> may take to query/apply/delete the rule. If it cannot be guaranteed, the
> rule cannot be created. Applications could maintain statistic counters about
> failed rules to determine if performance issues are caused by the inability
> to create them.

It seems a bit heavy to me to have each PMD driver implementing
something like this. But it would be interesting to explore, probably
after the basic support is implemented.

> 
> [...]
>>> For individual points:
>>>
>>> (i) should be doable with the query API without recompiling DPDK as well,
>>> the fact API/ABI breakage must be avoided being part of the requirements. If
>>> you think there is a problem regarding this, can you provide a specific
>>> example?
>>
>> What I was after is something you noted yourself in the doc here:
>>
>> "PMDs can rely on this capability to simulate support for protocols with
>> fixed headers not directly recognized by hardware."
>>
>> I was trying to get variable header support with the RAW capabilities. A
>> parse graph supports this, for example, while the proposed query API does not.
> 
> OK, I see, however the RAW capability itself may not be supported everywhere
> in patterns. What I described is that PMDs, not applications, could leverage
> the RAW abilities of underlying devices to implement otherwise unsupported
> but fixed patterns.
> 
> So basically you would like to expose the ability to describe fixed protocol
> definitions following RAW patterns, as in:

Correct for say some new tunnel metadata or something.

> 
>  ETH / RAW / IP / UDP / ...
> 
> While with such a pattern the current specification makes RAW (4.1.4.2) and
> IP start matching from the same offset as two different branches, in effect
> you cannot specify a fixed protocol following a RAW item.

What this means, though, is that for every new protocol we will need to
rebuild drivers and DPDK. For a shared-lib DPDK environment or a Linux
distribution this can be painful. It would be best to avoid this.

> 
> It is defined that way because I do not see how HW could parse higher level
> protocols after having given up due to a RAW pattern, however assuming the
> entire stack is described only using RAW patterns I guess it could be done.
> 
> Such a pattern could be generated from a separate function before feeding it
> to rte_flow_create(), or translated by the PMD afterwards assuming a
> separate meta item such as RAW_END exists to signal the end of a RAW layer.
> Of course processing this would be more expensive.
> 

Or the supported parse graph could be fetched from the hardware with the
values for each protocol so that the programming interface is the same.
The well-known protocols could keep their 'enum values' in the
rte_flow_item_type enum header so that users would not be required to
walk the parse graph, but for new or experimental protocols we could
query the parse graph and get the programming pattern-matching ID for them.

The normal flow would be unchanged but we don't get stuck upgrading
everything to add our own protocol. So the flow would be,

 rte_get_parse_graph(graph);
 flow_item_proto = is_my_proto_supported(graph);

 pattern = build_flow_match(flow_item_proto, value, mask);
 action = build_action();
 rte_flow_create(my_port, pattern, action);

The only change to the API proposed to support this would be to allow
unsupported RTE_FLOW_ values to be pushed to the hardware and define
a range of values that are reserved for use by the parse graph discovery.

This would not have to be any more expensive.
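To sketch the discovery side a bit more (everything below is hypothetical:
the node structure, the helper and the reserved dynamic ID range illustrate
the proposal, they are not existing DPDK API):

	#include <stdint.h>
	#include <string.h>

	/* Hypothetical: item type IDs at or above this value are assigned
	 * dynamically from the parse graph instead of rte_flow_item_type. */
	#define FLOW_ITEM_TYPE_DYN_BASE 0x10000

	struct flow_parse_node {
		uint32_t id;      /* usable as a pattern item type */
		const char *name; /* e.g. "my_tunnel_proto" */
		uint16_t hdr_len; /* header length the parser consumes */
	};

	/* Hypothetical lookup over the nodes returned by the query. */
	static uint32_t
	find_item_type(const struct flow_parse_node *nodes, unsigned int n,
		       const char *name)
	{
		unsigned int i;

		for (i = 0; i != n; ++i)
			if (strcmp(nodes[i].name, name) == 0)
				return nodes[i].id;
		return 0; /* this port's parser does not know the protocol */
	}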

[...]

>>>>> So you can put it after "known"
>>>>> variable length headers like IP. The limitation is it can't get past
>>>>> undefined variable length headers.
>>>
>>> RTE_FLOW_ITEM_TYPE_ANY is made for that purpose. Is that what you are
>>> looking for?
>>>
>>
>> But FLOW_ITEM_TYPE_ANY skips "any" header type, as I understand it; if
>> we have a new variable-length header in the future we will have to add
>> a new type, RTE_FLOW_ITEM_TYPE_FOO for example. The RAW type will work
>> for fixed headers as noted above.
> 
> I'm (slowly) starting to get it. How about the suggestion I made above for
> RAW items then?

Hmm, for performance reasons building an entire graph up using RAW items
seems to be a bit heavy. Another alternative to the above parse graph
notion would be to allow users to add RAW node definitions at init time
and have the PMD give an ID back for those. Then the new node could be
used just like any other RTE_FLOW_ITEM_TYPE in a pattern.

Something like,

	ret_flow_item_type_foo = rte_create_raw_node(foo_raw_pattern)
	ret_flow_item_type_bar = rte_create_raw_node(bar_raw_pattern)

then allow ret_flow_item_type_{foo|bar} to be used in subsequent
pattern matching items. And if the hardware cannot support this, return
an error from the initial rte_create_raw_node() API call.
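Rough usage sketch, where rte_create_raw_node(), the raw pattern argument
and the spec/mask variables are all hypothetical, and where pattern items
are assumed to carry type/spec/mask as in the RFC draft:

	/* Hypothetical: register a fixed-layout header once at init time. */
	uint32_t foo_type = rte_create_raw_node(port_id, &foo_raw_pattern);

	if (foo_type == 0)
		return -ENOTSUP; /* HW cannot parse this header */

	/* The returned ID then behaves like any built-in item type. */
	const struct rte_flow_item items[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = foo_type, .spec = &foo_spec, .mask = &foo_mask },
	};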

Do either of those proposals sound like reasonable extensions?

> 
> [...]
>> The two open items from me are: do we need to support adding new
>> variable-length headers? And how do we handle multiple tables? I'll take
>> that up in the other thread.
> 
> I think variable-length headers may eventually be supported through
> pattern tricks or a separate conversion layer.
> 

A parse graph notion would support this naturally though without pattern
tricks hence my above suggestions.

Also, in the current scheme, how would I match an IPv6 option, a specific
NSH option, or an MPLS tag?

>>>>> I looked at the git repo but I only saw the header definition. I guess
>>>>> the implementation is TBD until there is enough agreement on the
>>>>> interface?
>>>
>>> Precisely, I intend to update the tree and send a v2 soon (unfortunately did
>>> not have much time these past few days to work on this).
>>>
>>> Now what if, instead of a seemingly complex parse graph and still in
>>> addition to the query method, enum values were defined for PMDs to report
>>> an array of supported items, typical patterns and actions so applications
>>> can get a quick idea of what devices are capable of without being too
>>> specific. Something like:
>>>
>>>  enum rte_flow_capability {
>>>      RTE_FLOW_CAPABILITY_ITEM_ETH,
>>>      RTE_FLOW_CAPABILITY_PATTERN_ETH_IP_TCP,
>>>      RTE_FLOW_CAPABILITY_ACTION_ID,
>>>      ...
>>>  };
>>>
>>> I'm not convinced about the usefulness of this because it would have to
>>> be maintained separately, but it would be easier than building a dummy
>>> flow rule for simple query purposes.
>>
>> I'm not sure it's necessary either at first.
> 
> Then I'll discard this idea.
> 
>>> The main question I have for you is, do you think the core of the specified
>>> API is adequate, assuming it can be extended later with new methods?
>>>
>>
>> The above two items are my only opens at this point, I agree with your
>> summary of my capabilities proposal namely it can be added.
> 
> Thanks, see you in the other thread.
> 

Thanks,
John

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-03 19:11             ` John Fastabend
@ 2016-08-04 13:24               ` Adrien Mazarguil
  2016-08-09 21:47                 ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-04 13:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: Rahul Lakkireddy, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley,
	Jing Chen, Konstantin Ananyev, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
> [...]
> 
> >>>>>> The proposal looks very good.  It satisfies most of the features
> >>>>>> supported by Chelsio NICs.  We are looking for suggestions on exposing
> >>>>>> more additional features supported by Chelsio NICs via this API.
> >>>>>>
> >>>>>> Chelsio NICs have two regions in which filters can be placed -
> >>>>>> Maskfull and Maskless regions.  As their names imply, maskfull region
> >>>>>> can accept masks to match a range of values; whereas, maskless region
> >>>>>> don't accept any masks and hence perform a more strict exact-matches.
> >>>>>> Filters without masks can also be placed in maskfull region.  By
> >>>>>> default, maskless region have higher priority over the maskfull region.
> >>>>>> However, the priority between the two regions is configurable.
> >>>>>
> >>>>> I understand this configuration affects the entire device. Just to be clear,
> >>>>> assuming some filters are already configured, are they affected by a change
> >>>>> of region priority later?
> >>>>>
> >>>>
> >>>> Both the regions exist at the same time in the device.  Each filter can
> >>>> either belong to maskfull or the maskless region.
> >>>>
> >>>> The priority is configured at time of filter creation for every
> >>>> individual filter and cannot be changed while the filter is still
> >>>> active. If priority needs to be changed for a particular filter then,
> >>>> it needs to be deleted first and re-created.
> >>>
> >>> Could you model this as two tables and add a table_id to the API? This
> >>> way user space could populate the table it chooses. We would have to add
> >>> some capabilities attributes to "learn" if tables support masks or not
> >>> though.
> >>>
> >>
> >> This approach sounds interesting.
> > 
> > Now I understand the idea behind these tables; however, from an application
> > point of view I still think it's better if the PMD could take care of flow
> > rule optimizations automatically. Think about it: PMDs have exactly a
> > single kind of device they know perfectly well to manage, while applications
> > want the best possible performance out of any device in the most generic
> > fashion.
> 
> The problem is that keeping priorities in order and/or possibly breaking
> rules apart (e.g. you have an L2 table and an L3 table) becomes very
> complex to manage at driver level. I think it's easier for the
> application, which has some context, to do this. The application "knows"
> if it's a router, for example, and will likely be able to pack rules
> better than a PMD will.

I don't think most applications know they are L2 or L3 routers. They may not
know more than the pattern provided to the PMD, which may indeed end at a L2
or L3 protocol. If the application simply chooses a table based on this
information, then the PMD could have easily done the same.

I understand the issue is what happens when applications really want to
define e.g. L2/L3/L2 rules in this specific order (or any ordering that
cannot be satisfied by HW due to table constraints).

By exposing tables, in such a case applications should move all rules from
an L2 to an L3 table themselves (assuming this is even supported) to guarantee
ordering between rules, or fail to add them. This is basically what the PMD
could have done, possibly in a more efficient manner in my opinion.

Let's assume two opposite scenarios for this discussion:

- App #1 is a command-line interface directly mapped to flow rules, which
  basically gets slow random input from users depending on how they want to
  configure their traffic. All rules differ considerably (L2, L3, L4, some
  with incomplete bit-masks, etc). All in all, few but complex rules with
  specific priorities.

- App #2 is something like OVS, creating and deleting a large number of very
  specific (without incomplete bit-masks) and mostly identical
  single-priority rules automatically and very frequently.

Actual applications will certainly be a mix of both.

For app #1, users would have to be aware of these tables and base their
filtering decisions according to them. Reporting tables capabilities, making
sure priorities between tables are well configured will be their
responsibility. Obviously applications may take care of these details for
them, but the end result will be the same. At some point, some combination
won't be possible; getting there will just have been more complicated from
the users'/applications' point of view.

For app #2 if the first rule can be created then subsequent rules shouldn't
be a problem until their number reaches device limits. Selecting the proper
table to use for these can easily be done by the PMD.

> >>> I don't see how the PMD can sort this out in any meaningful way and it
> >>> has to be exposed to the application that has the intelligence to 'know'
> >>> priorities between masks and non-masks filters. I'm sure you could come
> >>> up with something but it would be less than ideal in many cases I would
> >>> guess and we can't have the driver getting priorities wrong or we may
> >>> not get the correct behavior.
> > 
> > It may be solved by having the PMD maintain a SW state to quickly know which
> > rules are currently created and in what state the device is so basically the
> > application doesn't have to perform this work.
> > 
> > This API allows applications to express basic needs such as "redirect
> > packets matching this pattern to that queue". It must not deal with HW
> > details and limitations in my opinion. If a request cannot be satisfied,
> > then the rule cannot be created. No help from the application must be
> > expected by PMDs, otherwise it opens the door to the same issues as the
> > legacy filtering APIs.
> 
> This depends on the application and what/how it wants to manage the
> device. If the application manages a pipeline with some set of tables,
> then mapping this down to a single table, which the PMD then has to
> unwind back into a multi-table topology, seems like a waste to me.

Of course, it's just that I am not sure applications will behave differently
if they are aware of HW tables. I fear it will make things more complicated for
them and they will just stick with the most capable table all the time, but
I agree it should be easier for PMDs.

> > [...]
> >>>> Unfortunately, our maskfull region is extremely small too compared to
> >>>> maskless region.
> >>>>
> >>>
> >>> To me this means a userspace application would want to pack it
> >>> carefully to get the full benefit. So you need some mechanism to specify
> >>> the "region" hence the above table proposal.
> >>>
> >>
> >> Right. Makes sense.
> > 
> > I do not agree, applications should not be aware of it. Note this case can
> > be handled differently, so that rules do not have to be moved back and forth
> > between both tables. If the first created rule requires a maskfull entry,
> > then all subsequent rules will be entered into that table. Otherwise no
> > maskfull entry can be created as long as there is one maskless entry. When
> > either table is full, no more rules may be added. Would that work for you?
> > 
> 
> It's not about mask vs no mask. The devices with multiple tables that I
> have don't have these mask limitations. It's about how to optimally pack
> the rules and who implements that logic. I think it's best done in the
> application, where I have the context.
> 
> Is there a way to omit the table field if the PMD is expected to do
> a best effort, and add the table field if the user wants explicit
> control over table mgmt? This would support both models. I at least
> would like to have explicit control over rule population in my pipeline
> for use cases where I'm building a pipeline on top of the hardware.

Yes that's a possibility. Perhaps the table ID to use could be specified as
a meta pattern item? We'd still need methods to report how many tables exist
and perhaps some way to report their limitations; these could be added later
through a separate set of functions.
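For instance (purely illustrative, no such meta item exists in the
specification):

	/* Hypothetical meta item: selects the HW table receiving the rule.
	 * Omitting it from a pattern would leave table selection to the
	 * PMD (best effort), giving explicit control only to applications
	 * that want it. */
	struct rte_flow_item_table {
		uint32_t id; /* table index reported by a future query call */
	};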

[...]
> >>> For this adding a meta-data item seems simplest to me. And if you want
> >>> to make the default to be only a single port that would maybe make it
> >>> easier for existing apps to port from flow director. Then if an
> >>> application cares it can create a list of ports if needed.
> >>>
> >>
> >> Agreed.
> > 
> > However although I'm not opposed to adding dedicated meta items, remember
> > applications will not automatically benefit from the increased performance
> > if only a single PMD implements this feature; their maintainers will probably not
> > bother with it.
> > 
> 
> Unless, as we noted in the other thread, the application is closely bound to
> its hardware for capability reasons. In this case it would make sense
> to implement.

Sure.

[...]

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-03 18:10           ` John Fastabend
@ 2016-08-04 13:05             ` Adrien Mazarguil
  2016-08-09 21:24               ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-04 13:05 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

On Wed, Aug 03, 2016 at 11:10:49AM -0700, John Fastabend wrote:
> [...]
> 
> >>>> Considering that allowed pattern/actions combinations cannot be known in
> >>>> advance and would result in an impractically large number of capabilities to
> >>>> expose, a method is provided to validate a given rule from the current
> >>>> device configuration state without actually adding it (akin to a "dry run"
> >>>> mode).
> >>>
> >>> Rather than have a query/validate process, why did we jump over having an
> >>> intermediate representation of the capabilities? Here you state it is
> >>> impractical, but we know how to represent parse graphs and the drivers
> >>> could report their supported parse graph via a single query to a middle
> >>> layer.
> >>>
> >>> This will actually reduce the msg chatter: imagine many applications at
> >>> init time, or boundary cases where a large set of applications come
> >>> online at once and start banging on the interface all at once; that seems
> >>> less than ideal.
> > 
> > Well, I also thought about a kind of graph to represent capabilities but
> > feared the extra complexity would not be worth the trouble, thus settled on
> > the query idea. A couple more reasons:
> > 
> > - Capabilities evolve at the same time as devices are configured. For
> >   example, if a device supports a single RSS context, then a single rule
> >   with a RSS action may be created. The graph would have to be rewritten
> >   accordingly and thus queried/parsed again by the application.
> 
> The graph would not help here because this is an action
> restriction, not a parsing restriction. This is yet another query to see
> what actions are supported and how many of each action are supported.
> 
>    get_parse_graph - report the parsable fields
>    get_actions - report the supported actions and possible num of each

OK, now I understand your idea; in my mind the graph was indeed supposed to
represent complete flow rules.

> > - Expressing capabilities at bit granularity (say, for a matching pattern
> >   item mask) is complex, there is no way to simplify the representation of
> >   capabilities without either losing information or making the graph more
> >   complex to parse than simply providing a flow rule from an application
> >   point of view.
> > 
> 
> I'm not sure I understand 'bit granularity' here. I would say we have
> devices now that have rather strange restrictions due to hardware
> implementation. Going forward we should get better hardware and a lot
> of this will go away in my view. Yes this is a long term view and
> doesn't help the current state. The overall point you are making is that
> the sum of all these strange/odd bits in the hardware implementation
> means capabilities queries are very difficult to guarantee. On existing
> hardware, at least, I think you've convinced me. Thanks ;)

Precisely. By "bit granularity" I meant that while it is fairly easy to
report whether bit-masking is supported on protocol fields such as MAC
addresses at all, devices may have restrictions on the possible bit-masks,
like they may only have an effect at byte level (0xff), may not allow
specific bits (broadcast), or there may even be a fixed set of bit-masks to
choose from.
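To make this concrete with MAC address masks (the acceptance rules below
describe a made-up device, not any particular NIC):

	/* Byte-granular mask, e.g. match on the OUI only: commonly fine. */
	static const uint8_t oui_mask[6] =
		{ 0xff, 0xff, 0xff, 0x00, 0x00, 0x00 };
	/* Arbitrary bit pattern: some devices would reject this mask. */
	static const uint8_t odd_mask[6] =
		{ 0xff, 0x0f, 0xff, 0xff, 0xff, 0xff };
	/* Single bit (multicast): may only work if it belongs to a fixed
	 * set of masks the device offers. */
	static const uint8_t mcast_mask[6] =
		{ 0x01, 0x00, 0x00, 0x00, 0x00, 0x00 };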

[...]
> > I understand, however I think this approach may be too low-level to express
> > all the possible combinations. This graph would have to include possible
> > actions for each possible pattern, all while considering that some actions
> > are not possible with some patterns and that there are exclusive actions.
> > 
> 
> Really? You have hardware that has dependencies between the parser and
> the supported actions? Ugh...

Not that I know of actually, even though we cannot rule out this
possibility.

Here are the possible cases I have in mind with existing HW:

- Too many actions specified for a single rule, even though each of them is
  otherwise supported.

- Performing several encap/decap actions. None are defined in the initial
  specification but these are already planned.

- Assuming there is a single table from the application point of view
  (separate discussion for the other thread), some actions may only be
  possible with the right pattern item or meta item. Asking HW to perform
  tunnel decap may only be safe if the pattern specifically matches that
  protocol (see the sketch below).
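To illustrate the third case, assuming item/action entries as in the RFC
draft and a hypothetical VXLAN decap action (none is defined yet):

	/* Decap is only safe when the pattern guarantees the matched
	 * packets really are VXLAN. */
	const struct rte_flow_item items[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_UDP },
		{ .type = RTE_FLOW_ITEM_TYPE_VXLAN },
	};
	const struct rte_flow_action actions[] = {
		/* Hypothetical action, planned but not defined yet. */
		{ .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
	};
	/* The same action with a bare ETH pattern should fail validation. */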

> If the hardware has separate tables then we shouldn't try to have the
> PMD flatten those into a single table because we will have no way of
> knowing how to do that. (I'll respond to the other thread on this in
> an attempt to not get too scattered).

OK, will reply there as well.

> > Also while memory consumption is not really an issue, such a graph may be
> > huge. It could take a while for the PMD to update it when adding a rule
> > impacting capabilities.
> 
> Ugh... I wouldn't suggest updating the capabilities at runtime like
> this. But I see your point: if the graph has to _guarantee_ correctness,
> how does it represent a limited number of masks and other strange hw?
> It's unfortunate the hardware isn't more regular.
> 
> You have convinced me that guaranteed correctness via capabilities
> is going to be difficult for many types of devices, although not all.

I'll just add that these capabilities also depend on side effects of
configuration performed outside the scope of this API. The way queues are
(re)initialized or offloads configured may affect them. RSS configuration is
the most obvious example.

> [...]
> 
> >>
> >> The cost doing all this is some additional overhead at init time. But
> >> building generic function over this and having a set of predefined
> >> uids for well-known protocols such as ip, udp, tcp, etc. helps. What you
> >> get for the cost is a few things that I think are worth it. (i) Now
> >> new protocols can be added/removed without recompiling DPDK (ii) a
> >> software package can use the capability query to verify the required
> >> protocols are off-loadable vs a possibly large set of test queries and
> >> (iii) when we do the programming of the device we can provide a tuple
> >> (table-uid, header-uid, field-uid, value, mask, priority) and the
> >> middle layer "knowing" the above graph can verify the command so
> >> drivers only ever see "good"  commands, (iv) finally it should be
> >> faster in terms of cmds per second because the drivers can map the
> >> tuple (table, header, field, priority) to a slot efficiently vs
> >> parsing.
> >>
> >> IMO point (iii) and (iv) will in practice make the code much simpler
> >> because we can maintain common middle layer and not require parsing
> >> by drivers. Making each driver simpler by abstracting into common
> >> layer.
> > 
> > Before answering your points, let's consider how applications are going to
> > be written. Not only devices do not support all possible pattern/actions
> > combinations, they also have memory constraints. Whichever method
> > applications use to determine if a flow rule is supported, at some point
> > they won't be able to add any more due to device limitations.
> > 
> > Sane applications designed to work regardless of the underlying device won't
> > simply call abort() at this point but provide a software fallback
> > instead. My bet is that applications will provide one every time a rule
> > cannot be added for any reason, they won't even bother to query capabilities
> > except perhaps for a very small subset, as in "does this device support the
> > ID action at all?".
> > 
> > Applications that really want/need to know at init time whether all the
> > rules they may want to possibly create are supported will spend about the
> > same time in both cases (query or graph). For queries, by iterating on a
> > list of typical rules. For a graph, by walking through it. Either way, it
> > won't be done later from the data path.
> 
> The queries and graph suffer from the same problems you noted above if
> actually instantiating the rules will impact what rules are allowed. So
> in both cases we may run into corner cases, but it seems that this is a
> result of hardware deficiencies and can't easily be solved in software.
> 
> My concern is that this non-determinism will create performance issues in
> the network, because whether a flow may or may not be offloaded can
> have a rather significant impact on its performance. This can make
> debugging network-wide performance miserable: at one moment I get
> performance X, and then for whatever reason something degrades to
> software and later I get some performance Y << X. I suspect that
> in general applications will bind tightly with hardware they know
> works.

You are right, performance determinism is not taken into account at all, at
least not yet. It should not be an issue at the beginning as long as the
API has the ability to evolve later for applications that need it.

Just an idea, could some kind of meta pattern items specifying time
constraints for a rule address this issue? Say, how long (cycles/ms) the PMD
may take to query/apply/delete the rule. If it cannot be guaranteed, the
rule cannot be created. Applications could maintain statistics counters about
failed rules to determine if performance issues are caused by the inability
to create them.
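Such a meta item might look like this (name and fields invented for
illustration only):

	/* Hypothetical meta item: management-path time constraints.
	 * Never matched against packets; only interpreted at rule
	 * creation time. A value of 0 means "don't care". */
	struct rte_flow_item_time_constraint {
		uint32_t apply_us;  /* max time to apply the rule */
		uint32_t query_us;  /* max time to query the rule */
		uint32_t delete_us; /* max time to delete the rule */
	};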

[...]
> > For individual points:
> > 
> > (i) should be doable with the query API without recompiling DPDK as well,
> > the fact API/ABI breakage must be avoided being part of the requirements. If
> > you think there is a problem regarding this, can you provide a specific
> > example?
> 
> What I was after, you noted yourself in the doc here,
> 
> "PMDs can rely on this capability to simulate support for protocols with
> fixed headers not directly recognized by hardware."
> 
> I was trying to get variable header support with the RAW capabilities. A
> parse graph supports this, for example; the proposed query API does not.

OK, I see, however the RAW capability itself may not be supported everywhere
in patterns. What I described is that PMDs, not applications, could leverage
the RAW abilities of underlying devices to implement otherwise unsupported
but fixed patterns.

So basically you would like to expose the ability to describe fixed protocol
definitions following RAW patterns, as in:

 ETH / RAW / IP / UDP / ...

While with such a pattern the current specification makes RAW (4.1.4.2) and
IP start matching from the same offset as two different branches, in effect
you cannot specify a fixed protocol following a RAW item.

It is defined that way because I do not see how HW could parse higher level
protocols after having given up due to a RAW pattern, however assuming the
entire stack is described only using RAW patterns I guess it could be done.

Such a pattern could be generated from a separate function before feeding it
to rte_flow_create(), or translated by the PMD afterwards assuming a
separate meta item such as RAW_END exists to signal the end of a RAW layer.
Of course processing this would be more expensive.
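A translated pattern might then look as follows, with the hypothetical
RAW_END meta item and item entries as in the RFC draft:

	/* Hypothetical: a fixed protocol following a RAW layer. RAW_END
	 * tells the PMD where the raw-described region stops so parsing
	 * can resume with known items. */
	const struct rte_flow_item items[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_RAW, .spec = &raw_spec },
		{ .type = RTE_FLOW_ITEM_TYPE_RAW_END }, /* hypothetical */
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_UDP },
	};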

[...]
> >>> One strategy I've used in other systems that worked relatively well
> >>> is if the query for the parse graph above returns a key for each node
> >>> in the graph then a single lookup can map the key to a node. Its
> >>> unambiguous and then these operations simply become a table lookup.
> >>> So to be a bit more concrete this changes the pattern structure in
> >>> rte_flow_create() into a  <key,value,mask> tuple where the key is known
> >>> by the initial parse graph query. If you reserve a set of well-defined
> >>> key values for well known protocols like ethernet, ip, etc. then the
> >>> query model also works but the middle layer catches errors in this case
> >>> and again the driver only gets known good flows. So something like this,
> >>>
> >>>   struct rte_flow_pattern {
> >>> 	uint32_t priority;
> >>> 	uint32_t key;
> >>> 	uint32_t value_length;
> >>> 	u8 *value;
> >>>   }
> > 
> > I agree that having an integer representing an entire pattern/actions combo
> > would be great, however how do you tell whether you want matched packets to
> > be duplicated to queue 6 and redirected to queue 3? This method can be used
> > to check if a type of rule is allowed but not whether it is actually
> > applicable. You still need to provide the entire pattern/actions description
> > to create a flow rule.
> 
> In reality it's almost the same as your proposal; it just took me a moment
> to see it. The only difference I can see is that adding new headers via the
> RAW type only supports fixed-length headers.
> 
> To answer your question, the flow_pattern would have to include an action
> set as well to give a list of actions to perform. I just didn't include
> it here.

OK.

> >>> Also if we have multiple tables what do you think about adding a
> >>> table_id to the signature. Probably not needed in the first generation
> >>> but is likely useful for hardware with multiple tables so that it
> >>> would be,
> >>>
> >>>    rte_flow_create(uint8_t port_id, uint8_t table_id, ...);
> > 
> > Not sure if I understand the table ID concept, do you mean in case a device
> > supports entirely different sets of features depending on something? (What?)
> > 
> 
> In many devices we support multiple tables, each with their own size,
> match fields and action set. This is useful for building routers, for
> example, along with lots of other constructs. The basic idea is that
> smashing everything into a single table creates a Cartesian product
> problem.

Right, so I understand we'd need a method to express table capabilities as
well as you described (a topic for the other thread then).

[...]
> >>> So you can put it after "known"
> >>> variable length headers like IP. The limitation is it can't get past
> >>> undefined variable length headers.
> > 
> > RTE_FLOW_ITEM_TYPE_ANY is made for that purpose. Is that what you are
> > looking for?
> > 
> 
> But FLOW_ITEM_TYPE_ANY skips "any" header type, as I understand it; if
> we have a new variable-length header in the future we will have to add
> a new type, RTE_FLOW_ITEM_TYPE_FOO for example. The RAW type will work
> for fixed headers as noted above.

I'm (slowly) starting to get it. How about the suggestion I made above for
RAW items then?

[...]
> The two open items from me are: do we need to support adding new
> variable-length headers? And how do we handle multiple tables? I'll take
> that up in the other thread.

I think variable-length headers may eventually be supported through
pattern tricks or a separate conversion layer.

> >>> I looked at the git repo but I only saw the header definition. I guess
> >>> the implementation is TBD until there is enough agreement on the
> >>> interface?
> > 
> > Precisely, I intend to update the tree and send a v2 soon (unfortunately did
> > not have much time these past few days to work on this).
> > 
> > Now what if, instead of a seemingly complex parse graph and still in
> > addition to the query method, enum values were defined for PMDs to report
> > an array of supported items, typical patterns and actions so applications
> > can get a quick idea of what devices are capable of without being too
> > specific. Something like:
> > 
> >  enum rte_flow_capability {
> >      RTE_FLOW_CAPABILITY_ITEM_ETH,
> >      RTE_FLOW_CAPABILITY_PATTERN_ETH_IP_TCP,
> >      RTE_FLOW_CAPABILITY_ACTION_ID,
> >      ...
> >  };
> > 
> > I'm not convinced about the usefulness of this because it would have to
> > be maintained separately, but it would be easier than building a dummy
> > flow rule for simple query purposes.
> 
> I'm not sure it's necessary either at first.

Then I'll discard this idea.

> > The main question I have for you is, do you think the core of the specified
> > API is adequate, assuming it can be extended later with new methods?
> > 
> 
> The above two items are my only opens at this point, I agree with your
> summary of my capabilities proposal namely it can be added.

Thanks, see you in the other thread.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-03 16:44           ` Adrien Mazarguil
@ 2016-08-03 19:11             ` John Fastabend
  2016-08-04 13:24               ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-08-03 19:11 UTC (permalink / raw)
  To: Rahul Lakkireddy, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley,
	Jing Chen, Konstantin Ananyev, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

[...]

>>>>>> The proposal looks very good.  It satisfies most of the features
>>>>>> supported by Chelsio NICs.  We are looking for suggestions on exposing
>>>>>> more additional features supported by Chelsio NICs via this API.
>>>>>>
>>>>>> Chelsio NICs have two regions in which filters can be placed -
>>>>>> Maskfull and Maskless regions.  As their names imply, the maskfull region
>>>>>> can accept masks to match a range of values, whereas the maskless region
>>>>>> doesn't accept any masks and hence performs stricter exact matches.
>>>>>> Filters without masks can also be placed in the maskfull region.  By
>>>>>> default, the maskless region has higher priority than the maskfull region.
>>>>>> However, the priority between the two regions is configurable.
>>>>>
>>>>> I understand this configuration affects the entire device. Just to be clear,
>>>>> assuming some filters are already configured, are they affected by a change
>>>>> of region priority later?
>>>>>
>>>>
>>>> Both the regions exist at the same time in the device.  Each filter can
>>>> either belong to maskfull or the maskless region.
>>>>
>>>> The priority is configured at time of filter creation for every
>>>> individual filter and cannot be changed while the filter is still
>>>> active. If priority needs to be changed for a particular filter then,
>>>> it needs to be deleted first and re-created.
>>>
>>> Could you model this as two tables and add a table_id to the API? This
>>> way user space could populate the table it chooses. We would have to add
>>> some capabilities attributes to "learn" if tables support masks or not
>>> though.
>>>
>>
>> This approach sounds interesting.
> 
> Now I understand the idea behind these tables; however, from an application
> point of view I still think it's better if the PMD could take care of flow
> rule optimizations automatically. Think about it: PMDs have exactly a
> single kind of device they know perfectly well to manage, while applications
> want the best possible performance out of any device in the most generic
> fashion.

The problem is that keeping priorities in order and/or possibly breaking
rules apart (e.g. you have an L2 table and an L3 table) becomes very
complex to manage at driver level. I think it's easier for the
application, which has some context, to do this. The application "knows"
if it's a router, for example, and will likely be able to pack rules
better than a PMD will.

> 
>>> I don't see how the PMD can sort this out in any meaningful way and it
>>> has to be exposed to the application that has the intelligence to 'know'
>>> priorities between masks and non-masks filters. I'm sure you could come
>>> up with something but it would be less than ideal in many cases I would
>>> guess and we can't have the driver getting priorities wrong or we may
>>> not get the correct behavior.
> 
> It may be solved by having the PMD maintain a SW state to quickly know which
> rules are currently created and in what state the device is so basically the
> application doesn't have to perform this work.
> 
> This API allows applications to express basic needs such as "redirect
> packets matching this pattern to that queue". It must not deal with HW
> details and limitations in my opinion. If a request cannot be satisfied,
> then the rule cannot be created. No help from the application must be
> expected by PMDs, otherwise it opens the door to the same issues as the
> legacy filtering APIs.

This depends on the application and what/how it wants to manage the
device. If the application manages a pipeline with some set of tables,
then mapping this down to a single table, which the PMD then has to
unwind back into a multi-table topology, seems like a waste to me.

> 
> [...]
>>>> Unfortunately, our maskfull region is extremely small too compared to
>>>> maskless region.
>>>>
>>>
>>> To me this means a userspace application would want to pack it
>>> carefully to get the full benefit. So you need some mechanism to specify
>>> the "region" hence the above table proposal.
>>>
>>
>> Right. Makes sense.
> 
> I do not agree, applications should not be aware of it. Note this case can
> be handled differently, so that rules do not have to be moved back and forth
> between both tables. If the first created rule requires a maskfull entry,
> then all subsequent rules will be entered into that table. Otherwise no
> maskfull entry can be created as long as there is one maskless entry. When
> either table is full, no more rules may be added. Would that work for you?
> 

It's not about mask vs no mask. The devices with multiple tables that I
have don't have these mask limitations. It's about how to optimally pack
the rules and who implements that logic. I think it's best done in the
application, where I have the context.

Is there a way to omit the table field if the PMD is expected to do
a best effort, and add the table field if the user wants explicit
control over table mgmt? This would support both models. I at least
would like to have explicit control over rule population in my pipeline
for use cases where I'm building a pipeline on top of the hardware.

>> [...]
>>>>> Now about this "promisc" match criteria, it can be added as a new meta
>>>>> pattern item (4.1.3 Meta item types). Do you want it to be defined from the
>>>>> start or add it later with the related code in your PMD?
>>>>>
>>>>
>>>> It could be added as a meta item.  If there are other interested
>>>> parties, it can be added now.  Otherwise, we'll add it with our filtering
>>>> related code.
>>>>
>>>
>>> hmm I guess by "promisc" here you mean match packets received from the
>>> wire before they have been switched by the silicon?
>>>
>>
>> Match packets received from the wire before they have been switched by
>> the silicon, which also includes packets not destined for the DUT that
>> were still received due to the interface being in promisc mode.
> 
> I think it's fine, but we'll have to precisely define what happens when a
> packet matched with such pattern is part of a terminating rule. For instance
> if it is duplicated by HW, then the rule cannot be terminating.
> 
> [...]
>>>> This raises another interesting question.  What should the PMD do
>>>> if it has support to only a subset of fields in the particular item?
>>>>
>>>> For example, if a rule has been sent to match IP fragmentation along
>>>> with several other IPv4 fields, and if the underlying hardware doesn't
>>>> support matching based on IP fragmentation, does the PMD reject the
>>>> complete rule although it could have done the matching for rest of the
>>>> IPv4 fields?
>>>
>>> I think it has to fail the command; otherwise user space will not have
>>> any way to understand that the full match criteria cannot be met, and
>>> we will get different behavior for the same applications on different
>>> NICs depending on the hardware feature set. This will most likely break
>>> applications, so we need the error IMO.
>>>
>>
>> Ok. Makes sense.
> 
> Yes, I fully agree with this.
> 
>>>>>> - Match range of physical ports on the NIC in a single rule via masks.
>>>>>>   For ex: match all UDP packets coming on ports 3 and 4 out of 4
>>>>>>   ports available on the NIC.
>>>>>
>>>>> Applications create flow rules per port, I'd thus suggest that the PMD
>>>>> should detect identical rules created on different ports and aggregate them
>>>>> as a single HW rule automatically.
>>>>>
>>>>> If you think this approach is not right, the alternative is a meta pattern
>>>>> item that provides a list of ports. I'm not sure this is the right approach
>>>>> considering it would most likely not be supported by most NICs. Applications
>>>>> may not request it explicitly.
>>>>>
>>>>
>>>> Aggregating via PMD will be an expensive operation since it would involve:
>>>> - Search of existing filters.
>>>> - Deleting those filters.
>>>> - Creating a single combined filter.
>>>>
>>>> And all of the above 3 operations would need to be atomic so as not to
>>>> affect existing traffic which is hitting the above filters.
> 
> Atomicity may not be a problem if the PMD makes sure the new combined rule
> is inserted before the others, so they do not need to be removed either.
> 
>>>> Adding a
>>>> meta item would be a simpler solution here.
> 
> Yes, clearly.
> 
>>> For this adding a meta-data item seems simplest to me. And if you want
>>> to make the default to be only a single port that would maybe make it
>>> easier for existing apps to port from flow director. Then if an
>>> application cares it can create a list of ports if needed.
>>>
>>
>> Agreed.
> 
> However although I'm not opposed to adding dedicated meta items, remember
> applications will not automatically benefit from the increased performance
> if only a single PMD implements this feature; their maintainers will probably not
> bother with it.
> 

Unless, as we noted in the other thread, the application is closely bound to
its hardware for capability reasons. In this case it would make sense
to implement.

>>>>>> - Match range of Physical Functions (PFs) on the NIC in a single rule
>>>>>>   via masks. For ex: match all traffic coming on several PFs.
>>>>>
>>>>> The PF and VF pattern items assume there is a single PF associated with a
>>>>> DPDK port. VFs are identified with an ID. I basically took the same
>>>>> definitions as the existing filter types, perhaps this is not enough for
>>>>> Chelsio adapters.
>>>>>
>>>>> Do you expose more than one PF for a DPDK port?
>>>>>
>>>>> Anyway, I'd suggest the same approach as above, automatic aggregation of
>>>>> rules for performance reasons, otherwise new or updated PF/VF pattern items,
>>>>> in which case it would be great if you could provide ideal structure
>>>>> definitions for this use case.
>>>>>
>>>>
>>>> In Chelsio hardware, all the ports of a device are exposed via single
>>>> PF4. There could be many VFs attached to a PF.  Physical NIC functions
>>>> are operational on PF4, while VFs can be attached to PFs 0-3.
>>>> So, Chelsio hardware isn't tied to a PF-to-port, one-to-one
>>>> mapping assumption.
>>>>
>>>> There already seems to be a PF meta-item, but it doesn't seem to accept
>>>> any "spec" and "mask" field.  Similarly, the VF meta-item doesn't
>>>> seem to accept a "mask" field.  We could probably enable these fields
>>>> in the PF and VF meta-items to allow configuration.
>>>
>>> Maybe a range field would help here as well? So you could specify a VF
>>> range. It might be one of the things to consider adding later though if
>>> there is no clear use for it now.
>>>
>>
>> VF-value and VF-mask would help to achieve the desired filter.
>> VF-mask would also enable to specify a range of VF values.
> 
> Like John, I think a range or even a list instead of a mask would be
> better; the PMD can easily create a mask from that if necessary. The
> reason is that we've always had bad experiences with bit-fields: they're
> always too short at some point, and we would like to avoid having to
> break the ABI to update existing pattern items later.

Agreed, avoiding bit-fields is a good idea.
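For instance, a list-based VF item could look like this (hypothetical
layout, not what the RFC currently defines):

	/* Hypothetical: match traffic addressed to any VF in the list,
	 * avoiding bit-field masks entirely. */
	struct rte_flow_item_vf {
		uint32_t nb_ids;     /* number of entries in ids[] */
		const uint32_t *ids; /* VF IDs, e.g. { 2, 3, 4 } */
	};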

> 
> Also while I don't think this is the case yet, perhaps it will be a good
> idea for PFs/VFs to have globally unique IDs, just like DPDK ports.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-03 14:30         ` Adrien Mazarguil
@ 2016-08-03 18:10           ` John Fastabend
  2016-08-04 13:05             ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-08-03 18:10 UTC (permalink / raw)
  To: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

[...]

>>>> Considering that allowed pattern/actions combinations cannot be known in
>>>> advance and would result in an impractically large number of capabilities to
>>>> expose, a method is provided to validate a given rule from the current
>>>> device configuration state without actually adding it (akin to a "dry run"
>>>> mode).
>>>
>>> Rather than have a query/validate process, why did we jump over having an
>>> intermediate representation of the capabilities? Here you state it is
>>> impractical, but we know how to represent parse graphs and the drivers
>>> could report their supported parse graph via a single query to a middle
>>> layer.
>>>
>>> This will actually reduce the msg chatter: imagine many applications at
>>> init time, or boundary cases where a large set of applications come
>>> online at once and start banging on the interface all at once; that seems
>>> less than ideal.
> 
> Well, I also thought about a kind of graph to represent capabilities but
> feared the extra complexity would not be worth the trouble, thus settled on
> the query idea. A couple more reasons:
> 
> - Capabilities evolve at the same time as devices are configured. For
>   example, if a device supports a single RSS context, then a single rule
>   with a RSS action may be created. The graph would have to be rewritten
>   accordingly and thus queried/parsed again by the application.

The graph would not help here because this is an action
restriction, not a parsing restriction. This is yet another query to see
what actions are supported and how many of each action are supported.

   get_parse_graph - report the parsable fields
   get_actions - report the supported actions and possible num of each
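In rough prototype form (both functions, and the structures they fill,
are hypothetical):

	/* Hypothetical: one entry per supported action, with how many
	 * rules may carry it. */
	struct flow_action_cap {
		uint32_t action_type; /* e.g. an RSS or QUEUE action */
		uint32_t max_rules;
	};

	struct flow_parse_node; /* as sketched elsewhere in the thread */

	/* Hypothetical capability queries, decoupled from rule creation. */
	int get_parse_graph(uint8_t port_id, struct flow_parse_node *nodes,
			    unsigned int *nb_nodes);
	int get_actions(uint8_t port_id, struct flow_action_cap *caps,
			unsigned int *nb_caps);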

> 
> - Expressing capabilities at bit granularity (say, for a matching pattern
>   item mask) is complex, there is no way to simplify the representation of
>   capabilities without either losing information or making the graph more
>   complex to parse than simply providing a flow rule from an application
>   point of view.
> 

I'm not sure I understand 'bit granularity' here. I would say we have
devices now that have rather strange restrictions due to hardware
implementation. Going forward we should get better hardware and a lot
of this will go away in my view. Yes this is a long term view and
doesn't help the current state. The overall point you are making is that
the sum of all these strange/odd bits in the hardware implementation
means capabilities queries are very difficult to guarantee. On existing
hardware, at least, I think you've convinced me. Thanks ;)

> With that in mind, I am not opposed to the idea, both methods could even
> coexist, with the query function eventually evolving to become a front-end
> to a capability graph. Just remember that I am only defining the
> fundamentals for the initial implementation, i.e. how rules are expressed as
> patterns/actions and the basic functions to manage them, ideally without
> having to redefine them ever.
> 

Agreed, they should be able to coexist. So I can get my capabilities
queries as a layer on top of the API here.

>> A bit more details on possible interface for capabilities query,
>>
>> One way I've used to describe these graphs from driver to software
>> stacks is to use a set of structures to build the graph. For fixed
>> graphs this could just be *.h file for programmable hardware (typically
>> coming from fw update on nics) the driver can read the parser details
>> out of firmware and render the structures.
> 
> I understand, however I think this approach may be too low-level to express
> all the possible combinations. This graph would have to include possible
> actions for each possible pattern, all while considering that some actions
> are not possible with some patterns and that there are exclusive actions.
> 

Really? You have hardware that has dependencies between the parser and
the supported actions? Ugh...

If the hardware has separate tables then we shouldn't try to have the
PMD flatten those into a single table because we will have no way of
knowing how to do that. (I'll respond to the other thread on this in
an attempt to not get too scattered).

> Also while memory consumption is not really an issue, such a graph may be
> huge. It could take a while for the PMD to update it when adding a rule
> impacting capabilities.

Ugh... I wouldn't suggest updating the capabilities at runtime like
this. But I see your point: if the graph has to _guarantee_ correctness,
how does it represent a limited number of masks and other strange hw?
It's unfortunate the hardware isn't more regular.

You have convinced me that guaranteed correctness via capabilities
is going to be difficult for many types of devices, although not all.

[...]

>>
>> The cost doing all this is some additional overhead at init time. But
>> building generic function over this and having a set of predefined
>> uids for well-known protocols such as ip, udp, tcp, etc. helps. What you
>> get for the cost is a few things that I think are worth it. (i) Now
>> new protocols can be added/removed without recompiling DPDK (ii) a
>> software package can use the capability query to verify the required
>> protocols are off-loadable vs a possibly large set of test queries and
>> (iii) when we do the programming of the device we can provide a tuple
>> (table-uid, header-uid, field-uid, value, mask, priority) and the
>> middle layer "knowing" the above graph can verify the command so
>> drivers only ever see "good"  commands, (iv) finally it should be
>> faster in terms of cmds per second because the drivers can map the
>> tuple (table, header, field, priority) to a slot efficiently vs
>> parsing.
>>
>> IMO point (iii) and (iv) will in practice make the code much simpler
>> because we can maintain common middle layer and not require parsing
>> by drivers. Making each driver simpler by abstracting into common
>> layer.
> 
> Before answering your points, let's consider how applications are going to
> be written. Not only devices do not support all possible pattern/actions
> combinations, they also have memory constraints. Whichever method
> applications use to determine if a flow rule is supported, at some point
> they won't be able to add any more due to device limitations.
> 
> Sane applications designed to work regardless of the underlying device won't
> simply call abort() at this point but provide a software fallback
> instead. My bet is that applications will provide one every time a rule
> cannot be added for any reason, they won't even bother to query capabilities
> except perhaps for a very small subset, as in "does this device support the
> ID action at all?".
> 
> Applications that really want/need to know at init time whether all the
> rules they may want to possibly create are supported will spend about the
> same time in both cases (query or graph). For queries, by iterating on a
> list of typical rules. For a graph, by walking through it. Either way, it
> won't be done later from the data path.

The queries and graph suffer from the same problems you noted above if
actually instantiating the rules will impact what rules are allowed. So
in both cases we may run into corner cases, but it seems that this is a
result of hardware deficiencies and can't easily be solved in software.

My concern is that this non-determinism will create performance issues in
the network, because whether a flow may or may not be offloaded can
have a rather significant impact on its performance. This can make
debugging network-wide performance miserable: at one moment I get
performance X, and then for whatever reason something degrades to
software and later I get some performance Y << X. I suspect that
in general applications will bind tightly with hardware they know
works.

> 
> I think that for an application maintainer, writing or even generating a set
> of typical rules will also be easier than walking through a graph. It should
> also be easier on the PMD side.
> 

I tend to think getting a graph and doing operations on graphs is easier
myself but I can see this is a matter of opinion/style.

> For individual points:
> 
> (i) should be doable with the query API without recompiling DPDK as well,
> the fact API/ABI breakage must be avoided being part of the requirements. If
> you think there is a problem regarding this, can you provide a specific
> example?

What I was after, you noted yourself in the doc here,

"PMDs can rely on this capability to simulate support for protocols with
fixed headers not directly recognized by hardware."

I was trying to get variable header support with the RAW capabilities. A
parse graph supports this, for example; the proposed query API does not.

> 
> (ii) as described above, I think this use case won't be very common in the
> wild, except for applications designed for a specific device and then they
> will probably know enough about it to skip the query step entirely. If time
> must be spent anyway, it will be in the control path at initialization
> time.
> 

OK.

> (iii) misses the fact that capabilities evolve as flow rules get added,
> there is no way for PMDs to only see "valid" rules also because device
> limitations may prevent adding an otherwise valid rule.

OK, I agree; for devices with this evolving characteristic we are lost.

> 
> (iv) could be true if not for the same reason as (iii). The graph would have
> to be verified again before adding another rule. Note that PMD maintainers
> are encouraged to make their query function as fast as possible, they may
> rely on static data internally for this as well.
> 

OK, I'm not going to get hung up on this because I think it's an
implementation detail and not an API problem. I would prefer to be
pragmatic and see how fast the API is before I bikeshed it to death for
no good reason.

>>> Worse in my opinion it requires all drivers to write mostly duplicating
>>> validation code where a common layer could easily do this if every
>>> driver reported a common data structure representing its parse graph
>>> instead. The nice fallout of this initial effort upfront is the driver
>>> no longer needs to do error handling/checking/etc and can assume all
>>> rules are correct and valid. It makes driver code much simpler to
>>> support. And IMO at least by doing this we get some other nice benefits
>>> described below.
> 
> About duplicated code, my usual reply is that DPDK will provide internal
> helper methods to assist PMDs with rules management/parsing/etc. These are
> not discussed in the specification because I wanted everyone to agree to the
> application side of things first, and it is difficult to know how much
> assistance PMDs might need without an initial implementation.
> 
> I think this private API will be built at the same time as support is added
> to PMDs and maintainers notice generic code that can be shared.
> Documentation may be written later once things start to settle down.

OK, let's see.

> 
>>> Another related question is about performance.
>>>
>>>> Creation
>>>> ~~~~~~~~
>>>>
>>>> Creating a flow rule is similar to validating one, except the rule is
>>>> actually created.
>>>>
>>>> ::
>>>>
>>>>  struct rte_flow *
>>>>  rte_flow_create(uint8_t port_id,
>>>>                  const struct rte_flow_pattern *pattern,
>>>>                  const struct rte_flow_actions *actions);
>>>
>>> I gather this implies that each driver must parse the pattern/action
>>> block and map this onto the hardware. How many rules per second can this
>>> support? I've run into systems that expect a level of service somewhere
>>> around 50k cmds per second. So bulking will help at the message level
>>> but it seems like a lot of overhead to unpack the pattern/action section.
> 
> There is indeed no guarantee on the time taken to create a flow rule, as
> debated with Sugesh (see the full thread):
> 
>  http://dpdk.org/ml/archives/dev/2016-July/043958.html
> 
> I will update the specification accordingly.
> 
> Consider that even 50k cmds per second may not be fast enough. Applications
> always need to have some kind of fallback ready, and the ability to know
> whether a packet has been matched by a rule is a way to help with that.
> 
> In any case, flow rules must be managed from the control path, the data path
> must only handle consequences.

Same as above, let's see; I think it can probably be made fast enough.

> 
>>> One strategy I've used in other systems that worked relatively well
>>> is if the query for the parse graph above returns a key for each node
>>> in the graph, then a single lookup can map the key to a node. It's
>>> unambiguous and then these operations simply become a table lookup.
>>> So to be a bit more concrete this changes the pattern structure in
>>> rte_flow_create() into a  <key,value,mask> tuple where the key is known
>>> by the initial parse graph query. If you reserve a set of well-defined
>>> key values for well known protocols like ethernet, ip, etc. then the
>>> query model also works but the middle layer catches errors in this case
>>> and again the driver only gets known good flows. So something like this,
>>>
>>>   struct rte_flow_pattern {
>>> 	uint32_t priority;
>>> 	uint32_t key;
>>> 	uint32_t value_length;
>>> 	u8 *value;
>>>   }
> 
> I agree that having an integer representing an entire pattern/actions combo
> would be great; however, how do you tell whether you want matched packets to
> be duplicated to queue 6 and redirected to queue 3? This method can be used
> to check if a type of rule is allowed but not whether it is actually
> applicable. You still need to provide the entire pattern/actions description
> to create a flow rule.

In reality it's almost the same as your proposal; it just took me a moment
to see it. The only difference I can see is that adding new headers via the
RAW type only supports fixed-length headers.

To answer your question, the flow_pattern would have to include an action
set as well to give a list of actions to perform. I just didn't include
it here.
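
For illustration, a minimal sketch of what I mean, with hypothetical
names extending the tuple proposal above; "duplicate to queue 6 and
redirect to queue 3" then simply becomes two entries in the action
array:

struct rte_flow_action_entry {
	uint32_t key;           /* action uid from the capability query */
	uint32_t value_length;  /* length of the argument blob */
	uint8_t *value;         /* e.g. target queue index */
};

struct rte_flow_pattern {
	uint32_t priority;
	uint32_t key;
	uint32_t value_length;
	uint8_t *value;
	uint32_t nb_actions;    /* hypothetical action list */
	struct rte_flow_action_entry *actions;
};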

> 
>>> Also if we have multiple tables what do you think about adding a
>>> table_id to the signature. Probably not needed in the first generation
>>> but is likely useful for hardware with multiple tables so that it
>>> would be,
>>>
>>>    rte_flow_create(uint8_t port_id, uint8_t table_id, ...);
> 
> Not sure if I understand the table ID concept. Do you mean in case a device
> supports entirely different sets of features depending on something? (What?)
> 

In many devices we support multiple tables, each with its own size,
match fields and action set. This is useful for building routers, for
example, along with lots of other constructs. The basic idea is that
smashing everything into a single table creates a Cartesian product
problem.
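
As a rough sketch, assuming the table_id variant of rte_flow_create()
suggested above (not part of the current RFC) and patterns/actions
built elsewhere, a router-like design could then keep L2 exact matches
and L3 routes in separate tables instead of one table holding their
cross product:

/* Hypothetical table IDs exposed by the device. */
enum { TABLE_L2_EXACT = 0, TABLE_L3_ROUTE = 1 };

/* dst-MAC rule goes into the small exact-match table... */
struct rte_flow *f1 = rte_flow_create(port_id, TABLE_L2_EXACT,
				      &mac_pattern, &queue_actions);
/* ...while the IPv4 prefix rule goes into the wider route table. */
struct rte_flow *f2 = rte_flow_create(port_id, TABLE_L3_ROUTE,
				      &prefix_pattern, &fwd_actions);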

>>> Finally one other problem we've had which would be great to address
>>> if we are doing a rewrite of the API is adding new protocols to
>>> already deployed DPDK stacks. This is mostly a Linux distribution
>>> problem where you can't easily update DPDK.
>>>
>>> In the prototype header linked in this document it seems adding new
>>> headers requires adding a new enum value in rte_flow_item_type but there
>>> is at least an attempt at a catch-all here,
>>>
>>>> 	/**
>>>> 	 * Matches a string of a given length at a given offset (in bytes),
>>>> 	 * or anywhere in the payload of the current protocol layer
>>>> 	 * (including L2 header if used as the first item in the stack).
>>>> 	 *
>>>> 	 * See struct rte_flow_item_raw.
>>>> 	 */
>>>> 	RTE_FLOW_ITEM_TYPE_RAW,
>>>
>>> Actually this is a nice implementation because it works after the
>>> previous item in the stack correct?
> 
> Yes, this is correct.

Great.

> 
>>> So you can put it after "known"
>>> variable length headers like IP. The limitation is it can't get past
>>> undefined variable length headers.
> 
> RTE_FLOW_ITEM_TYPE_ANY is made for that purpose. Is that what you are
> looking for?
> 

But FLOW_ITEM_TYPE_ANY skips "any" header type, as I understand it. If
we get a new variable-length header in the future we will have to add
a new type, RTE_FLOW_ITEM_TYPE_FOO for example. The RAW type will work
for fixed headers as noted above.
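
To make the fixed-header case concrete, a hedged sketch, assuming
struct rte_flow_item_raw in the prototype exposes roughly a relative
offset, a length and a pattern to compare (field names and pointer
semantics are assumptions here):

/* Match a hypothetical 4-byte fixed header right after the previous
 * item; a new variable-length header would still need its own type. */
static const uint8_t foo_hdr[4] = { 0x01, 0x00, 0x00, 0x02 };

struct rte_flow_item_raw raw = {
	.relative = 1,             /* offset from end of previous item */
	.offset = 0,
	.length = sizeof(foo_hdr),
	.pattern = foo_hdr,
};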

>>> However if you use the above parse
>>> graph reporting from the driver mechanism and the driver always reports
>>> its largest supported graph then we don't have this issue where a new
>>> hardware sku/ucode/etc added support for new headers but we have no
>>> way to deploy it to existing software users without recompiling and
>>> redeploying.
> 
> I really would like to understand if you see a limitation regarding this
> with the specified API, even assuming DPDK is compiled as a shared library
> and thus not part of the user application.
> 

Thanks, this thread was very helpful for me at least. So the summary
for me is: capability queries can be built on top of this API no
problem, but for many existing devices capability queries will not be
able to guarantee flow insertion success due to hardware
quirks/limitations.

The two open items from me are: do we need to support adding new
variable-length headers? And how do we handle multiple tables? I'll
take that up in the other thread.

>>> I looked at the git repo but I only saw the header definition; I guess
>>> the implementation is TBD after there is enough agreement on the
>>> interface?
> 
> Precisely, I intend to update the tree and send a v2 soon (unfortunately did
> not have much time these past few days to work on this).
> 
> Now what if, instead of a seemingly complex parse graph and still in
> addition to the query method, enum values were defined for PMDs to report
> an array of supported items, typical patterns and actions so applications
> can get a quick idea of what devices are capable of without being too
> specific. Something like:
> 
>  enum rte_flow_capability {
>      RTE_FLOW_CAPABILITY_ITEM_ETH,
>      RTE_FLOW_CAPABILITY_PATTERN_ETH_IP_TCP,
>      RTE_FLOW_CAPABILITY_ACTION_ID,
>      ...
>  };
> 
> Although I'm not convinced about the usefulness of this because it would
> have to be maintained separately, it would be easier than building a
> dummy flow rule for simple query purposes.

I'm not sure its necessary either at first.

> 
> The main question I have for you is, do you think the core of the specified
> API is adequate, assuming it can be extended later with new methods?
> 

The above two items are my only opens at this point. I agree with your
summary of my capabilities proposal, namely that it can be added.

.John

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-26 10:07         ` Rahul Lakkireddy
@ 2016-08-03 16:44           ` Adrien Mazarguil
  2016-08-03 19:11             ` John Fastabend
  2016-08-19 21:13           ` John Daley (johndale)
  1 sibling, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-03 16:44 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: John Fastabend, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley,
	Jing Chen, Konstantin Ananyev, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

Replying to everything at once, please see below.

On Tue, Jul 26, 2016 at 03:37:35PM +0530, Rahul Lakkireddy wrote:
> On Monday, July 07/25/16, 2016 at 09:40:02 -0700, John Fastabend wrote:
> > On 16-07-25 04:32 AM, Rahul Lakkireddy wrote:
> > > Hi Adrien,
> > > 
> > > On Thursday, July 07/21/16, 2016 at 19:07:38 +0200, Adrien Mazarguil wrote:
> > >> Hi Rahul,
> > >>
> > >> Please see below.
> > >>
> > >> On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
> > >>> Hi Adrien,
> > >>>
> > >>> The proposal looks very good.  It satisfies most of the features
> > >>> supported by Chelsio NICs.  We are looking for suggestions on exposing
> > >>> more additional features supported by Chelsio NICs via this API.
> > >>>
> > >>> Chelsio NICs have two regions in which filters can be placed -
> > >>> Maskfull and Maskless regions.  As their names imply, maskfull region
> > >>> can accept masks to match a range of values; whereas, maskless region
> > >>> doesn't accept any masks and hence performs stricter exact matches.
> > >>> Filters without masks can also be placed in the maskfull region.  By
> > >>> default, the maskless region has higher priority than the maskfull region.
> > >>> However, the priority between the two regions is configurable.
> > >>
> > >> I understand this configuration affects the entire device. Just to be clear,
> > >> assuming some filters are already configured, are they affected by a change
> > >> of region priority later?
> > >>
> > > 
> > > Both the regions exist at the same time in the device.  Each filter can
> > > either belong to maskfull or the maskless region.
> > > 
> > > The priority is configured at time of filter creation for every
> > > individual filter and cannot be changed while the filter is still
> > > active. If priority needs to be changed for a particular filter then,
> > > it needs to be deleted first and re-created.
> > 
> > Could you model this as two tables and add a table_id to the API? This
> > way user space could populate the table it chooses. We would have to add
> > some capabilities attributes to "learn" if tables support masks or not
> > though.
> > 
> 
> This approach sounds interesting.

Now I understand the idea behind these tables; however, from an application
point of view I still think it's better if the PMD could take care of flow
rule optimizations automatically. Think about it: PMDs have exactly a
single kind of device they know perfectly well how to manage, while
applications want the best possible performance out of any device in the
most generic fashion.

> > I don't see how the PMD can sort this out in any meaningful way and it
> > has to be exposed to the application that has the intelligence to 'know'
> > priorities between masks and non-masks filters. I'm sure you could come
> > up with something but it would be less than ideal in many cases I would
> > guess and we can't have the driver getting priorities wrong or we may
> > not get the correct behavior.

It may be solved by having the PMD maintain SW state to quickly know which
rules are currently created and in what state the device is, so basically
the application doesn't have to perform this work.

This API allows applications to express basic needs such as "redirect
packets matching this pattern to that queue". It must not deal with HW
details and limitations in my opinion. If a request cannot be satisfied,
then the rule cannot be created. PMDs must not expect any help from the
application; otherwise it opens the door to the same issues as the
legacy filtering APIs.

[...]
> > > Unfortunately, our maskfull region is extremely small too compared to
> > > maskless region.
> > > 
> > 
> > To me this means a userspace application would want to pack it
> > carefully to get the full benefit. So you need some mechanism to specify
> > the "region" hence the above table proposal.
> > 
> 
> Right. Makes sense.

I do not agree; applications should not be aware of it. Note this case can
be handled differently, so that rules do not have to be moved back and forth
between both tables. If the first created rule requires a maskfull entry,
then all subsequent rules will be entered into that table. Otherwise no
maskfull entry can be created as long as there is one maskless entry. When
either table is full, no more rules may be added. Would that work for you?
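
In sketch form, with all helper names being hypothetical PMD internals:

static int
pmd_place_rule(struct pmd_state *st, const struct rule *r)
{
	/* The first rule decides which region the device commits to. */
	if (st->nb_rules == 0)
		st->use_maskfull = rule_needs_masks(r);
	else if (rule_needs_masks(r) && !st->use_maskfull)
		return -ENOTSUP; /* maskless entries already present */

	struct hw_table *t = st->use_maskfull ? st->maskfull : st->maskless;

	if (table_full(t))
		return -ENOSPC; /* either table full: no more rules */
	return table_insert(t, r);
}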

> [...]
> > >> Now about this "promisc" match criteria, it can be added as a new meta
> > >> pattern item (4.1.3 Meta item types). Do you want it to be defined from the
> > >> start or add it later with the related code in your PMD?
> > >>
> > > 
> > > It could be added as a meta item.  If there are other interested
> > > parties, it can be added now.  Otherwise, we'll add it with our filtering
> > > related code.
> > > 
> > 
> > hmm I guess by "promisc" here you mean match packets received from the
> > wire before they have been switched by the silicon?
> > 
> 
Match packets received from the wire before they have been switched by the
silicon; this also includes packets not destined for the DUT that were
still received due to the interface being in promisc mode.

I think it's fine, but we'll have to precisely define what happens when a
packet matched with such a pattern is part of a terminating rule. For
instance, if it is duplicated by HW, then the rule cannot be terminating.

[...]
> > > This raises another interesting question.  What should the PMD do
> > > if it has support for only a subset of fields in the particular item?
> > > 
> > > For example, if a rule has been sent to match IP fragmentation along
> > > with several other IPv4 fields, and if the underlying hardware doesn't
> > > support matching based on IP fragmentation, does the PMD reject the
> > > complete rule although it could have done the matching for rest of the
> > > IPv4 fields?
> > 
> > I think it has to fail the command, otherwise user space will not have
> > any way to understand that the full match criteria cannot be met and
> > we will get different behavior for the same applications on different
> > nics depending on hardware feature set. This will most likely break
> > applications so we need the error IMO.
> > 
> 
> Ok. Makes sense.

Yes, I fully agree with this.

> > >>> - Match range of physical ports on the NIC in a single rule via masks.
> > >>>   For ex: match all UDP packets coming on ports 3 and 4 out of 4
> > >>>   ports available on the NIC.
> > >>
> > >> Applications create flow rules per port, I'd thus suggest that the PMD
> > >> should detect identical rules created on different ports and aggregate them
> > >> as a single HW rule automatically.
> > >>
> > >> If you think this approach is not right, the alternative is a meta pattern
> > >> item that provides a list of ports. I'm not sure this is the right approach
> > >> considering it would most likely not be supported by most NICs. Applications
> > >> may not request it explicitly.
> > >>
> > > 
> > > Aggregating via PMD will be expensive operation since it would involve:
> > > - Search of existing filters.
> > > - Deleting those filters.
> > > - Creating a single combined filter.
> > > 
> > > And all of above 3 operations would need to be atomic so as not to
> > > affect existing traffic which is hitting above filters.

Atomicity may not be a problem if the PMD makes sure the new combined rule
is inserted before the others, so they do not need to be removed either.
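
In sketch form (hypothetical helper):

/* The combined rule covers both ports; install it ahead of the
 * existing per-port rules so no packet goes unmatched in between.
 * The old rules are then shadowed and may be removed lazily or kept. */
hw_rule_insert_before(dev, combined_rule, first_existing_rule);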

> > > Adding a
> > > meta item would be a simpler solution here.

Yes, clearly.

> > For this adding a meta-data item seems simplest to me. And if you want
> > to make the default to be only a single port that would maybe make it
> > easier for existing apps to port from flow director. Then if an
> > application cares it can create a list of ports if needed.
> > 
> 
> Agreed.

However, although I'm not opposed to adding dedicated meta items, remember
applications will not automatically benefit from the increased performance
if only a single PMD implements this feature; their maintainers will
probably not bother with it.

> > >>> - Match range of Physical Functions (PFs) on the NIC in a single rule
> > >>>   via masks. For ex: match all traffic coming on several PFs.
> > >>
> > >> The PF and VF pattern items assume there is a single PF associated with a
> > >> DPDK port. VFs are identified with an ID. I basically took the same
> > >> definitions as the existing filter types, perhaps this is not enough for
> > >> Chelsio adapters.
> > >>
> > >> Do you expose more than one PF for a DPDK port?
> > >>
> > >> Anyway, I'd suggest the same approach as above, automatic aggregation of
> > >> rules for performance reasons, otherwise new or updated PF/VF pattern items,
> > >> in which case it would be great if you could provide ideal structure
> > >> definitions for this use case.
> > >>
> > > 
> > > In Chelsio hardware, all the ports of a device are exposed via a single
> > > PF (PF4). There could be many VFs attached to a PF.  Physical NIC functions
> > > are operational on PF4, while VFs can be attached to PFs 0-3.
> > > So, Chelsio hardware is not tied to a one-to-one PF-to-port mapping
> > > assumption.
> > > 
> > > There already seems to be a PF meta-item, but it doesn't seem to accept
> > > any "spec" and "mask" field.  Similarly, the VF meta-item doesn't
> > > seem to accept a "mask" field.  We could probably enable these fields
> > > in the PF and VF meta-items to allow configuration.
> > 
> > Maybe a range field would help here as well? So you could specify a VF
> > range. It might be one of the things to consider adding later though if
> > there is no clear use for it now.
> > 
> 
> VF-value and VF-mask would help to achieve the desired filter.
> VF-mask would also make it possible to specify a range of VF values.

Like John, I think a range or even a list instead of a mask would be better;
the PMD can easily create a mask from that if necessary. The reason is that
we've always had bad experiences with bit-fields: they're always too short
at some point and we would like to avoid having to break the ABI to update
existing pattern items later.
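
For illustration only, a list-based VF meta item might look like this
(names hypothetical, not part of the RFC header); the PMD can derive a
mask from the list when the hardware wants one, while the API itself
stays free of field-width assumptions:

struct rte_flow_item_vf_list {
	uint32_t nb_vfs;        /* number of entries in vf_id[] */
	const uint32_t *vf_id;  /* VF IDs this rule applies to */
};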

Also, while I don't think this is the case yet, perhaps it will be a good
idea for PFs/VFs to have globally unique IDs, just like DPDK ports.

Thanks.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-08-02 18:19       ` John Fastabend
@ 2016-08-03 14:30         ` Adrien Mazarguil
  2016-08-03 18:10           ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-08-03 14:30 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

Hi John,

I'm replying below to both messages.

On Tue, Aug 02, 2016 at 11:19:15AM -0700, John Fastabend wrote:
> On 16-07-23 02:10 PM, John Fastabend wrote:
> > On 16-07-21 12:20 PM, Adrien Mazarguil wrote:
> >> Hi Jerin,
> >>
> >> Sorry, looks like I missed your reply. Please see below.
> >>
> > 
> > Hi Adrien,
> > 
> > Sorry for a bit of a delay, but here are a few comments that may be worth considering.
> > 
> > To start with, I completely agree on the general problem statement and the
> > nice summary of all the current models. Also a good start on this.

Thanks.

> >> Considering that allowed pattern/actions combinations cannot be known in
> >> advance and would result in an impractically large number of capabilities to
> >> expose, a method is provided to validate a given rule from the current
> >> device configuration state without actually adding it (akin to a "dry run"
> >> mode).
> > 
> > Rather than have a query/validate process why did we jump over having an
> > intermediate representation of the capabilities? Here you state it is
> > impractical but we know how to represent parse graphs and the drivers
> > could report their supported parse graph via a single query to a middle
> > layer.
> > 
> > This will actually reduce the msg chatter. Imagine many applications at
> > init time, or boundary cases where a large set of applications come
> > online at once and start banging on the interface all at once; that seems
> > less than ideal.

Well, I also thought about a kind of graph to represent capabilities but
feared the extra complexity would not be worth the trouble, and thus settled
on the query idea. A couple more reasons:

- Capabilities evolve at the same time as devices are configured. For
  example, if a device supports a single RSS context, then a single rule
  with an RSS action may be created. The graph would have to be rewritten
  accordingly and thus queried/parsed again by the application.

- Expressing capabilities at bit granularity (say, for a matching pattern
  item mask) is complex, there is no way to simplify the representation of
  capabilities without either losing information or making the graph more
  complex to parse than simply providing a flow rule from an application
  point of view.

With that in mind, I am not opposed to the idea, both methods could even
coexist, with the query function eventually evolving to become a front-end
to a capability graph. Just remember that I am only defining the
fundamentals for the initial implementation, i.e. how rules are expressed as
patterns/actions and the basic functions to manage them, ideally without
having to redefine them ever.

> A bit more detail on a possible interface for the capabilities query,
> 
> One way I've used to describe these graphs from driver to software
> stacks is to use a set of structures to build the graph. For fixed
> graphs this could just be a *.h file; for programmable hardware (typically
> coming from a fw update on nics) the driver can read the parser details
> out of firmware and render the structures.

I understand; however, I think this approach may be too low-level to express
all the possible combinations. This graph would have to include possible
actions for each possible pattern, all while considering that some actions
are not possible with some patterns and that there are exclusive actions.

Also while memory consumption is not really an issue, such a graph may be
huge. It could take a while for the PMD to update it when adding a rule
impacting capabilities.

> I've done this two ways: one is to define all the fields in their
> own structures using something like,
> 
> struct field {
> 	char *name;
> 	u32 uid;
> 	u32 bitwidth;
> };
> 
> This gives a unique id (uid) for each field along with its
> width and a user friendly name. The fields are organized into
> headers via a header structure,
> 
> struct header_node {
> 	char *name;
> 	u32 uid;
> 	u32 *fields;
> 	struct parse_graph *jump;
> };
> 
> Each node has a unique id and then a list of fields, where 'fields'
> is a list of uids of fields. It's also easy enough to embed the field
> struct in the header_node if that is simpler; it's really a style
> question.
> 
> The 'struct parse_graph' gives the list of edges from this header node
> to other header nodes. Using a parse graph structure defined
> 
> struct parse_graph {
> 	struct field_reference ref;
> 	__u32 jump_uid;
> };
> 
> Again as a matter of style you can embed the parse graph in the header
> node as I did above or do it as its own object.
> 
> The field_reference noted below gives the id of the field and the value
> e.g. the tuple (ipv4.protocol, 6) then jump_uid would be the uid of TCP.
> 
> struct field_reference {
> 	__u32 header_uid;
> 	__u32 field_uid;
> 	__u32 mask_type;
> 	__u32 type;
> 	__u8  *value;
> 	__u8  *mask;
> };
> 
> The cost of doing all this is some additional overhead at init time. But
> building generic functions over this and having a set of predefined
> uids for well-known protocols such as ip, udp, tcp, etc. helps. What you
> get for the cost is a few things that I think are worth it. (i) Now
> new protocols can be added/removed without recompiling DPDK; (ii) a
> software package can use the capability query to verify the required
> protocols are off-loadable vs a possibly large set of test queries;
> (iii) when we do the programming of the device we can provide a tuple
> (table-uid, header-uid, field-uid, value, mask, priority) and the
> middle layer "knowing" the above graph can verify the command so
> drivers only ever see "good" commands; (iv) finally it should be
> faster in terms of cmds per second because the drivers can map the
> tuple (table, header, field, priority) to a slot efficiently vs
> parsing.
> 
> IMO points (iii) and (iv) will in practice make the code much simpler
> because we can maintain a common middle layer and not require parsing
> by drivers, making each driver simpler by abstracting into a common
> layer.

Before answering your points, let's consider how applications are going to
be written. Not only do devices not support all possible pattern/actions
combinations, they also have memory constraints. Whichever method
applications use to determine if a flow rule is supported, at some point
they won't be able to add any more due to device limitations.

Sane applications designed to work regardless of the underlying device won't
simply call abort() at this point but provide a software fallback
instead. My bet is that applications will provide one every time a rule
cannot be added for any reason; they won't even bother to query capabilities
except perhaps for a very small subset, as in "does this device support the
ID action at all?".

Applications that really want/need to know at init time whether all the
rules they may want to possibly create are supported will spend about the
same time in both cases (query or graph). For queries, by iterating on a
list of typical rules. For a graph, by walking through it. Either way, it
won't be done later from the data path.

I think that for an application maintainer, writing or even generating a set
of typical rules will also be easier than walking through a graph. It should
also be easier on the PMD side.
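
As a sketch of that "list of typical rules" approach, assuming
rte_flow_validate() mirrors rte_flow_create()'s signature, returns a
negative value for unsupported rules, and the patterns/actions are
built elsewhere:

static const struct {
	const struct rte_flow_pattern *pattern;
	const struct rte_flow_actions *actions;
} typical[] = {
	{ &eth_pattern, &queue_actions },
	{ &eth_ipv4_tcp_pattern, &queue_actions },
};

unsigned int i;

for (i = 0; i != RTE_DIM(typical); ++i)
	if (rte_flow_validate(port_id, typical[i].pattern,
			      typical[i].actions) < 0)
		enable_sw_fallback(i); /* hypothetical application hook */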

For individual points:

(i) should be doable with the query API without recompiling DPDK as well,
given that avoiding API/ABI breakage is part of the requirements. If
you think there is a problem regarding this, can you provide a specific
example?

(ii) as described above, I think this use case won't be very common in the
wild, except for applications designed for a specific device and then they
will probably know enough about it to skip the query step entirely. If time
must be spent anyway, it will be in the control path at initialization
time.

(iii) misses the fact that capabilities evolve as flow rules get added;
there is no way for PMDs to only see "valid" rules, also because device
limitations may prevent adding an otherwise valid rule.

(iv) could be true if not for the same reason as (iii). The graph would have
to be verified again before adding another rule. Note that PMD maintainers
are encouraged to make their query function as fast as possible, they may
rely on static data internally for this as well.

> > Worse in my opinion it requires all drivers to write mostly duplicating
> > validation code where a common layer could easily do this if every
> > driver reported a common data structure representing its parse graph
> > instead. The nice fallout of this initial effort upfront is the driver
> > no longer needs to do error handling/checking/etc and can assume all
> > rules are correct and valid. It makes driver code much simpler to
> > support. And IMO at least by doing this we get some other nice benefits
> > described below.

About duplicated code, my usual reply is that DPDK will provide internal
helper methods to assist PMDs with rules management/parsing/etc. These are
not discussed in the specification because I wanted everyone to agree to the
application side of things first, and it is difficult to know how much
assistance PMDs might need without an initial implementation.

I think this private API will be built at the same time as support is added
to PMDs and maintainers notice generic code that can be shared.
Documentation may be written later once things start to settle down.

> > Another related question is about performance.
> > 
> >> Creation
> >> ~~~~~~~~
> >>
> >> Creating a flow rule is similar to validating one, except the rule is
> >> actually created.
> >>
> >> ::
> >>
> >>  struct rte_flow *
> >>  rte_flow_create(uint8_t port_id,
> >>                  const struct rte_flow_pattern *pattern,
> >>                  const struct rte_flow_actions *actions);
> > 
> > I gather this implies that each driver must parse the pattern/action
> > block and map this onto the hardware. How many rules per second can this
> > support? I've run into systems that expect a level of service somewhere
> > around 50k cmds per second. So bulking will help at the message level
> > but it seems like a lot of overhead to unpack the pattern/action section.

There is indeed no guarantee on the time taken to create a flow rule, as
debated with Sugesh (see the full thread):

 http://dpdk.org/ml/archives/dev/2016-July/043958.html

I will update the specification accordingly.

Consider that even 50k cmds per second may not be fast enough. Applications
always need to have some kind of fallback ready, and the ability to know
whether a packet has been matched by a rule is a way to help with that.

In any case, flow rules must be managed from the control path, the data path
must only handle consequences.
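
To illustrate that separation, a hedged sketch assuming the ID action
tags matched packets with a value readable from the mbuf (the accessor
and surrounding names below are hypothetical):

/* The control path installed the rules; the worker only reacts to the
 * tag a rule left behind, or falls back to software classification. */
uint16_t i;

for (i = 0; i != nb_rx; ++i) {
	struct rte_mbuf *m = pkts[i];

	if (mbuf_flow_id(m) == FAST_PATH_ID) /* hypothetical accessor */
		handle_fast(m);
	else
		classify_in_software(m); /* software fallback */
}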

> > One strategy I've used in other systems that worked relatively well
> > is if the query for the parse graph above returns a key for each node
> > in the graph then a single lookup can map the key to a node. It's
> > unambiguous and then these operations simply become a table lookup.
> > So to be a bit more concrete this changes the pattern structure in
> > rte_flow_create() into a <key,value,mask> tuple where the key is known
> > by the initial parse graph query. If you reserve a set of well-defined
> > key values for well known protocols like ethernet, ip, etc. then the
> > query model also works but the middle layer catches errors in this case
> > and again the driver only gets known good flows. So something like this,
> > 
> >   struct rte_flow_pattern {
> > 	uint32_t priority;
> > 	uint32_t key;
> > 	uint32_t value_length;
> > 	u8 *value;
> >   }

I agree that having an integer representing an entire pattern/actions combo
would be great; however, how do you tell whether you want matched packets to
be duplicated to queue 6 and redirected to queue 3? This method can be used
to check if a type of rule is allowed but not whether it is actually
applicable. You still need to provide the entire pattern/actions description
to create a flow rule.

> > Also if we have multiple tables what do you think about adding a
> > table_id to the signature. Probably not needed in the first generation
> > but is likely useful for hardware with multiple tables so that it
> > would be,
> > 
> >    rte_flow_create(uint8_t port_id, uint8_t table_id, ...);

Not sure if I understand the table ID concept. Do you mean in case a device
supports entirely different sets of features depending on something? (What?)

> > Finally one other problem we've had which would be great to address
> > if we are doing a rewrite of the API is adding new protocols to
> > already deployed DPDK stacks. This is mostly a Linux distribution
> > problem where you can't easily update DPDK.
> > 
> > In the prototype header linked in this document it seems adding new
> > headers requires adding a new enum value in rte_flow_item_type but there
> > is at least an attempt at a catch-all here,
> > 
> >> 	/**
> >> 	 * Matches a string of a given length at a given offset (in bytes),
> >> 	 * or anywhere in the payload of the current protocol layer
> >> 	 * (including L2 header if used as the first item in the stack).
> >> 	 *
> >> 	 * See struct rte_flow_item_raw.
> >> 	 */
> >> 	RTE_FLOW_ITEM_TYPE_RAW,
> > 
> > Actually this is a nice implementation because it works after the
> > previous item in the stack correct?

Yes, this is correct.

> > So you can put it after "known"
> > variable length headers like IP. The limitation is it can't get past
> > undefined variable length headers.

RTE_FLOW_ITEM_TYPE_ANY is made for that purpose. Is that what you are
looking for?

> > However if you use the above parse
> > graph reporting from the driver mechanism and the driver always reports
> > its largest supported graph then we don't have this issue where a new
> > hardware sku/ucode/etc added support for new headers but we have no
> > way to deploy it to existing software users without recompiling and
> > redeploying.

I really would like to understand if you see a limitation regarding this
with the specified API, even assuming DPDK is compiled as a shared library
and thus not part of the user application.

> > I looked at the git repo but I only saw the header definition; I guess
> > the implementation is TBD after there is enough agreement on the
> > interface?

Precisely, I intend to update the tree and send a v2 soon (unfortunately did
not have much time these past few days to work on this).

Now what if, instead of a seemingly complex parse graph and still in
addition to the query method, enum values were defined for PMDs to report
an array of supported items, typical patterns and actions so applications
can get a quick idea of what devices are capable of without being too
specific. Something like:

 enum rte_flow_capability {
     RTE_FLOW_CAPABILITY_ITEM_ETH,
     RTE_FLOW_CAPABILITY_PATTERN_ETH_IP_TCP,
     RTE_FLOW_CAPABILITY_ACTION_ID,
     ...
 };

Although I'm not convinced about the usefulness of this because it would
have to be maintained separately, it would be easier than building a
dummy flow rule for simple query purposes.

The main question I have for you is, do you think the core of the specified
API is adequate, assuming it can be extended later with new methods?

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-23 21:10     ` John Fastabend
@ 2016-08-02 18:19       ` John Fastabend
  2016-08-03 14:30         ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-08-02 18:19 UTC (permalink / raw)
  To: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

On 16-07-23 02:10 PM, John Fastabend wrote:
> On 16-07-21 12:20 PM, Adrien Mazarguil wrote:
>> Hi Jerin,
>>
>> Sorry, looks like I missed your reply. Please see below.
>>
> 
> Hi Adrien,
> 
> Sorry for a bit of a delay, but here are a few comments that may be worth considering.
> 
> To start with, I completely agree on the general problem statement and the
> nice summary of all the current models. Also a good start on this.
> 
>>
>> Considering that allowed pattern/actions combinations cannot be known in
>> advance and would result in an impractically large number of capabilities to
>> expose, a method is provided to validate a given rule from the current
>> device configuration state without actually adding it (akin to a "dry run"
>> mode).
> 
> Rather than have a query/validate process why did we jump over having an
> intermediate representation of the capabilities? Here you state it is
> impractical but we know how to represent parse graphs and the drivers
> could report their supported parse graph via a single query to a middle
> layer.
> 
> This will actually reduce the msg chatter. Imagine many applications at
> init time, or boundary cases where a large set of applications come
> online at once and start banging on the interface all at once; that seems
> less than ideal.
> 

A bit more detail on a possible interface for the capabilities query,

One way I've used to describe these graphs from driver to software
stacks is to use a set of structures to build the graph. For fixed
graphs this could just be a *.h file; for programmable hardware (typically
coming from a fw update on nics) the driver can read the parser details
out of firmware and render the structures.

I've done this two ways: one is to define all the fields in their
own structures using something like,

struct field {
	char *name;
	u32 uid;
	u32 bitwidth;
};

This gives a unique id (uid) for each field along with its
width and a user friendly name. The fields are organized into
headers via a header structure,

struct header_node {
	char *name;
	u32 uid;
	u32 *fields;
	struct parse_graph *jump;
};

Each node has a unique id and then a list of fields, where 'fields'
is a list of uids of fields. It's also easy enough to embed the field
struct in the header_node if that is simpler; it's really a style
question.

The 'struct parse_graph' gives the list of edges from this header node
to other header nodes. Using a parse graph structure defined

struct parse_graph {
	struct field_reference ref;
	__u32 jump_uid;
};

Again as a matter of style you can embed the parse graph in the header
node as I did above or do it as its own object.

The field_reference noted below gives the id of the field and the value
e.g. the tuple (ipv4.protocol, 6) then jump_uid would be the uid of TCP.

struct field_reference {
	__u32 header_uid;
	__u32 field_uid;
	__u32 mask_type;
	__u32 type;
	__u8  *value;
	__u8  *mask;
};
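
As a quick illustration (uid values arbitrary, kernel-style typedefs
as in the structs above; a real definition would also need entry
counts or terminators), a tiny ethernet -> ipv4 edge could be rendered
like this:

static struct field eth_fields[] = {
	{ .name = "dst_mac",   .uid = 1, .bitwidth = 48 },
	{ .name = "src_mac",   .uid = 2, .bitwidth = 48 },
	{ .name = "ethertype", .uid = 3, .bitwidth = 16 },
};

static __u8 ethertype_ipv4[] = { 0x08, 0x00 };

/* Jump to the ipv4 node (uid 11) when ethertype == 0x0800. */
static struct parse_graph eth_jumps[] = {
	{ .ref = { .header_uid = 10, .field_uid = 3,
		   .value = ethertype_ipv4 },
	  .jump_uid = 11 },
};

static struct header_node eth_node = {
	.name = "ethernet",
	.uid = 10,
	.fields = (u32 []){ 1, 2, 3 }, /* uids of the fields above */
	.jump = eth_jumps,
};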

The cost of doing all this is some additional overhead at init time. But
building generic functions over this and having a set of predefined
uids for well-known protocols such as ip, udp, tcp, etc. helps. What you
get for the cost is a few things that I think are worth it. (i) Now
new protocols can be added/removed without recompiling DPDK; (ii) a
software package can use the capability query to verify the required
protocols are off-loadable vs a possibly large set of test queries;
(iii) when we do the programming of the device we can provide a tuple
(table-uid, header-uid, field-uid, value, mask, priority) and the
middle layer "knowing" the above graph can verify the command so
drivers only ever see "good" commands; (iv) finally it should be
faster in terms of cmds per second because the drivers can map the
tuple (table, header, field, priority) to a slot efficiently vs
parsing.

IMO points (iii) and (iv) will in practice make the code much simpler
because we can maintain a common middle layer and not require parsing
by drivers, making each driver simpler by abstracting into a common
layer.
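
A sketch of the point (iv) fast path, with all names hypothetical
driver internals; once the middle layer has validated a command the
driver resolves the tuple straight to a slot instead of parsing a
pattern:

static int
drv_program(struct drv *d, u32 table, u32 header, u32 field,
	    const __u8 *value, const __u8 *mask, u32 prio)
{
	/* slot_map is built once, when capabilities are reported. */
	struct slot *s = d->slot_map[table][header][field];

	if (s == NULL)
		return -EINVAL; /* middle layer should have caught this */
	return hw_write_slot(d, s, value, mask, prio);
}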

> Worse in my opinion it requires all drivers to write mostly duplicating
> validation code where a common layer could easily do this if every
> driver reported a common data structure representing its parse graph
> instead. The nice fallout of this initial effort upfront is the driver
> no longer needs to do error handling/checking/etc and can assume all
> rules are correct and valid. It makes driver code much simpler to
> support. And IMO at least by doing this we get some other nice benefits
> described below.
> 
> Another related question is about performance.
> 
>> Creation
>> ~~~~~~~~
>>
>> Creating a flow rule is similar to validating one, except the rule is
>> actually created.
>>
>> ::
>>
>>  struct rte_flow *
>>  rte_flow_create(uint8_t port_id,
>>                  const struct rte_flow_pattern *pattern,
>>                  const struct rte_flow_actions *actions);
> 
> I gather this implies that each driver must parse the pattern/action
> block and map this onto the hardware. How many rules per second can this
> support? I've run into systems that expect a level of service somewhere
> around 50k cmds per second. So bulking will help at the message level
> but it seems like a lot of overhead to unpack the pattern/action section.
> 
> One strategy I've used in other systems that worked relatively well
> is if the query for the parse graph above returns a key for each node
> in the graph then a single lookup can map the key to a node. It's
> unambiguous and then these operations simply become a table lookup.
> So to be a bit more concrete this changes the pattern structure in
> rte_flow_create() into a <key,value,mask> tuple where the key is known
> by the initial parse graph query. If you reserve a set of well-defined
> key values for well known protocols like ethernet, ip, etc. then the
> query model also works but the middle layer catches errors in this case
> and again the driver only gets known good flows. So something like this,
> 
>   struct rte_flow_pattern {
> 	uint32_t priority;
> 	uint32_t key;
> 	uint32_t value_length;
> 	u8 *value;
>   }
> 
> Also if we have multiple tables what do you think about adding a
> table_id to the signature. Probably not needed in the first generation
> but is likely useful for hardware with multiple tables so that it
> would be,
> 
>    rte_flow_create(uint8_t port_id, uint8_t table_id, ...);
> 
> Finally one other problem we've had which would be great to address
> if we are doing a rewrite of the API is adding new protocols to
> already deployed DPDK stacks. This is mostly a Linux distribution
> problem where you can't easily update DPDK.
> 
> In the prototype header linked in this document it seems adding new
> headers requires adding a new enum value in rte_flow_item_type but there
> is at least an attempt at a catch-all here,
> 
>> 	/**
>> 	 * Matches a string of a given length at a given offset (in bytes),
>> 	 * or anywhere in the payload of the current protocol layer
>> 	 * (including L2 header if used as the first item in the stack).
>> 	 *
>> 	 * See struct rte_flow_item_raw.
>> 	 */
>> 	RTE_FLOW_ITEM_TYPE_RAW,
> 
> Actually this is a nice implementation because it works after the
> previous item in the stack correct? So you can put it after "known"
> variable length headers like IP. The limitation is it can't get past
> undefined variable length headers. However if you use the above parse
> graph reporting from the driver mechanism and the driver always reports
> its largest supported graph then we don't have this issue where a new
> hardware sku/ucode/etc added support for new headers but we have no
> way to deploy it to existing software users without recompiling and
> redeploying.
> 
> I looked at the git repo but I only saw the header definition; I guess
> the implementation is TBD after there is enough agreement on the
> interface?
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-25 16:40       ` John Fastabend
@ 2016-07-26 10:07         ` Rahul Lakkireddy
  2016-08-03 16:44           ` Adrien Mazarguil
  2016-08-19 21:13           ` John Daley (johndale)
  0 siblings, 2 replies; 54+ messages in thread
From: Rahul Lakkireddy @ 2016-07-26 10:07 UTC (permalink / raw)
  To: John Fastabend, Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley, Jing Chen,
	Konstantin Ananyev, Matej Vido, Alejandro Lucero, Sony Chacko,
	Jerin Jacob, Pablo de Lara, Olga Shern, Kumar A S,
	Nirranjan Kirubaharan, Indranil Choudhury

On Monday, July 07/25/16, 2016 at 09:40:02 -0700, John Fastabend wrote:
> On 16-07-25 04:32 AM, Rahul Lakkireddy wrote:
> > Hi Adrien,
> > 
> > On Thursday, July 07/21/16, 2016 at 19:07:38 +0200, Adrien Mazarguil wrote:
> >> Hi Rahul,
> >>
> >> Please see below.
> >>
> >> On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
> >>> Hi Adrien,
> >>>
> >>> The proposal looks very good.  It satisfies most of the features
> >>> supported by Chelsio NICs.  We are looking for suggestions on exposing
> >>> more additional features supported by Chelsio NICs via this API.
> >>>
> >>> Chelsio NICs have two regions in which filters can be placed -
> >>> Maskfull and Maskless regions.  As their names imply, maskfull region
> >>> can accept masks to match a range of values; whereas, maskless region
> >>> doesn't accept any masks and hence performs stricter exact matches.
> >>> Filters without masks can also be placed in the maskfull region.  By
> >>> default, the maskless region has higher priority than the maskfull region.
> >>> However, the priority between the two regions is configurable.
> >>
> >> I understand this configuration affects the entire device. Just to be clear,
> >> assuming some filters are already configured, are they affected by a change
> >> of region priority later?
> >>
> > 
> > Both the regions exist at the same time in the device.  Each filter can
> > either belong to maskfull or the maskless region.
> > 
> > The priority is configured at time of filter creation for every
> > individual filter and cannot be changed while the filter is still
> > active. If priority needs to be changed for a particular filter then,
> > it needs to be deleted first and re-created.
> 
> Could you model this as two tables and add a table_id to the API? This
> way user space could populate the table it chooses. We would have to add
> some capabilities attributes to "learn" if tables support masks or not
> though.
> 

This approach sounds interesting.

> I don't see how the PMD can sort this out in any meaningful way and it
> has to be exposed to the application that has the intelligence to 'know'
> priorities between masks and non-masks filters. I'm sure you could come
> up with something but it would be less than ideal in many cases I would
> guess and we can't have the driver getting priorities wrong or we may
> not get the correct behavior.
> 
> > 
> >>> Please suggest on how we can let the apps configure in which region
> >>> filters must be placed and set the corresponding priority accordingly
> >>> via this API.
> >>
> >> Okay. Applications, like customers, are always right.
> >>
> >> With this in mind, PMDs are not allowed to break existing flow rules, and
> >> face two options when applications provide a flow specification that would
> >> break an existing rule:
> >>
> >> - Refuse to create it (easiest).
> >>
> >> - Find a workaround to apply it anyway (possibly quite complicated).
> >>
> >> The reason you have these two regions is performance right? Otherwise I'd
> >> just say put everything in the maskfull region.
> >>
> > 
> > Unfortunately, our maskfull region is extremely small too compared to
> > maskless region.
> > 
> 
> To me this means a userspace application would want to pack it
> carefully to get the full benefit. So you need some mechanism to specify
> the "region" hence the above table proposal.
> 

Right. Makes sense.

[...]
> >> Now about this "promisc" match criteria, it can be added as a new meta
> >> pattern item (4.1.3 Meta item types). Do you want it to be defined from the
> >> start or add it later with the related code in your PMD?
> >>
> > 
> > It could be added as a meta item.  If there are other interested
> > parties, it can be added now.  Otherwise, we'll add it with our filtering
> > related code.
> > 
> 
> hmm I guess by "promisc" here you mean match packets received from the
> wire before they have been switched by the silicon?
> 

Match packets received from the wire before they have been switched by the
silicon; this also includes packets not destined for the DUT that were
still received due to the interface being in promisc mode.

> >>> - Match FCoE packets.
> >>
> >> The "oE" part is covered by the ETH item (using the proper Ethertype), but a
> >> new pattern item should probably be added for the FC protocol itself. As
> >> above, I suggest adding it later.
> >>
> > 
> > Same here.
> > 
> >>> - Match IP Fragmented packets.
> >>
> >> It seems that I missed quite a few fields in the IPv4 item definition
> >> (struct rte_flow_item_ipv4). It should be in there.
> >>
> > 
> > This raises another interesting question.  What should the PMD do
> > if it has support for only a subset of fields in the particular item?
> > 
> > For example, if a rule has been sent to match IP fragmentation along
> > with several other IPv4 fields, and if the underlying hardware doesn't
> > support matching based on IP fragmentation, does the PMD reject the
> > complete rule although it could have done the matching for rest of the
> > IPv4 fields?
> 
> I think it has to fail the command, otherwise user space will not have
> any way to understand that the full match criteria cannot be met and
> we will get different behavior for the same applications on different
> nics depending on hardware feature set. This will most likely break
> applications so we need the error IMO.
> 

Ok. Makes sense.

> > 
> >>> - Match range of physical ports on the NIC in a single rule via masks.
> >>>   For ex: match all UDP packets coming on ports 3 and 4 out of 4
> >>>   ports available on the NIC.
> >>
> >> Applications create flow rules per port, I'd thus suggest that the PMD
> >> should detect identical rules created on different ports and aggregate them
> >> as a single HW rule automatically.
> >>
> >> If you think this approach is not right, the alternative is a meta pattern
> >> item that provides a list of ports. I'm not sure this is the right approach
> >> considering it would most likely not be supported by most NICs. Applications
> >> may not request it explicitly.
> >>
> > 
> > Aggregating via PMD will be expensive operation since it would involve:
> > - Search of existing filters.
> > - Deleting those filters.
> > - Creating a single combined filter.
> > 
> > And all of above 3 operations would need to be atomic so as not to
> > affect existing traffic which is hitting above filters. Adding a
> > meta item would be a simpler solution here.
> > 
> 
> For this adding a meta-data item seems simplest to me. And if you want
> to make the default to be only a single port that would maybe make it
> easier for existing apps to port from flow director. Then if an
> application cares it can create a list of ports if needed.
> 

Agreed.

> >>> - Match range of Physical Functions (PFs) on the NIC in a single rule
> >>>   via masks. For ex: match all traffic coming on several PFs.
> >>
> >> The PF and VF pattern items assume there is a single PF associated with a
> >> DPDK port. VFs are identified with an ID. I basically took the same
> >> definitions as the existing filter types, perhaps this is not enough for
> >> Chelsio adapters.
> >>
> >> Do you expose more than one PF for a DPDK port?
> >>
> >> Anyway, I'd suggest the same approach as above, automatic aggregation of
> >> rules for performance reasons, otherwise new or updated PF/VF pattern items,
> >> in which case it would be great if you could provide ideal structure
> >> definitions for this use case.
> >>
> > 
> > In Chelsio hardware, all the ports of a device are exposed via a single
> > PF (PF4). There could be many VFs attached to a PF.  Physical NIC functions
> > are operational on PF4, while VFs can be attached to PFs 0-3.
> > So, Chelsio hardware is not tied to a one-to-one PF-to-port mapping
> > assumption.
> > 
> > There already seems to be a PF meta-item, but it doesn't seem to accept
> > any "spec" and "mask" field.  Similarly, the VF meta-item doesn't
> > seem to accept a "mask" field.  We could probably enable these fields
> > in the PF and VF meta-items to allow configuration.
> 
> Maybe a range field would help here as well? So you could specify a VF
> range. It might be one of the things to consider adding later though if
> there is no clear use for it now.
> 

VF-value and VF-mask would help to achieve the desired filter.
VF-mask would also make it possible to specify a range of VF values.

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-25 11:32     ` Rahul Lakkireddy
@ 2016-07-25 16:40       ` John Fastabend
  2016-07-26 10:07         ` Rahul Lakkireddy
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-07-25 16:40 UTC (permalink / raw)
  To: Rahul Lakkireddy, Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley, Jing Chen,
	Konstantin Ananyev, Matej Vido, Alejandro Lucero, Sony Chacko,
	Jerin Jacob, Pablo de Lara, Olga Shern, Kumar Sanghvi,
	Nirranjan Kirubaharan, Indranil Choudhury

On 16-07-25 04:32 AM, Rahul Lakkireddy wrote:
> Hi Adrien,
> 
> On Thursday, July 07/21/16, 2016 at 19:07:38 +0200, Adrien Mazarguil wrote:
>> Hi Rahul,
>>
>> Please see below.
>>
>> On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
>>> Hi Adrien,
>>>
>>> The proposal looks very good.  It satisfies most of the features
>>> supported by Chelsio NICs.  We are looking for suggestions on exposing
>>> more additional features supported by Chelsio NICs via this API.
>>>
>>> Chelsio NICs have two regions in which filters can be placed -
>>> Maskfull and Maskless regions.  As their names imply, maskfull region
>>> can accept masks to match a range of values; whereas, maskless region
>>> doesn't accept any masks and hence performs stricter exact matches.
>>> Filters without masks can also be placed in the maskfull region.  By
>>> default, the maskless region has higher priority than the maskfull region.
>>> However, the priority between the two regions is configurable.
>>
>> I understand this configuration affects the entire device. Just to be clear,
>> assuming some filters are already configured, are they affected by a change
>> of region priority later?
>>
> 
> Both the regions exist at the same time in the device.  Each filter can
> either belong to maskfull or the maskless region.
> 
> The priority is configured at time of filter creation for every
> individual filter and cannot be changed while the filter is still
> active. If priority needs to be changed for a particular filter then,
> it needs to be deleted first and re-created.

Could you model this as two tables and add a table_id to the API? This
way user space could populate the table it chooses. We would have to add
some capabilities attributes to "learn" if tables support masks or not
though.

I don't see how the PMD can sort this out in any meaningful way and it
has to be exposed to the application that has the intelligence to 'know'
priorities between masks and non-masks filters. I'm sure you could come
up with something but it would be less than ideal in many cases I would
guess and we can't have the driver getting priorities wrong or we may
not get the correct behavior.

> 
>>> Please suggest on how we can let the apps configure in which region
>>> filters must be placed and set the corresponding priority accordingly
>>> via this API.
>>
>> Okay. Applications, like customers, are always right.
>>
>> With this in mind, PMDs are not allowed to break existing flow rules, and
>> face two options when applications provide a flow specification that would
>> break an existing rule:
>>
>> - Refuse to create it (easiest).
>>
>> - Find a workaround to apply it anyway (possibly quite complicated).
>>
>> The reason you have these two regions is performance right? Otherwise I'd
>> just say put everything in the maskfull region.
>>
> 
> Unfortunately, our maskfull region is extremely small too compared to
> maskless region.
> 

To me this means a userspace application would want to pack it
carefully to get the full benefit. So you need some mechanism to specify
the "region" hence the above table proposal.

>> PMDs are allowed to rearrange existing rules or change device parameters as
>> long as existing constraints are satisfied. In my opinion it does not matter
>> which region has the highest default priority. Applications always want the
>> best performance so the first created rule should be in the most appropriate
>> region.
>>
>> If a subsequent rule requires it to be in the other region but the
>> application specified the wrong priority for this to work, then the PMD can
>> either choose to swap region priorities on the device (assuming it does not
>> affect other rules), or destroy and recreate the original rule in a way that
>> satisfies all constraints (i.e. moving conflicting rules from the maskless
>> region to the maskfull one).
>>
>> Going further, when subsequent rules get destroyed the PMD should ideally
>> move maskfull rules back into the maskless region for better
>> performance.
>>
>> This is only a suggestion. PMDs have the right to say "no" at some point.
>>
> 
> Filter search and deletion are expensive operations and they need to be
> atomic in order to not affect existing traffic already hitting the
> filters.  Also, Chelsio hardware can support up to ~500,000 maskless
> region entries.  So, the cost of search becomes too high for the PMD
> when there is a large number of filter entries present.
> 
> In my opinion, rather than the PMD deciding on priority by itself, it would
> be simpler if the apps have the flexibility to configure the
> priority between the regions by themselves?
> 

Agreed, I think this will be a common problem, especially as the hardware
tables get bigger and support masks, where priority becomes important to
distinguish between multiple matching rules.

+1 for having a priority field.
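
For instance, using the priority field from the rte_flow_pattern
proposal earlier in this thread (values arbitrary, assuming lower
means matched first), overlapping rules stay deterministic no matter
which region the PMD places them in:

/* dst 10.1.1.0/24 -> queue 3; must win over the broader rule. */
rule_24.priority = 0;
/* dst 10.1.0.0/16 -> queue 7; consulted second. */
rule_16.priority = 1;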

>> More important in my opinion is to make sure applications can create a given
>> set of flow rules in any order. If rules a/b/c can be created, then it won't
>> make sense from an application point of view if c/a/b for some reason cannot
>> and the PMD maintainers will rightfully get a bug report.

[...]

> 
>> Now about this "promisc" match criteria, it can be added as a new meta
>> pattern item (4.1.3 Meta item types). Do you want it to be defined from the
>> start or add it later with the related code in your PMD?
>>
> 
> It could be added as a meta item.  If there are other interested
> parties, it can be added now.  Otherwise, we'll add it with our filtering
> related code.
> 

hmm I guess by "promisc" here you mean match packets received from the
wire before they have been switched by the silicon?

>>> - Match FCoE packets.
>>
>> The "oE" part is covered by the ETH item (using the proper Ethertype), but a
>> new pattern item should probably be added for the FC protocol itself. As
>> above, I suggest adding it later.
>>
> 
> Same here.
> 
>>> - Match IP Fragmented packets.
>>
>> It seems that I missed quite a few fields in the IPv4 item definition
>> (struct rte_flow_item_ipv4). It should be in there.
>>
> 
> This raises another interesting question.  What should the PMD do
> if it has support for only a subset of fields in the particular item?
> 
> For example, if a rule has been sent to match IP fragmentation along
> with several other IPv4 fields, and if the underlying hardware doesn't
> support matching based on IP fragmentation, does the PMD reject the
> complete rule although it could have done the matching for the rest of the
> IPv4 fields?

I think it has to fail the command, otherwise user space will not have
any way to understand that the full match criteria cannot be met and
we will get different behavior for the same applications on different
NICs depending on the hardware feature set. This will most likely break
applications, so we need the error IMO.
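
From the application side, that contract could look like this minimal
sketch, assuming rte_flow_validate() mirrors rte_flow_create()'s
signature and that a hypothetical "fragment" field gets added to
struct rte_flow_item_ipv4:

  /* Sketch: probe support with a dry run before creating the rule.
   * setup_sw_fallback() is hypothetical. */
  struct rte_flow *
  create_or_fallback(uint8_t port_id,
                     const struct rte_flow_pattern *pattern,
                     const struct rte_flow_actions *actions)
  {
          if (rte_flow_validate(port_id, pattern, actions)) {
                  /* HW cannot match every specified field (e.g. IP
                   * fragmentation): reject/fall back, don't silently
                   * match on a subset of the fields. */
                  setup_sw_fallback(pattern, actions);
                  return NULL;
          }
          return rte_flow_create(port_id, pattern, actions);
  }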

> 
>>> - Match range of physical ports on the NIC in a single rule via masks.
>>>   For ex: match all UDP packets coming on ports 3 and 4 out of 4
>>>   ports available on the NIC.
>>
>> Applications create flow rules per port, I'd thus suggest that the PMD
>> should detect identical rules created on different ports and aggregate them
>> as a single HW rule automatically.
>>
>> If you think this approach is not right, the alternative is a meta pattern
>> item that provides a list of ports. I'm not sure this is the right approach
>> considering it would most likely not be supported by most NICs. Applications
>> may not request it explicitly.
>>
> 
> Aggregating via the PMD will be an expensive operation since it would involve:
> - Search of existing filters.
> - Deleting those filters.
> - Creating a single combined filter.
> 
> And all of the above 3 operations would need to be atomic so as not to
> affect existing traffic which is hitting the above filters. Adding a
> meta item would be a simpler solution here.
> 

For this, adding a meta-data item seems simplest to me. And making the
default a single port would maybe make it easier for existing apps to
port from flow director. Then, if an application cares, it can create a
list of ports if needed.
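
Purely as a sketch, such a meta item might look like this (neither the
item type nor the structure below is part of the RFC):

  /* Hypothetical PORT_LIST meta item: matches packets received on any
   * of the listed physical ports of the same adapter. */
  struct rte_flow_item_port_list {
          uint32_t nb_ports;   /* number of entries in port_id[] */
          uint16_t port_id[];  /* DPDK port IDs; default one entry */
  };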

>>> - Match range of Physical Functions (PFs) on the NIC in a single rule
>>>   via masks. For ex: match all traffic coming on several PFs.
>>
>> The PF and VF pattern items assume there is a single PF associated with a
>> DPDK port. VFs are identified with an ID. I basically took the same
>> definitions as the existing filter types, perhaps this is not enough for
>> Chelsio adapters.
>>
>> Do you expose more than one PF for a DPDK port?
>>
>> Anyway, I'd suggest the same approach as above, automatic aggregation of
>> rules for performance reasons, otherwise new or updated PF/VF pattern items,
>> in which case it would be great if you could provide ideal structure
>> definitions for this use case.
>>
> 
> In Chelsio hardware, all the ports of a device are exposed via a single
> PF (PF4). There could be many VFs attached to a PF.  Physical NIC functions
> are operational on PF4, while VFs can be attached to PFs 0-3.
> So, Chelsio hardware is not tied to a one-to-one PF-to-port
> mapping assumption.
> 
> There already seems to be a PF meta-item, but it doesn't seem to accept
> any "spec" or "mask" fields.  Similarly, the VF meta-item doesn't
> seem to accept a "mask" field.  We could probably enable these fields
> in the PF and VF meta-items to allow configuration.

Maybe a range field would help here as well, so you could specify a VF
range? It might be one of the things to consider adding later though, if
there is no clear use for it now.
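
Note that if the VF item simply gained the "mask" field suggested above,
power-of-two-aligned ranges could already be expressed without a
dedicated range field. A sketch, assuming the VF item exposes a 32-bit
"id" field:

  /* Match VFs 8-15: spec provides the base ID, the mask clears the
   * low bits so they are ignored during matching. */
  struct rte_flow_item_vf vf_spec = { .id = 8 };
  struct rte_flow_item_vf vf_mask = { .id = ~UINT32_C(7) };

An explicit first/last pair would still be needed for arbitrary ranges,
which may be the argument for adding a range field later.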

> 
> >>> Please suggest how we can expose the above features to DPDK apps via
>>> this API.
>>>

[...]

Thanks,
John

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21 17:07   ` Adrien Mazarguil
@ 2016-07-25 11:32     ` Rahul Lakkireddy
  2016-07-25 16:40       ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Rahul Lakkireddy @ 2016-07-25 11:32 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley, Jing Chen,
	Konstantin Ananyev, Matej Vido, Alejandro Lucero, Sony Chacko,
	Jerin Jacob, Pablo de Lara, Olga Shern, Kumar Sanghvi,
	Nirranjan Kirubaharan, Indranil Choudhury

Hi Adrien,

On Thursday, July 07/21/16, 2016 at 19:07:38 +0200, Adrien Mazarguil wrote:
> Hi Rahul,
> 
> Please see below.
> 
> On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
> > Hi Adrien,
> > 
> > The proposal looks very good.  It satisfies most of the features
> > supported by Chelsio NICs.  We are looking for suggestions on exposing
> > more additional features supported by Chelsio NICs via this API.
> > 
> > Chelsio NICs have two regions in which filters can be placed -
> > Maskfull and Maskless regions.  As their names imply, the maskfull region
> > can accept masks to match a range of values; whereas, the maskless region
> > doesn't accept any masks and hence performs stricter exact matches.
> > Filters without masks can also be placed in the maskfull region.  By
> > default, the maskless region has higher priority over the maskfull region.
> > However, the priority between the two regions is configurable.
> 
> I understand this configuration affects the entire device. Just to be clear,
> assuming some filters are already configured, are they affected by a change
> of region priority later?
> 

Both regions exist at the same time in the device.  Each filter can
belong to either the maskfull or the maskless region.

The priority is configured at the time of filter creation for every
individual filter and cannot be changed while the filter is still
active. If the priority needs to be changed for a particular filter,
it needs to be deleted first and re-created.

> > Please suggest how we can let the apps configure in which region
> > filters must be placed and set the corresponding priority accordingly
> > via this API.
> 
> Okay. Applications, like customers, are always right.
> 
> With this in mind, PMDs are not allowed to break existing flow rules, and
> face two options when applications provide a flow specification that would
> break an existing rule:
> 
> - Refuse to create it (easiest).
> 
> - Find a workaround to apply it anyway (possibly quite complicated).
> 
> The reason you have these two regions is performance right? Otherwise I'd
> just say put everything in the maskfull region.
> 

Unfortunately, our maskfull region is extremely small compared to
the maskless region.

> PMDs are allowed to rearrange existing rules or change device parameters as
> long as existing constraints are satisfied. In my opinion it does not matter
> which region has the highest default priority. Applications always want the
> best performance so the first created rule should be in the most appropriate
> region.
> 
> If a subsequent rule requires it to be in the other region but the
> application specified the wrong priority for this to work, then the PMD can
> either choose to swap region priorities on the device (assuming it does not
> affect other rules), or destroy and recreate the original rule in a way that
> satisfies all constraints (i.e. moving conflicting rules from the maskless
> region to the maskfull one).
> 
> Going further, when subsequent rules get destroyed the PMD should ideally
> move maskfull rules back into the maskless region for better
> performance.
> 
> This is only a suggestion. PMDs have the right to say "no" at some point.
> 

Filter search and deletion are expensive operations and they need to be
atomic in order to not affect existing traffic already hitting the
filters.  Also, Chelsio hardware can support up to ~500,000 maskless
region entries.  So, the cost of search becomes too high for the PMD
when a large number of filter entries are present.

In my opinion, rather than the PMD deciding on priority by itself, it would
be simpler if the apps had the flexibility to configure the
priority between the regions themselves?

> More important in my opinion is to make sure applications can create a given
> set of flow rules in any order. If rules a/b/c can be created, then it won't
> make sense from an application point of view if c/a/b for some reason cannot
> and the PMD maintainers will rightfully get a bug report.
> 
> > More comments below.
> > 
> > On Tuesday, July 07/05/16, 2016 at 20:16:46 +0200, Adrien Mazarguil wrote:
> > > Hi All,
> > > 
> > [...]
> > 
> > > 
> > > ``ETH``
> > > ^^^^^^^
> > > 
> > > Matches an Ethernet header.
> > > 
> > > - ``dst``: destination MAC.
> > > - ``src``: source MAC.
> > > - ``type``: EtherType.
> > > - ``tags``: number of 802.1Q/ad tags defined.
> > > - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> > > 
> > >  - ``tpid``: Tag protocol identifier.
> > >  - ``tci``: Tag control information.
> > > 
> > > ``IPV4``
> > > ^^^^^^^^
> > > 
> > > Matches an IPv4 header.
> > > 
> > > - ``src``: source IP address.
> > > - ``dst``: destination IP address.
> > > - ``tos``: ToS/DSCP field.
> > > - ``ttl``: TTL field.
> > > - ``proto``: protocol number for the next layer.
> > > 
> > > ``IPV6``
> > > ^^^^^^^^
> > > 
> > > Matches an IPv6 header.
> > > 
> > > - ``src``: source IP address.
> > > - ``dst``: destination IP address.
> > > - ``tc``: traffic class field.
> > > - ``nh``: Next header field (protocol).
> > > - ``hop_limit``: hop limit field (TTL).
> > > 
> > > ``ICMP``
> > > ^^^^^^^^
> > > 
> > > Matches an ICMP header.
> > > 
> > > - TBD.
> > > 
> > > ``UDP``
> > > ^^^^^^^
> > > 
> > > Matches a UDP header.
> > > 
> > > - ``sport``: source port.
> > > - ``dport``: destination port.
> > > - ``length``: UDP length.
> > > - ``checksum``: UDP checksum.
> > > 
> > > .. raw:: pdf
> > > 
> > >    PageBreak
> > > 
> > > ``TCP``
> > > ^^^^^^^
> > > 
> > > Matches a TCP header.
> > > 
> > > - ``sport``: source port.
> > > - ``dport``: destination port.
> > > - All other TCP fields and bits.
> > > 
> > > ``VXLAN``
> > > ^^^^^^^^^
> > > 
> > > Matches a VXLAN header.
> > > 
> > > - TBD.
> > > 
> > 
> > In addition to above matches, Chelsio NICs have some additional
> > features:
> > 
> > - Match based on unicast DST-MAC, multicast DST-MAC, broadcast DST-MAC.
> >   Also, there is a match criteria available called 'promisc' - which
> >   matches packets that are not destined for the interface, but had
> >   been received by the hardware due to the interface being in promiscuous
> >   mode.
> 
> There is no unicast/multicast/broadcast distinction in the ETH pattern item,
> those are covered by the specified MAC address.
> 

Ok.  Makes sense.

> Now about this "promisc" match criteria, it can be added as a new meta
> pattern item (4.1.3 Meta item types). Do you want it to be defined from the
> start or add it later with the related code in your PMD?
> 

It could be added as a meta item.  If there are other interested
parties, it can be added now.  Otherwise, we'll add it with our filtering
related code.

> > - Match FCoE packets.
> 
> The "oE" part is covered by the ETH item (using the proper Ethertype), but a
> new pattern item should probably be added for the FC protocol itself. As
> above, I suggest adding it later.
> 

Same here.

> > - Match IP Fragmented packets.
> 
> It seems that I missed quite a few fields in the IPv4 item definition
> (struct rte_flow_item_ipv4). It should be in there.
> 

This raises another interesting question.  What should the PMD do
if it has support for only a subset of fields in the particular item?

For example, if a rule has been sent to match IP fragmentation along
with several other IPv4 fields, and if the underlying hardware doesn't
support matching based on IP fragmentation, does the PMD reject the
complete rule although it could have done the matching for the rest of the
IPv4 fields?

> > - Match range of physical ports on the NIC in a single rule via masks.
> >   For ex: match all UDP packets coming on ports 3 and 4 out of 4
> >   ports available on the NIC.
> 
> Applications create flow rules per port, I'd thus suggest that the PMD
> should detect identical rules created on different ports and aggregate them
> as a single HW rule automatically.
> 
> If you think this approach is not right, the alternative is a meta pattern
> item that provides a list of ports. I'm not sure this is the right approach
> considering it would most likely not be supported by most NICs. Applications
> may not request it explicitly.
> 

Aggregating via the PMD will be an expensive operation since it would involve:
- Search of existing filters.
- Deleting those filters.
- Creating a single combined filter.

And all of the above 3 operations would need to be atomic so as not to
affect existing traffic which is hitting the above filters. Adding a
meta item would be a simpler solution here.

> > - Match range of Physical Functions (PFs) on the NIC in a single rule
> >   via masks. For ex: match all traffic coming on several PFs.
> 
> The PF and VF pattern items assume there is a single PF associated with a
> DPDK port. VFs are identified with an ID. I basically took the same
> definitions as the existing filter types, perhaps this is not enough for
> Chelsio adapters.
> 
> Do you expose more than one PF for a DPDK port?
> 
> Anyway, I'd suggest the same approach as above, automatic aggregation of
> rules for performance reasons, otherwise new or updated PF/VF pattern items,
> in which case it would be great if you could provide ideal structure
> definitions for this use case.
> 

In Chelsio hardware, all the ports of a device are exposed via a single
PF (PF4). There could be many VFs attached to a PF.  Physical NIC functions
are operational on PF4, while VFs can be attached to PFs 0-3.
So, Chelsio hardware is not tied to a one-to-one PF-to-port
mapping assumption.

There already seems to be a PF meta-item, but it doesn't seem to accept
any "spec" or "mask" fields.  Similarly, the VF meta-item doesn't
seem to accept a "mask" field.  We could probably enable these fields
in the PF and VF meta-items to allow configuration.

> > Please suggest how we can expose the above features to DPDK apps via
> > this API.
> > 
> > [...]
> > 
> > > 
> > > Actions
> > > ~~~~~~~
> > > 
> > > Each possible action is represented by a type. Some have associated
> > > configuration structures. Several actions combined in a list can be affected
> > > to a flow rule. That list is not ordered.
> > > 
> > > At least one action must be defined in a filter rule in order to do
> > > something with matched packets.
> > > 
> > > - Actions are defined with ``struct rte_flow_action``.
> > > - A list of actions is defined with ``struct rte_flow_actions``.
> > > 
> > > They fall in three categories:
> > > 
> > > - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
> > >   processing matched packets by subsequent flow rules, unless overridden
> > >   with PASSTHRU.
> > > 
> > > - Non terminating actions (PASSTHRU, DUP) that leave matched packets up for
> > >   additional processing by subsequent flow rules.
> > > 
> > > - Other non terminating meta actions that do not affect the fate of packets
> > >   (END, VOID, ID, COUNT).
> > > 
> > > When several actions are combined in a flow rule, they should all have
> > > different types (e.g. dropping a packet twice is not possible). However
> > > considering the VOID type is an exception to this rule, the defined behavior
> > > is for PMDs to only take into account the last action of a given type found
> > > in the list. PMDs still perform error checking on the entire list.
> > > 
> > > *Note that PASSTHRU is the only action able to override a terminating rule.*
> > > 
> > 
> > Chelsio NICs can support an action 'switch' which can re-direct
> > matched packets from one port to another port in hardware.  In addition,
> > it can also optionally:
> > 
> > 1. Perform header rewrites (src-mac/dst-mac rewrite, src-mac/dst-mac
> >    swap, vlan add/remove/rewrite).
> > 
> > 2. Perform NAT'ing in hardware (4-tuple rewrite).
> > 
> > before sending it out on the wire [1].
> > 
> > To meet the above requirements, we'd need a way to pass sub-actions
> > to action 'switch' and a way to pass extra info (such as new
> > src-mac/dst-mac, new vlan, new 4-tuple for NAT) to rewrite
> > corresponding fields.
> > 
> > We're looking for suggestions on how we can achieve action 'switch'
> > in this new API.
> > 
> > From our understanding of this API, we could just expand
> > rte_flow_action_type with an additional action type
> > RTE_FLOW_ACTION_TYPE_SWITCH and define several sub-actions such as:
> > 
> > enum rte_flow_action_switch_type {
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_NONE,
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_MAC_REWRITE,
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_MAC_SWAP,
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_INSERT,
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_DELETE,
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_REWRITE,
> > 	RTE_FLOW_ACTION_SWITCH_TYPE_NAT_REWRITE,
> > };
> > 
> > and then define an rte_flow_action_switch as follows:
> > 
> > struct rte_flow_action_switch {
> > 	enum rte_flow_action_switch_type type; /* sub-actions to perform */
> > 	uint16_t port;	/* destination physical port to switch packet to */
> > 	enum rte_flow_item_type item_type; /* fields to rewrite */
> > 	const void *switch_spec;
> > 	/* Holds info to rewrite matched flows */
> > };
> > 
> > Does the above approach sound right with respect to this new API?
> 
> It does. Not sure I'd go with a sublevel of actions for switching types
> though. Think more generic: MAC swapping, MAC rewrite and so on could be
> defined as separate actions usable on their own by all PMDs, so you'd
> combine a SWITCH action with these.
> 

Ok.  Separate actions seem better.
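
As a sketch of that direction, the NAT use case could then be composed
from independent actions (all action type names below are placeholders,
not part of the RFC, and the action structure is assumed to carry a type
plus an opaque configuration pointer):

  /* new_ipv4, new_ports and out_port would be defined elsewhere. */
  struct rte_flow_action actions[] = {
          { .type = RTE_FLOW_ACTION_TYPE_IPV4_REWRITE, .conf = &new_ipv4 },
          { .type = RTE_FLOW_ACTION_TYPE_TCP_REWRITE, .conf = &new_ports },
          { .type = RTE_FLOW_ACTION_TYPE_SWITCH, .conf = &out_port },
          { .type = RTE_FLOW_ACTION_TYPE_END },
  };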

> Also be careful with the "port" field definition. I'm sure the Chelsio
> switch cannot make a packet leave through a port of a Mellanox device. DPDK
> applications are aware of a single "port namespace", so it has to be
> translated by the PMD to a physical port on the same Chelsio adapter,
> otherwise rule creation should fail.
> 

Agreed.

> > [...]
> > 
> > > 
> > > ``COUNT``
> > > ^^^^^^^^^
> > > 
> > > Enables hits counter for this rule.
> > > 
> > > This counter can be retrieved and reset through ``rte_flow_query()``, see
> > > ``struct rte_flow_query_count``.
> > > 
> > > - Counters can be retrieved with ``rte_flow_query()``.
> > > - No configurable property.
> > > 
> > > +---------------+
> > > | COUNT         |
> > > +===============+
> > > | no properties |
> > > +---------------+
> > > 
> > > Query structure to retrieve and reset the flow rule hits counter:
> > > 
> > > +------------------------------------------------+
> > > | COUNT query                                    |
> > > +===========+=====+==============================+
> > > | ``reset`` | in  | reset counter after query    |
> > > +-----------+-----+------------------------------+
> > > | ``hits``  | out | number of hits for this flow |
> > > +-----------+-----+------------------------------+
> > > 
> > 
> > Chelsio NICs can also count the number of bytes that hit the rule.
> > So, we need a "bytes" counter.
> 
> As well as the number of packets, right? Anyway it makes sense, I'll add
> a "bytes" field.
> 

Yes.  I hope 'hits' is analogous to 'number of packets', right?
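
For reference, the updated query structure might then look like the
following sketch (the exact layout is up to the final header):

  struct rte_flow_query_count {
          uint32_t reset:1; /* in: reset counters after query */
          uint64_t hits;    /* out: number of packets for this flow */
          uint64_t bytes;   /* out: number of bytes, to be added */
  };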

> > [1]  http://www.dpdk.org/ml/archives/dev/2016-February/032605.html
> 
> Wow. I hadn't paid much attention to this thread at the time. Tweaking FDIR
> this much was quite an achievement. Now I feel lazy with my proposal for a
> brand new API instead, thanks.

Thanks.  A generic filtering infrastructure was what I was aiming for
and this proposal seems to have hit the mark.

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21 19:20   ` Adrien Mazarguil
@ 2016-07-23 21:10     ` John Fastabend
  2016-08-02 18:19       ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: John Fastabend @ 2016-07-23 21:10 UTC (permalink / raw)
  To: Jerin Jacob, dev, Thomas Monjalon, Helin Zhang, Jingjing Wu,
	Rasesh Mody, Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu,
	Jan Medala, John Daley, Jing Chen, Konstantin Ananyev,
	Matej Vido, Alejandro Lucero, Sony Chacko, Pablo de Lara,
	Olga Shern

On 16-07-21 12:20 PM, Adrien Mazarguil wrote:
> Hi Jerin,
> 
> Sorry, looks like I missed your reply. Please see below.
> 

Hi Adrien,

Sorry for the slight delay, but here are a few comments that may be worth
considering.

To start with, I completely agree with the general problem statement and
the nice summary of all the current models. Also, this is a good start.

> 
> Considering that allowed pattern/actions combinations cannot be known in
> advance and would result in an unpractically large number of capabilities to
> expose, a method is provided to validate a given rule from the current
> device configuration state without actually adding it (akin to a "dry run"
> mode).

Rather than have a query/validate process, why did we skip over having an
intermediate representation of the capabilities? Here you state it is
impractical, but we know how to represent parse graphs, and the drivers
could report their supported parse graph via a single query to a middle
layer.

This would actually reduce the message chatter: imagine many applications
at init time, or boundary cases where a large set of applications comes
online at once and starts banging on the interface all at once; that seems
less than ideal.

Worse, in my opinion, it requires all drivers to write mostly duplicated
validation code, where a common layer could easily do this if every
driver reported a common data structure representing its parse graph
instead. The nice fallout of this upfront effort is that the driver
no longer needs to do error handling/checking/etc. and can assume all
rules are correct and valid. It makes driver code much simpler to
support. And IMO at least by doing this we get some other nice benefits,
described below.

Another related question is about performance.

> Creation
> ~~~~~~~~
> 
> Creating a flow rule is similar to validating one, except the rule is
> actually created.
> 
> ::
> 
>  struct rte_flow *
>  rte_flow_create(uint8_t port_id,
>                  const struct rte_flow_pattern *pattern,
>                  const struct rte_flow_actions *actions);

I gather this implies that each driver must parse the pattern/action
block and map this onto the hardware. How many rules per second can this
support? I've run into systems that expect a level of service somewhere
around 50k cmds per second. So bulking will help at the message level
but it seems like a lot of overhead to unpack the pattern/action section.

One strategy I've used in other systems that worked relatively well
is if the query for the parse graph above returns a key for each node
in the graph, then a single lookup can map the key to a node. It's
unambiguous, and then these operations simply become a table lookup.
So to be a bit more concrete, this changes the pattern structure in
rte_flow_create() into a <key,value,mask> tuple where the key is known
from the initial parse graph query. If you reserve a set of well-defined
key values for well-known protocols like Ethernet, IP, etc., then the
query model also works, but the middle layer catches errors in this case
and again the driver only gets known-good flows. So something like this,

  struct rte_flow_pattern {
	uint32_t priority;
	uint32_t key;          /* node key from the parse graph query */
	uint32_t value_length;
	uint8_t *value;
	uint8_t *mask;         /* same length as value */
  };
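
For illustration, matching an IPv4 destination network with such a tuple
might look like this (the key constant is made up; well-known keys would
be reserved as described above):

  /* Hypothetical well-known key reserved for the IPv4 destination
   * field; other keys would come from the parse graph query. */
  #define FLOW_KEY_IPV4_DST 3

  uint32_t dst = rte_cpu_to_be_32(0xc0a80101);      /* 192.168.1.1 */
  uint32_t dst_mask = rte_cpu_to_be_32(0xffffff00); /* /24 prefix */
  struct rte_flow_pattern pattern = {
	.priority = 0,
	.key = FLOW_KEY_IPV4_DST,
	.value_length = sizeof(dst),
	.value = (uint8_t *)&dst,
	.mask = (uint8_t *)&dst_mask,
  };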

Also, if we have multiple tables, what do you think about adding a
table_id to the signature? It is probably not needed in the first
generation, but it is likely useful for hardware with multiple tables,
so that it would be,
   rte_flow_create(uint8_t port_id, uint8_t table_id, ...);

Finally, one other problem we've had, which would be great to address
if we are doing a rewrite of the API, is adding new protocols to
already-deployed DPDK stacks. This is mostly a Linux distribution
problem where you can't easily update DPDK.

In the prototype header linked in this document, it seems that adding new
headers requires adding a new value to the rte_flow_item_type enum, but
there is at least an attempt at a catch-all here,

> 	/**
> 	 * Matches a string of a given length at a given offset (in bytes),
> 	 * or anywhere in the payload of the current protocol layer
> 	 * (including L2 header if used as the first item in the stack).
> 	 *
> 	 * See struct rte_flow_item_raw.
> 	 */
> 	RTE_FLOW_ITEM_TYPE_RAW,

Actually this is a nice implementation because it works after the
previous item in the stack, correct? So you can put it after "known"
variable-length headers like IP. The limitation is it can't get past
undefined variable-length headers. However, if you use the above parse
graph reporting mechanism from the driver and the driver always reports
its largest supported graph, then we don't have this issue where a new
hardware SKU/ucode/etc. added support for new headers but we have no
way to deploy it to existing software users without recompiling and
redeploying.
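
As an example of that relative behavior (struct rte_flow_item_raw field
names below are guessed from its description, not confirmed):

  /* Match a 4-byte application magic right after the current layer,
   * e.g. when stacked after ETH/IPV4/UDP items. */
  static const uint8_t magic[] = { 0xde, 0xad, 0xbe, 0xef };
  struct rte_flow_item_raw raw_spec = {
          .offset = 0,            /* from the end of the previous layer */
          .length = sizeof(magic),
          .pattern = magic,
  };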

I looked at the git repo, but I only saw the header definition; I guess
the implementation is TBD once there is enough agreement on the
interface?

Thanks,
John

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21 13:37                       ` Adrien Mazarguil
@ 2016-07-22 16:32                         ` Chandran, Sugesh
  0 siblings, 0 replies; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-22 16:32 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

Hi Adrien,
Thank you for your effort and for considering the inputs and comments.
The design looks fine to me now.


Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Thursday, July 21, 2016 2:37 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>;
> Chilikin, Andrey <andrey.chilikin@intel.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> Hi Sugesh,
> 
> I do not have much to add, please see below.
> 
> On Thu, Jul 21, 2016 at 11:06:52AM +0000, Chandran, Sugesh wrote:
> [...]
> > > > RSS hashing support :- Just to confirm, the RSS flow action allows
> > > > application to decide the header fields to produce the hash. This
> > > > gives programmability on load sharing across different queues. The
> > > > application can program the NIC to calculate the RSS hash only
> > > > using mac or mac+ ip or ip only using this.
> > >
> > > I'd say yes but from your summary, I'm not sure we share the same
> > > idea of what the RSS action is supposed to do, so here is mine.
> > >
> > > Like all flow rules, the pattern part of the RSS action only filters
> > > the packets on which the action will be performed.
> > >
> > > The rss_conf parameter (struct rte_eth_rss_conf) only provides a key
> > > and a RSS hash function to use (ETH_RSS_IPV4,
> > > ETH_RSS_NONFRAG_IPV6_UDP, etc).
> > >
> > > Nothing prevents the RSS hash function from being applied to
> > > protocol headers which are not necessarily present in the flow rule
> > > pattern. These are two independent things, e.g. you could have a
> > > pattern matching IPv4 packets yet perform RSS hashing only on UDP
> headers.
> > >
> > > Finally, the RSS action configuration only affects packets coming
> > > from this flow rule. It is not performed on the device globally so
> > > packets which are not matched are not affected by RSS processing. As
> > > a result it might not be possible to configure two flow rules
> > > specifying incompatible RSS actions simultaneously if the underlying
> > > device supports only a single global RSS context.
> > >
> > [Sugesh] Thank you for the explanation. This means I can have a rule
> > that matches on every incoming packet (an all-field wildcard rule) and
> > does an RSS hash on selected fields: MAC only, IP only or IP & MAC?
> 
> Yes, I guess it could even replace the current method for configuring RSS on a
> device in a more versatile fashion, but this is a topic for another debate.
> 
> Let's implement this API first!
> 
> > This can be useful to do a packet lookup in software by just using
> > only the hash.
> 
> I'm not sure I fully understand your idea, but I'm positive it could be done
> somehow :)
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21 12:47               ` Adrien Mazarguil
@ 2016-07-22  1:38                 ` Lu, Wenzhuo
  0 siblings, 0 replies; 54+ messages in thread
From: Lu, Wenzhuo @ 2016-07-22  1:38 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Adrien,


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Thursday, July 21, 2016 8:48 PM
> To: Lu, Wenzhuo
> Cc: dev@dpdk.org; Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody;
> Ajit Khaparde; Rahul Lakkireddy; Jan Medala; John Daley; Chen, Jing D; Ananyev,
> Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> Guarch, Pablo; Olga Shern
> Subject: Re: [RFC] Generic flow director/filtering/classification API
> 
> Hi Wenzhuo,
> 
> It seems that we agree on just about everything now, just a few more comments
> below after snipping the now-irrelevant parts.
Yes, I think we agree with each other now. And thanks for the additional explanation :)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-11 10:41 ` Jerin Jacob
@ 2016-07-21 19:20   ` Adrien Mazarguil
  2016-07-23 21:10     ` John Fastabend
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-21 19:20 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu, Jan Medala,
	John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
	Alejandro Lucero, Sony Chacko, Pablo de Lara, Olga Shern

Hi Jerin,

Sorry, looks like I missed your reply. Please see below.

On Mon, Jul 11, 2016 at 04:11:43PM +0530, Jerin Jacob wrote:
> On Tue, Jul 05, 2016 at 08:16:46PM +0200, Adrien Mazarguil wrote:
> 
> Hi Adrien,
> 
> Overall this proposal looks very good. I could easily map it to the
> classification hardware engines I am familiar with.

Great, thanks.

> > Priorities
> > ~~~~~~~~~~
> > 
> > A priority can be assigned to a matching pattern.
> > 
> > The default priority level is 0 and is also the highest. Support for more
> > than a single priority level in hardware is not guaranteed.
> > 
> > If a packet is matched by several filters at a given priority level, the
> > outcome is undefined. It can take any path and can even be duplicated.
> 
> In some cases a fatal unrecoverable error can occur too

Right, do you think I need to elaborate regarding unrecoverable errors?

How unrecoverable, by the way? Like not being able to receive any more
packets?

> > Matching pattern items for packet data must be naturally stacked (ordered
> > from lowest to highest protocol layer), as in the following examples:
> > 
> > +--------------+
> > | TCPv4 as L4  |
> > +===+==========+
> > | 0 | Ethernet |
> > +---+----------+
> > | 1 | IPv4     |
> > +---+----------+
> > | 2 | TCP      |
> > +---+----------+
> > 
> > +----------------+
> > | TCPv6 in VXLAN |
> > +===+============+
> > | 0 | Ethernet   |
> > +---+------------+
> > | 1 | IPv4       |
> > +---+------------+
> > | 2 | UDP        |
> > +---+------------+
> > | 3 | VXLAN      |
> > +---+------------+
> > | 4 | Ethernet   |
> > +---+------------+
> > | 5 | IPv6       |
> > +---+------------+
> 
> How about enumerating it as an "Inner-IPV6" flow type to avoid any confusion? Though the spec
> can be the same for both IPv6 and Inner-IPV6.

I'm not sure; if we have more than two encapsulated IPv6 headers, knowing
that one of them is "inner" is not really useful. This is why I chose to
enforce the stack ordering instead, I think it makes more sense.

> > | 6 | TCP        |
> > +---+------------+
> > 
> > +-----------------------------+
> > | TCPv4 as L4 with meta items |
> > +===+=========================+
> > | 0 | VOID                    |
> > +---+-------------------------+
> > | 1 | Ethernet                |
> > +---+-------------------------+
> > | 2 | VOID                    |
> > +---+-------------------------+
> > | 3 | IPv4                    |
> > +---+-------------------------+
> > | 4 | TCP                     |
> > +---+-------------------------+
> > | 5 | VOID                    |
> > +---+-------------------------+
> > | 6 | VOID                    |
> > +---+-------------------------+
> > 
> > The above example shows how meta items do not affect packet data matching
> > items, as long as those remain stacked properly. The resulting matching
> > pattern is identical to "TCPv4 as L4".
> > 
> > +----------------+
> > | UDPv6 anywhere |
> > +===+============+
> > | 0 | IPv6       |
> > +---+------------+
> > | 1 | UDP        |
> > +---+------------+
> > 
> > If supported by the PMD, omitting one or several protocol layers at the
> > bottom of the stack as in the above example (missing an Ethernet
> > specification) enables hardware to look anywhere in packets.
> 
> It would be good if the common code could give it as Ethernet, IPv6, UDP
> to the PMD (to avoid common code duplication across PMDs).

I left this mostly at PMD's discretion for now. Applications must provide
explicit rules if they need a consistent behavior. PMDs may not support this
at all, I've just documented what applications should expect when attempting
this kind of pattern.

> > It is unspecified whether the payload of supported encapsulations
> > (e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
> > inner, outer or both packets.
> 
> A separate flow type enumeration may fix that problem, like the "Inner-IPV6"
> type mentioned above.

Not sure about that, for the same reason as above. Which "inner" level would
be matched by such a pattern? Note that it could have started with VXLAN
followed by ETH and then IPv6 if the application cared.

This is basically the ability to remain vague about a rule. I didn't want to
forbid it outright because I'm sure there are possible use cases:

- PMD validation and debugging.

- Rough filtering according to protocols a packet might contain somewhere
  (think of the network admins who cannot stand anything other than packets
  addressed to TCP port 80; see the sketch below).
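
The sketch for that second use case, assuming struct rte_flow_item
carries { type, spec, mask } as in the draft header:

  /* Match TCP destination port 80 anywhere; lower layers are omitted
   * on purpose, which PMDs may or may not support. */
  struct rte_flow_item_tcp tcp_spec = { .dport = rte_cpu_to_be_16(80) };
  struct rte_flow_item_tcp tcp_mask = { .dport = UINT16_MAX };
  struct rte_flow_item pattern[] = {
          { .type = RTE_FLOW_ITEM_TYPE_TCP,
            .spec = &tcp_spec,
            .mask = &tcp_mask },
  };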

> > +---------------------+
> > | Invalid, missing L3 |
> > +===+=================+
> > | 0 | Ethernet        |
> > +---+-----------------+
> > | 1 | UDP             |
> > +---+-----------------+
> > 
> > The above pattern is invalid due to a missing L3 specification between L2
> > and L4. It is only allowed at the bottom and at the top of the stack.
> > 
> 
> > ``SIGNATURE``
> > ^^^^^^^^^^^^^
> > 
> > Requests hash-based signature dispatching for this rule.
> > 
> > Considering this is a global setting on devices that support it, all
> > subsequent filter rules may have to be created with it as well.
> 
> Can you describe the use case for this and how it's different from
> the existing rte_ethdev RSS settings?

Erm, well, this is my attempt at reimplementing RTE_FDIR_MODE_SIGNATURE
without being too sure of what it actually /does/. So far this definition
hasn't raised any eyebrows.

By default this API works like RTE_FDIR_MODE_PERFECT, where protocol headers
are matched with specific patterns. I understand this signature mode as not
giving the same results (a packet that would have matched a pattern in
perfect mode may not match in signature mode and vice versa) as there is
some signature involved at some point for some reason. OK, I really have no
idea.

I'm confident it is not related to RSS though. Perhaps people more familiar
with this mode (ixgbe maintainers) should comment. Maybe this mode is not
all that useful anymore.

> > - Only ``spec`` needs to be defined, ``mask`` is ignored.
> > 
> > +--------------------+
> > | SIGNATURE          |
> > +==========+=========+
> > | ``spec`` | TBD     |
> > +----------+---------+
> > | ``mask`` | ignored |
> > +----------+---------+
> > 
> 
> > 
> > ``ETH``
> > ^^^^^^^
> > 
> > Matches an Ethernet header.
> > 
> > - ``dst``: destination MAC.
> > - ``src``: source MAC.
> > - ``type``: EtherType.
> > - ``tags``: number of 802.1Q/ad tags defined.
> > - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> > 
> >  - ``tpid``: Tag protocol identifier.
> >  - ``tci``: Tag control information.
> 
> Find below other L2 layer attributes that are useful in HW classification:
> 
> - HiGig headers
> - DSA Headers
> - MPLS
> 
> Maybe we need to introduce a separate flow type with a spec to add the support. Right?

Yeah, while I'm only familiar with MPLS, I think it's better to let these
so-called "2.5" protocol layers have their own specifications. I think
missing protocols will appear at the same time as PMD support.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21  8:13 ` Rahul Lakkireddy
@ 2016-07-21 17:07   ` Adrien Mazarguil
  2016-07-25 11:32     ` Rahul Lakkireddy
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-21 17:07 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley, Jing Chen,
	Konstantin Ananyev, Matej Vido, Alejandro Lucero, Sony Chacko,
	Jerin Jacob, Pablo de Lara, Olga Shern, Kumar Sanghvi,
	Nirranjan Kirubaharan, Indranil Choudhury

Hi Rahul,

Please see below.

On Thu, Jul 21, 2016 at 01:43:37PM +0530, Rahul Lakkireddy wrote:
> Hi Adrien,
> 
> The proposal looks very good.  It satisfies most of the features
> supported by Chelsio NICs.  We are looking for suggestions on exposing
> more additional features supported by Chelsio NICs via this API.
> 
> Chelsio NICs have two regions in which filters can be placed -
> Maskfull and Maskless regions.  As their names imply, the maskfull region
> can accept masks to match a range of values; whereas, the maskless region
> doesn't accept any masks and hence performs stricter exact matches.
> Filters without masks can also be placed in the maskfull region.  By
> default, the maskless region has higher priority over the maskfull region.
> However, the priority between the two regions is configurable.

I understand this configuration affects the entire device. Just to be clear,
assuming some filters are already configured, are they affected by a change
of region priority later?

> Please suggest how we can let the apps configure in which region
> filters must be placed and set the corresponding priority accordingly
> via this API.

Okay. Applications, like customers, are always right.

With this in mind, PMDs are not allowed to break existing flow rules, and
face two options when applications provide a flow specification that would
break an existing rule:

- Refuse to create it (easiest).

- Find a workaround to apply it anyway (possibly quite complicated).

The reason you have these two regions is performance right? Otherwise I'd
just say put everything in the maskfull region.

PMDs are allowed to rearrange existing rules or change device parameters as
long as existing constraints are satisfied. In my opinion it does not matter
which region has the highest default priority. Applications always want the
best performance so the first created rule should be in the most appropriate
region.

If a subsequent rule requires it to be in the other region but the
application specified the wrong priority for this to work, then the PMD can
either choose to swap region priorities on the device (assuming it does not
affect other rules), or destroy and recreate the original rule in a way that
satisfies all constraints (i.e. moving conflicting rules from the maskless
region to the maskfull one).

Going further, when subsequent rules get destroyed the PMD should ideally
move maskfull rules back into the maskless region for better
performance.

This is only a suggestion. PMDs have the right to say "no" at some point.

More important in my opinion is to make sure applications can create a given
set of flow rules in any order. If rules a/b/c can be created, then it won't
make sense from an application point of view if c/a/b for some reason cannot
and the PMD maintainers will rightfully get a bug report.

> More comments below.
> 
> On Tuesday, July 07/05/16, 2016 at 20:16:46 +0200, Adrien Mazarguil wrote:
> > Hi All,
> > 
> [...]
> 
> > 
> > ``ETH``
> > ^^^^^^^
> > 
> > Matches an Ethernet header.
> > 
> > - ``dst``: destination MAC.
> > - ``src``: source MAC.
> > - ``type``: EtherType.
> > - ``tags``: number of 802.1Q/ad tags defined.
> > - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> > 
> >  - ``tpid``: Tag protocol identifier.
> >  - ``tci``: Tag control information.
> > 
> > ``IPV4``
> > ^^^^^^^^
> > 
> > Matches an IPv4 header.
> > 
> > - ``src``: source IP address.
> > - ``dst``: destination IP address.
> > - ``tos``: ToS/DSCP field.
> > - ``ttl``: TTL field.
> > - ``proto``: protocol number for the next layer.
> > 
> > ``IPV6``
> > ^^^^^^^^
> > 
> > Matches an IPv6 header.
> > 
> > - ``src``: source IP address.
> > - ``dst``: destination IP address.
> > - ``tc``: traffic class field.
> > - ``nh``: Next header field (protocol).
> > - ``hop_limit``: hop limit field (TTL).
> > 
> > ``ICMP``
> > ^^^^^^^^
> > 
> > Matches an ICMP header.
> > 
> > - TBD.
> > 
> > ``UDP``
> > ^^^^^^^
> > 
> > Matches a UDP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - ``length``: UDP length.
> > - ``checksum``: UDP checksum.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``TCP``
> > ^^^^^^^
> > 
> > Matches a TCP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - All other TCP fields and bits.
> > 
> > ``VXLAN``
> > ^^^^^^^^^
> > 
> > Matches a VXLAN header.
> > 
> > - TBD.
> > 
> 
> In addition to above matches, Chelsio NICs have some additional
> features:
> 
> - Match based on unicast DST-MAC, multicast DST-MAC, broadcast DST-MAC.
>   Also, there is a match criteria available called 'promisc' - which
>   matches packets that are not destined for the interface, but had
>   been received by the hardware due to the interface being in promiscuous
>   mode.

There is no unicast/multicast/broadcast distinction in the ETH pattern item,
those are covered by the specified MAC address.

Now about this "promisc" match criteria, it can be added as a new meta
pattern item (4.1.3 Meta item types). Do you want it to be defined from the
start or add it later with the related code in your PMD?

> - Match FCoE packets.

The "oE" part is covered by the ETH item (using the proper Ethertype), but a
new pattern item should probably be added for the FC protocol itself. As
above, I suggest adding it later.

> - Match IP Fragmented packets.

It seems that I missed quite a few fields in the IPv4 item definition
(struct rte_flow_item_ipv4). It should be in there.

> - Match range of physical ports on the NIC in a single rule via masks.
>   For ex: match all UDP packets coming on ports 3 and 4 out of 4
>   ports available on the NIC.

Applications create flow rules per port, I'd thus suggest that the PMD
should detect identical rules created on different ports and aggregate them
as a single HW rule automatically.

If you think this approach is not right, the alternative is a meta pattern
item that provides a list of ports. I'm not sure this is the right approach
considering it would most likely not be supported by most NICs. Applications
may not request it explicitly.

> - Match range of Physical Functions (PFs) on the NIC in a single rule
>   via masks. For ex: match all traffic coming on several PFs.

The PF and VF pattern items assume there is a single PF associated with a
DPDK port. VFs are identified with an ID. I basically took the same
definitions as the existing filter types, perhaps this is not enough for
Chelsio adapters.

Do you expose more than one PF for a DPDK port?

Anyway, I'd suggest the same approach as above, automatic aggregation of
rules for performance reasons, otherwise new or updated PF/VF pattern items,
in which case it would be great if you could provide ideal structure
definitions for this use case.

> Please suggest how we can expose the above features to DPDK apps via
> this API.
> 
> [...]
> 
> > 
> > Actions
> > ~~~~~~~
> > 
> > Each possible action is represented by a type. Some have associated
> > configuration structures. Several actions combined in a list can be assigned
> > to a flow rule. That list is not ordered.
> > 
> > At least one action must be defined in a filter rule in order to do
> > something with matched packets.
> > 
> > - Actions are defined with ``struct rte_flow_action``.
> > - A list of actions is defined with ``struct rte_flow_actions``.
> > 
> > They fall in three categories:
> > 
> > - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
> >   processing matched packets by subsequent flow rules, unless overridden
> >   with PASSTHRU.
> > 
> > - Non terminating actions (PASSTHRU, DUP) that leave matched packets up for
> >   additional processing by subsequent flow rules.
> > 
> > - Other non terminating meta actions that do not affect the fate of packets
> >   (END, VOID, ID, COUNT).
> > 
> > When several actions are combined in a flow rule, they should all have
> > different types (e.g. dropping a packet twice is not possible). However
> > considering the VOID type is an exception to this rule, the defined behavior
> > is for PMDs to only take into account the last action of a given type found
> > in the list. PMDs still perform error checking on the entire list.
> > 
> > *Note that PASSTHRU is the only action able to override a terminating rule.*
> > 
> 
> Chelsio NICs can support an action 'switch' which can re-direct
> matched packets from one port to another port in hardware.  In addition,
> it can also optionally:
> 
> 1. Perform header rewrites (src-mac/dst-mac rewrite, src-mac/dst-mac
>    swap, vlan add/remove/rewrite).
> 
> 2. Perform NAT'ing in hardware (4-tuple rewrite).
> 
> before sending it out on the wire [1].
> 
> To meet the above requirements, we'd need a way to pass sub-actions
> to action 'switch' and a way to pass extra info (such as new
> src-mac/dst-mac, new vlan, new 4-tuple for NAT) to rewrite
> corresponding fields.
> 
> We're looking for suggestions on how we can achieve action 'switch'
> in this new API.
> 
> From our understanding of this API, we could just expand
> rte_flow_action_type with an additional action type
> RTE_FLOW_ACTION_TYPE_SWITCH and define several sub-actions such as:
> 
> enum rte_flow_action_switch_type {
> 	RTE_FLOW_ACTION_SWITCH_TYPE_NONE,
> 	RTE_FLOW_ACTION_SWITCH_TYPE_MAC_REWRITE,
> 	RTE_FLOW_ACTION_SWITCH_TYPE_MAC_SWAP,
> 	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_INSERT,
> 	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_DELETE,
> 	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_REWRITE,
> 	RTE_FLOW_ACTION_SWITCH_TYPE_NAT_REWRITE,
> };
> 
> and then define an rte_flow_action_switch as follows:
> 
> struct rte_flow_action_switch {
> 	enum rte_flow_action_switch_type type; /* sub-actions to perform */
> 	uint16_t port;	/* destination physical port to switch packet to */
> 	enum rte_flow_item_type item_type; /* fields to rewrite */
> 	const void *switch_spec;
> 	/* Holds info to rewrite matched flows */
> };
> 
> Does the above approach sound right with respect to this new API?

It does. Not sure I'd go with a sublevel of actions for switching types
though. Think more generic: MAC swapping, MAC rewrite and so on could be
defined as separate actions usable on their own by all PMDs, so you'd
combine a SWITCH action with these.

Also be careful with the "port" field definition. I'm sure the Chelsio
switch cannot make a packet leave through a port of a Mellanox device. DPDK
applications are aware of a single "port namespace", so it has to be
translated by the PMD to a physical port on the same Chelsio adapter,
otherwise rule creation should fail.

> [...]
> 
> > 
> > ``COUNT``
> > ^^^^^^^^^
> > 
> > Enables hits counter for this rule.
> > 
> > This counter can be retrieved and reset through ``rte_flow_query()``, see
> > ``struct rte_flow_query_count``.
> > 
> > - Counters can be retrieved with ``rte_flow_query()``.
> > - No configurable property.
> > 
> > +---------------+
> > | COUNT         |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > Query structure to retrieve and reset the flow rule hits counter:
> > 
> > +------------------------------------------------+
> > | COUNT query                                    |
> > +===========+=====+==============================+
> > | ``reset`` | in  | reset counter after query    |
> > +-----------+-----+------------------------------+
> > | ``hits``  | out | number of hits for this flow |
> > +-----------+-----+------------------------------+
> > 
> 
> Chelsio NICs can also count the number of bytes that hit the rule.
> So, we need a "bytes" counter.

As well as the number of packets, right? Anyway it makes sense, I'll add
a "bytes" field.

> [1]  http://www.dpdk.org/ml/archives/dev/2016-February/032605.html

Wow. I hadn't paid much attention to this thread at the time. Tweaking FDIR
this much was quite an achievement. Now I feel lazy with my proposal for a
brand new API instead, thanks.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21 11:06                     ` Chandran, Sugesh
@ 2016-07-21 13:37                       ` Adrien Mazarguil
  2016-07-22 16:32                         ` Chandran, Sugesh
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-21 13:37 UTC (permalink / raw)
  To: Chandran, Sugesh
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

Hi Sugesh,

I do not have much to add, please see below.

On Thu, Jul 21, 2016 at 11:06:52AM +0000, Chandran, Sugesh wrote:
[...]
> > > RSS hashing support :- Just to confirm, the RSS flow action allows
> > > application to decide the header fields to produce the hash. This
> > > gives programmability on load sharing across different queues. The
> > > application can program the NIC to calculate the RSS hash only using
> > > mac or mac+ ip or ip only using this.
> > 
> > I'd say yes but from your summary, I'm not sure we share the same idea of
> > what the RSS action is supposed to do, so here is mine.
> > 
> > Like all flow rules, the pattern part of the RSS action only filters the packets
> > on which the action will be performed.
> > 
> > The rss_conf parameter (struct rte_eth_rss_conf) only provides a key and a
> > RSS hash function to use (ETH_RSS_IPV4, ETH_RSS_NONFRAG_IPV6_UDP,
> > etc).
> > 
> > Nothing prevents the RSS hash function from being applied to protocol
> > headers which are not necessarily present in the flow rule pattern. These are
> > two independent things, e.g. you could have a pattern matching IPv4 packets
> > yet perform RSS hashing only on UDP headers.
> > 
> > Finally, the RSS action configuration only affects packets coming from this
> > flow rule. It is not performed on the device globally so packets which are not
> > matched are not affected by RSS processing. As a result it might not be
> > possible to configure two flow rules specifying incompatible RSS actions
> > simultaneously if the underlying device supports only a single global RSS
> > context.
> > 
> [Sugesh] Thank you for the explanation. This means I can have a rule that matches on
> every incoming packet (an all-field wildcard rule) and does an RSS hash on selected
> fields: MAC only, IP only or IP & MAC?

Yes, I guess it could even replace the current method for configuring RSS on
a device in a more versatile fashion, but this is a topic for another
debate.

Let's implement this API first!

> This can be useful to do a packet lookup in software by just using
> only the hash.

I'm not sure I fully understand your idea, but I'm positive it could be done
somehow :)
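
To make the catch-all + RSS idea concrete, here is a sketch (the RSS
action is assumed to embed struct rte_eth_rss_conf; the final layout may
differ):

  /* An empty pattern matches every packet; RSS then hashes on IPv4
   * addresses only, regardless of the absent pattern items. */
  struct rte_eth_rss_conf rss_conf = {
          .rss_key = NULL,        /* keep the device's default key */
          .rss_hf = ETH_RSS_IPV4, /* hash on IPv4 src/dst only */
  };

The RSS action would also carry the list of target queues to spread
matched packets across.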

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-21  3:18             ` Lu, Wenzhuo
@ 2016-07-21 12:47               ` Adrien Mazarguil
  2016-07-22  1:38                 ` Lu, Wenzhuo
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-21 12:47 UTC (permalink / raw)
  To: Lu, Wenzhuo
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Wenzhuo,

It seems that we agree on just about everything now, just a few more comments
below after snipping the now-irrelevant parts.

On Thu, Jul 21, 2016 at 03:18:11AM +0000, Lu, Wenzhuo wrote:
[...]
> > > > > Does it mean the PMD should store and maintain all the rules? Why not
> > > > > let rte do
> > > > that? I think if the PMD maintains all the rules, it means every kind of
> > > > NIC should have a copy of code for the rules. But if rte does that,
> > > > only one copy of code needs to be maintained, right?
> > > >
> > > > I've considered having rules stored in a common format understood at
> > > > the RTE level and not specific to each PMD and decided that the
> > > > opaque rte_flow pointer was a better choice for the following reasons:
> > > >
> > > > - Even though flow rules management is done in the control path, processing
> > > >   must be as fast as possible. Letting PMDs store flow rules using their own
> > > >   internal representation gives them the chance to achieve better
> > > >   performance.
> > > I don't quite understand. I think we're talking about maintaining the rules in SW. I
> > don't think there's something that needs to be optimized according to specific NICs. If
> > we need to optimize the code, I think we need to consider the CPU, OS... and
> > some common means. Am I wrong?
> > 
> > Perhaps we were talking about different things, here I was only explaining why
> > rte_flow (the result of creating a flow rule) should be opaque and fully managed
> > by the PMD. More on the SW side of things below.
> > 
> > > > - An opaque context managed by PMDs would probably have to be stored
> > > >   somewhere as well anyway.
> > > >
> > > > - PMDs may not need to allocate/store anything at all if they exclusively
> > > >   rely on HW state for everything. In my opinion, the generic API has enough
> > > >   constraints for this to work and maintain consistency between flow
> > > >   rules. Note this is currently how most PMDs implement FDIR and other
> > > >   filter types.
> > > Yes, the rules are stored by HW. But considering stopping/starting the device, the
> > rules in HW will be lost. We have to store the rules in SW and re-program them
> > when restarting the device.
> > 
> > Assume a HW capable of keeping flow rules programmed even during a
> > stop/start cycle (e.g. mlx4/mlx5 may be able to do it from DPDK point of view),
> > don't you think it is more efficient to standardize on this behavior and let PMDs
> > restore flow rules for HW that do not support it regardless of whether it would
> > be done by RTE or the application (SW)?
> I didn't know that. As some NICs already have the ability to keep the rules during a stop/start cycle, maybe it could be a trend :)

Well yeah, if you are wondering about that, these PMDs do not have the same
definition for port stop/start as lower level PMDs like ixgbe and i40e. In
the mlx4/mlx5 cases, most control path operations (queue creation,
destruction and general management) end up performed by kernel
drivers. Stopping a port does not really shut it down as the kernel still
manages its own netdevice independently.

[...]
> > > > - The flow rules format described in this specification (pattern / actions)
> > > >   will be used by applications directly, and will be free to arrange them in
> > > >   lists, trees or in any other way if they need to keep flow specifications
> > > >   around for further processing.
> > > Who will create the lists, trees or something else? According to the previous
> > > discussion, I think the APP will program the rules one by one. So if the APP
> > > organizes the rules into lists, trees..., the PMD doesn't know that.
> > > And you said "Given that the opaque rte_flow pointer associated with a flow
> > > rule is to be stored by the application". I'm lost here.
> > 
> > I guess that's because we're discussing two different things, flow rule
> > specifications and flow rule objects. Let me sum it up:
> > 
> > - Flow rule specifications are the patterns/actions combinations provided by
> >   applications to rte_flow_create(). Applications can store those as needed
> >   and organize them as they wish (hash, tree, list). Neither PMDs nor RTE
> >   will do it for them.
> > 
> > - Flow rule objects (struct rte_flow *) are generated when a flow rule is
> >   created. Applications must keep these around if they want to manipulate
> >   them later (i.e. destroy or query existing rules).
> Thanks for this clarification. So the specifications can be different from the objects, right? The specifications are what the APP wants; the objects are what the APP really gets, as rte_flow_create() can fail. Right?

Yes, precisely. Apps are also free to keep specifications around even in the
event of a flow creation failure. I think a generic software fallback will
be provided at some point.

[...]
> > What we seem to not agree about is that you think RTE should be responsible
> > for restoring flow rules of devices that lose them when stopped. I think doing so
> > is unfair to devices for which it is not the case and not really nice to applications,
> > so my opinion is that the PMD is responsible for restoring flow rules however it
> > wants. It is free to use RTE helpers to keep track of them, as long as it's all
> > managed internally.
> What I think is that RTE could store the flow rules and recreate them after restarting, in a function like rte_dev_start(), so the APP knows nothing about it. But according to the discussion above, I think the design doesn't support it, right?

Yes. Right now the design explicitly states that PMDs are on their own
regarding this (4.3 Behavior). While it could be modified, I really think it
would be less efficient for the reasons stated above.

> RTE doesn't store the flow rule objects, and even if it stored them, there's no way designed to re-program the objects. Also considering some HW doesn't need to be re-programmed, I think it's OK to let the PMD maintain the rules, as the re-programming is a NIC-specific requirement.

Great to finally agree on this point.

> > > > Thus from an application point of view, whatever happens when
> > > > stopping and restarting a port should not matter. If a flow rule was
> > > > present before, it must still be present afterwards. If the PMD had
> > > > to destroy flow rules and re-create them, it does not actually matter if they
> > differ slightly at the HW level, as long as:
> > > >
> > > > - Existing opaque flow rule pointers (rte_flow) are still valid to the PMD
> > > >   and refer to the same rules.
> > > >
> > > > - The overall behavior of all rules is the same.
> > > >
> > > > The list of rules you think of (patterns / actions) is maintained by
> > > > applications (not RTE), and only if they need them. RTE would needlessly
> > duplicate this.
> > > As I said before, I need more details to understand this. Maybe an example
> > > is better :)
> > 
> > The generic format both RTE and applications might understand is the one
> > described in this API (struct rte_flow_pattern and struct rte_flow_actions).
> > 
> > If we wanted RTE to maintain some sort of per-port state for flow rule
> > specifications, it would have to be a copy of these structures arranged somehow
> > (list or something else).
> > 
> > If we consider that PMDs need to keep a context object associated to a flow
> > rule (the opaque struct rte_flow *), then RTE would most likely have to store it
> > along with the flow specification.
> > 
> > Such a list may not be useful to applications (list lookups take time), so they
> > would implement their own redundant method. They might also require extra
> > room to attach some application context to flow rules. A generic list cannot plan
> > for it.
> > 
> > Applications know what they want to do with flow rules and are responsible for
> > managing them efficiently with RTE out of the way.
> > 
> > I'm not sure if this answered your question, if not, please describe a scenario
> > where a RTE-managed list of flow rules would be mandatory.
> Got your point and agree :)

Thanks.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-20 17:10                   ` Adrien Mazarguil
@ 2016-07-21 11:06                     ` Chandran, Sugesh
  2016-07-21 13:37                       ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-21 11:06 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey


Hi Adrien,
Please find my comments below

Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, July 20, 2016 6:11 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>;
> Chilikin, Andrey <andrey.chilikin@intel.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> Hi Sugesh,
> 
> Please see below.
> 
> On Wed, Jul 20, 2016 at 04:32:50PM +0000, Chandran, Sugesh wrote:
> [...]
> > > > How about a hardware flow flag in the packet descriptor that is set
> > > > when a packet hits any hardware rule? This way software isn't blocked
> > > > by a hardware rule. Even though there is an additional overhead of
> > > > validating this flag, the software datapath can identify
> > > > hardware-processed packets easily.
> > > > This way the packets traverse the software fallback path until the
> > > > rule configuration is complete. This flag avoids setting an ID
> > > > action for every hardware flow being configured.
> > >
> > > That makes sense. I see it as a sort of single bit ID but it could
> > > be implemented through a different action for less capable devices.
> > > PMDs that support 32 bit IDs could reuse the same code for both
> features.
> > >
> > > I understand you'd prefer having this feature always present,
> > > however we already know that not all PMDs/devices support it, and
> > > like everything else this is a kind of offload that needs to be
> > > explicitly requested by the application as it may not be needed.
> > >
> > > If we go with the separate action, then perhaps it would make sense
> > > to rename "ID" to "MARK" to make things clearer:
> > >
> > >  RTE_FLOW_ACTION_TYPE_FLAG /* Flag packets processed by flow rule.
> > > */
> > >
> > >  RTE_FLOW_ACTION_TYPE_MARK /* Attach a 32 bit value to a packet. */
> > >
> > > I guess the result of the FLAG action would be something in ol_flag.
> > >
> > [Sugesh] This looks fine for me.
> 
> Great, I will update the specification accordingly.
[Sugesh] Thank you!
> 
> > > Thoughts?
> > >
> > [Sugesh] Two more queries that I missed in my earlier comments:
> > Support for PTYPE: Intel NICs can report the packet type in the mbuf.
> > This can be used by software for packet processing. Is the generic API
> > capable of handling that as well?
> 
> Yes, however no PTYPE action has been defined for this (yet). It is only a
> matter of adding one.
[Sugesh] Thank you for confirming. It's fine for me.
> 
> Currently packet type recognition is enabled per port using a separate API, so
> correct me if I'm wrong but I am not aware of any adapter with the ability to
> enable it per flow rule, so I do not think such an action needs to be defined
> from the start. We may add it later.
> 
> > RSS hashing support: Just to confirm, the RSS flow action allows the
> > application to decide which header fields produce the hash. This
> > gives programmability on load sharing across different queues. With
> > this, the application can program the NIC to calculate the RSS hash
> > using MAC only, MAC + IP, or IP only.
> 
> I'd say yes but from your summary, I'm not sure we share the same idea of
> what the RSS action is supposed to do, so here is mine.
> 
> Like all flow rules, the pattern part of the RSS action only filters the packets
> on which the action will be performed.
> 
> The rss_conf parameter (struct rte_eth_rss_conf) only provides a key and a
> RSS hash function to use (ETH_RSS_IPV4, ETH_RSS_NONFRAG_IPV6_UDP,
> etc).
> 
> Nothing prevents the RSS hash function from being applied to protocol
> headers which are not necessarily present in the flow rule pattern. These are
> two independent things, e.g. you could have a pattern matching IPv4 packets
> yet perform RSS hashing only on UDP headers.
> 
> Finally, the RSS action configuration only affects packets coming from this
> flow rule. It is not performed on the device globally so packets which are not
> matched are not affected by RSS processing. As a result it might not be
> possible to configure two flow rules specifying incompatible RSS actions
> simultaneously if the underlying device supports only a single global RSS
> context.
> 
[Sugesh] Thank you for the explanation. This means I can have a rule that matches on
every incoming packet (an all-field wildcard rule) and does RSS hashing on selected
fields: MAC only, IP only, or IP & MAC? This can be useful for doing a packet
lookup in software using only the hash.
> Are we on the same page?
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-05 18:16 Adrien Mazarguil
                   ` (3 preceding siblings ...)
  2016-07-11 10:41 ` Jerin Jacob
@ 2016-07-21  8:13 ` Rahul Lakkireddy
  2016-07-21 17:07   ` Adrien Mazarguil
  4 siblings, 1 reply; 54+ messages in thread
From: Rahul Lakkireddy @ 2016-07-21  8:13 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Wenzhuo Lu, Jan Medala, John Daley, Jing Chen,
	Konstantin Ananyev, Matej Vido, Alejandro Lucero, Sony Chacko,
	Jerin Jacob, Pablo de Lara, Olga Shern
  Cc: Kumar Sanghvi, Nirranjan Kirubaharan, Indranil Choudhury

Hi Adrien,

The proposal looks very good.  It satisfies most of the features
supported by Chelsio NICs.  We are looking for suggestions on exposing
the additional features supported by Chelsio NICs via this API.

Chelsio NICs have two regions in which filters can be placed -
maskfull and maskless regions.  As their names imply, the maskfull region
can accept masks to match a range of values, whereas the maskless region
doesn't accept any masks and hence performs strict exact-matches.
Filters without masks can also be placed in the maskfull region.  By
default, the maskless region has higher priority over the maskfull region.
However, the priority between the two regions is configurable.

Please suggest how we can let apps configure, via this API, in which
region filters must be placed and set the corresponding priority
accordingly.

More comments below.

On Tuesday, July 07/05/16, 2016 at 20:16:46 +0200, Adrien Mazarguil wrote:
> Hi All,
> 
[...]

> 
> ``ETH``
> ^^^^^^^
> 
> Matches an Ethernet header.
> 
> - ``dst``: destination MAC.
> - ``src``: source MAC.
> - ``type``: EtherType.
> - ``tags``: number of 802.1Q/ad tags defined.
> - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> 
>  - ``tpid``: Tag protocol identifier.
>  - ``tci``: Tag control information.
> 
> ``IPV4``
> ^^^^^^^^
> 
> Matches an IPv4 header.
> 
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tos``: ToS/DSCP field.
> - ``ttl``: TTL field.
> - ``proto``: protocol number for the next layer.
> 
> ``IPV6``
> ^^^^^^^^
> 
> Matches an IPv6 header.
> 
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tc``: traffic class field.
> - ``nh``: Next header field (protocol).
> - ``hop_limit``: hop limit field (TTL).
> 
> ``ICMP``
> ^^^^^^^^
> 
> Matches an ICMP header.
> 
> - TBD.
> 
> ``UDP``
> ^^^^^^^
> 
> Matches a UDP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - ``length``: UDP length.
> - ``checksum``: UDP checksum.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``TCP``
> ^^^^^^^
> 
> Matches a TCP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - All other TCP fields and bits.
> 
> ``VXLAN``
> ^^^^^^^^^
> 
> Matches a VXLAN header.
> 
> - TBD.
> 

In addition to the above matches, Chelsio NICs have some additional
features:

- Match based on unicast DST-MAC, multicast DST-MAC, broadcast DST-MAC.
  Also, there is a match criterion available called 'promisc', which
  matches packets that are not destined for the interface but have
  been received by the hardware due to the interface being in
  promiscuous mode.

- Match FCoE packets.

- Match IP Fragmented packets.

- Match range of physical ports on the NIC in a single rule via masks.
  For ex: match all UDP packets coming on ports 3 and 4 out of 4
  ports available on the NIC.

- Match range of Physical Functions (PFs) on the NIC in a single rule
  via masks. For ex: match all traffic coming on several PFs.

Please suggest how we can expose the above features to DPDK apps via
this API.
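
One way we could imagine it (purely a sketch on our side, all names
hypothetical and not a proposal) is meta pattern items along these lines:

enum rte_flow_item_type_ext {
	RTE_FLOW_ITEM_TYPE_PROMISC,   /* Not destined to this interface. */
	RTE_FLOW_ITEM_TYPE_FCOE,      /* FCoE traffic. */
	RTE_FLOW_ITEM_TYPE_IPV4_FRAG, /* Fragmented IPv4 packets. */
};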

[...]

> 
> Actions
> ~~~~~~~
> 
> Each possible action is represented by a type. Some have associated
> configuration structures. Several actions combined in a list can be affected
> to a flow rule. That list is not ordered.
> 
> At least one action must be defined in a filter rule in order to do
> something with matched packets.
> 
> - Actions are defined with ``struct rte_flow_action``.
> - A list of actions is defined with ``struct rte_flow_actions``.
> 
> They fall in three categories:
> 
> - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
>   processing matched packets by subsequent flow rules, unless overridden
>   with PASSTHRU.
> 
> - Non terminating actions (PASSTHRU, DUP) that leave matched packets up for
>   additional processing by subsequent flow rules.
> 
> - Other non terminating meta actions that do not affect the fate of packets
>   (END, VOID, ID, COUNT).
> 
> When several actions are combined in a flow rule, they should all have
> different types (e.g. dropping a packet twice is not possible). However
> considering the VOID type is an exception to this rule, the defined behavior
> is for PMDs to only take into account the last action of a given type found
> in the list. PMDs still perform error checking on the entire list.
> 
> *Note that PASSTHRU is the only action able to override a terminating rule.*
> 

Chelsio NICs can support an action 'switch' which can re-direct
matched packets from one port to another port in hardware.  In addition,
it can also optionally:

1. Perform header rewrites (src-mac/dst-mac rewrite, src-mac/dst-mac
   swap, vlan add/remove/rewrite).

2. Perform NAT'ing in hardware (4-tuple rewrite).

before sending it out on the wire [1].

To meet the above requirements, we'd need a way to pass sub-actions
to action 'switch' and a way to pass extra info (such as new
src-mac/dst-mac, new vlan, new 4-tuple for NAT) to rewrite
corresponding fields.

We're looking for suggestions on how we can achieve action 'switch'
in this new API.

From our understanding of this API, we could just expand
rte_flow_action_type with an additional action type
RTE_FLOW_ACTION_TYPE_SWITCH and define several sub-actions such as:

enum rte_flow_action_switch_type {
	RTE_FLOW_ACTION_SWITCH_TYPE_NONE,
	RTE_FLOW_ACTION_SWITCH_TYPE_MAC_REWRITE,
	RTE_FLOW_ACTION_SWITCH_TYPE_MAC_SWAP,
	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_INSERT,
	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_DELETE,
	RTE_FLOW_ACTION_SWITCH_TYPE_VLAN_REWRITE,
	RTE_FLOW_ACTION_SWITCH_TYPE_NAT_REWRITE,
};

and then define an rte_flow_action_switch as follows:

struct rte_flow_action_switch {
	enum rte_flow_action_switch_type type; /* Sub-actions to perform */
	uint16_t port; /* Destination physical port to switch packet */
	enum rte_flow_item_type item_type; /* Fields to rewrite */
	const void *switch_spec;
	/* Holds info to rewrite matched flows */
};

Does the above approach sound right with respect to this new API?

[...]

> 
> ``COUNT``
> ^^^^^^^^^
> 
> Enables hits counter for this rule.
> 
> This counter can be retrieved and reset through ``rte_flow_query()``, see
> ``struct rte_flow_query_count``.
> 
> - Counters can be retrieved with ``rte_flow_query()``.
> - No configurable property.
> 
> +---------------+
> | COUNT         |
> +===============+
> | no properties |
> +---------------+
> 
> Query structure to retrieve and reset the flow rule hits counter:
> 
> +------------------------------------------------+
> | COUNT query                                    |
> +===========+=====+==============================+
> | ``reset`` | in  | reset counter after query    |
> +-----------+-----+------------------------------+
> | ``hits``  | out | number of hits for this flow |
> +-----------+-----+------------------------------+
> 

Chelsio NICs can also count the number of bytes that hit the rule.
So, we also need a "bytes" counter.
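
For instance (just a sketch on our side, the field layout is not part of
the RFC), the query structure could be extended like this:

#include <stdint.h>

/* Hypothetical extension of the COUNT query with a byte counter. */
struct rte_flow_query_count {
	uint32_t reset:1; /* Reset counter after query. */
	uint64_t hits;    /* Number of packets that hit this rule. */
	uint64_t bytes;   /* Number of bytes that hit this rule. */
};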

[...]

[1]  http://www.dpdk.org/ml/archives/dev/2016-February/032605.html

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-20 10:41           ` Adrien Mazarguil
@ 2016-07-21  3:18             ` Lu, Wenzhuo
  2016-07-21 12:47               ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Lu, Wenzhuo @ 2016-07-21  3:18 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Adrien,

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, July 20, 2016 6:41 PM
> To: Lu, Wenzhuo
> Cc: dev@dpdk.org; Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody;
> Ajit Khaparde; Rahul Lakkireddy; Jan Medala; John Daley; Chen, Jing D; Ananyev,
> Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> Guarch, Pablo; Olga Shern
> Subject: Re: [RFC] Generic flow director/filtering/classification API
> 
> Hi Wenzhuo,
> 
> On Wed, Jul 20, 2016 at 02:16:51AM +0000, Lu, Wenzhuo wrote:
> [...]
> > > So, today an application cannot combine N-tuple and FDIR flow rules
> > > and get a reliable outcome, unless it is designed for specific
> > > devices with a known behavior.
> > >
> > > > What's the right behavior of the PMD if the APP wants to create a flow
> > > > director rule with a priority higher than or equal to an existing
> > > > n-tuple rule? Should the PMD return failure?
> > >
> > > First remember applications only deal with the generic API, PMDs are
> > > responsible for choosing the most appropriate HW implementation to
> > > use according to the requested flow rules (FDIR, N-tuple or anything else).
> > >
> > > For the specific case of FDIR vs N-tuple, if the underlying HW
> > > supports both I do not see why the PMD would create a N-tuple rule.
> > > Doesn't FDIR support everything N-tuple can do and much more?
> > Talking about the filters, FDIR can cover N-tuple. I think that's why i40e
> > only supports FDIR but not N-tuple. But N-tuple has its own highlight: as we
> > know, at least on Intel NICs, FDIR only supports a per-device mask, but
> > N-tuple can support per-rule masks.
> > As every pattern has both a spec and a mask, we cannot guarantee the masks
> > are the same. I think ixgbe will try to use N-tuple first if it can, because
> > even if the masks are different, we can support them all.
> 
> OK, makes sense. In that case existing rules may indeed prevent subsequent
> ones from getting created if their priority is wrong. I do not think there is a way
> around that if the application needs this exact ordering.
Agreed. I don't see any workaround either. The PMD has to return failure sometimes.

> 
> > > Assuming such a thing happened anyway, that the PMD had to create a
> > > rule using a high priority filter type and that the application
> > > requests the creation of a rule that can only be done using a lower
> > > priority filter type, but also requested a higher priority for that rule, then yes,
> it should obviously fail.
> > >
> > > That is, unless the PMD can perform some kind of workaround to have both.
> > >
> > > > If so, do we need more fail reasons? According to this RFC, I
> > > > think we need
> > > return " EEXIST: collision with an existing rule. ", but it's not
> > > very clear, APP doesn't know the problem is priority, maybe more detailed
> reason is helpful.
> > >
> > > Possibly, I've defined a basic set of errors, there are quite a
> > > number of errno values to choose from. However I think we should not
> define too many values.
> > > In my opinion the basic set covers every possible failure:
> > >
> > > - EINVAL: invalid format, rule is broken or cannot be understood by the PMD
> > >   anyhow.
> > >
> > > - ENOTSUP: pattern/actions look fine but something in the requested rule is
> > >   not supported and thus cannot be applied.
> > >
> > > - EEXIST: pattern/actions are fine and could have been applied if only some
> > >   other rule did not prevent the PMD to do it (I see it as the closest thing
> > >   to "ETOOBAD" which unfortunately does not exist).
> > >
> > > - ENOMEM: like EEXIST, except it is due to the lack of resources not because
> > >   of another rule. I wasn't sure which of ENOMEM or ENOSPC was better but
> > >   settled on ENOMEM as it is well known. Still open to debate.
> > >
> > > Errno values are only useful to get a rough idea of the reason, and
> > > another mechanism is needed to pinpoint the exact problem for
> > > debugging/reporting purposes, something like:
> > >
> > >  enum rte_flow_error_type {
> > >      RTE_FLOW_ERROR_TYPE_NONE,
> > >      RTE_FLOW_ERROR_TYPE_UNKNOWN,
> > >      RTE_FLOW_ERROR_TYPE_PRIORITY,
> > >      RTE_FLOW_ERROR_TYPE_PATTERN,
> > >      RTE_FLOW_ERROR_TYPE_ACTION,
> > >  };
> > >
> > >  struct rte_flow_error {
> > >      enum rte_flow_error_type type;
> > >      void *offset; /* Points to the exact pattern item or action. */
> > >      const char *message;
> > >  };
> > When we are using a CLI and it fails, normally it will let us know
> > which parameter is not appropriate. So, I think it’s a good idea to
> > have this error structure :)
> 
> Agreed.
> 
> > > Then either provide an optional struct rte_flow_error pointer to
> > > rte_flow_validate(), or a separate function (rte_flow_analyze()?),
> > > since processing this may be quite expensive and applications may
> > > not care about the exact reason.
> > I agree the processing may be too expensive. Maybe we can say it's optional
> > to return error details. And that's a good question, what the APP should do
> > if creating the rule fails. I believe normally it will choose to handle the
> > rule by itself. But I think it's not bad to feed back more. And even if the
> > APP wants to adjust the rules, it cannot do so for lack of info.
> 
> All right then, I'll add it to the specification.
> 
>  int
>  rte_flow_validate(uint8_t port_id,
>                    const struct rte_flow_pattern *pattern,
>                    const struct rte_flow_actions *actions,
>                    struct rte_flow_error *error);
> 
> With error possibly NULL if the application does not care. Is it fine for you?
Yes, it looks good to me. Thanks for that :)

> 
> [...]
> > > > > > > - PMDs, not applications, are responsible for maintaining flow rules
> > > > > > >   configuration when stopping and restarting a port or performing
> other
> > > > > > >   actions which may affect them. They can only be destroyed explicitly.
> > > > > > Don’t understand " They can only be destroyed explicitly."
> > > > >
> > > > > This part says that as long as an application has not called
> > > > > rte_flow_destroy() on a flow rule, it never disappears, whatever
> > > > > happens to the port (stopped, restarted). The application is not
> > > > > responsible for re-creating rules after that.
> > > > >
> > > > > Note that according to the specification, this may translate to
> > > > > not being able to stop a port as long as a flow rule is present,
> > > > > depending on how nice the PMD intends to be with applications.
> > > > > Implementation can be done in small steps with minimal amount of
> > > > > code on
> > > the PMD side.
> > > > Does it mean the PMD should store and maintain all the rules? Why not
> > > > let RTE do that? I think if the PMD maintains all the rules, every
> > > > kind of NIC needs its own copy of code for the rules. But if RTE does
> > > > that, only one copy of code needs to be maintained, right?
> > >
> > > I've considered having rules stored in a common format understood at
> > > the RTE level and not specific to each PMD and decided that the
> > > opaque rte_flow pointer was a better choice for the following reasons:
> > >
> > > - Even though flow rules management is done in the control path, processing
> > >   must be as fast as possible. Letting PMDs store flow rules using their own
> > >   internal representation gives them the chance to achieve better
> > >   performance.
> > I don't quite understand. I think we're talking about maintaining the rules in
> > SW. I don't think anything needs to be optimized according to specific NICs. If
> > we need to optimize the code, I think we need to consider the CPU, OS ... and
> > some common means. Am I wrong?
> 
> Perhaps we were talking about different things, here I was only explaining why
> rte_flow (the result of creating a flow rule) should be opaque and fully managed
> by the PMD. More on the SW side of things below.
> 
> > > - An opaque context managed by PMDs would probably have to be stored
> > >   somewhere as well anyway.
> > >
> > > - PMDs may not need to allocate/store anything at all if they exclusively
> > >   rely on HW state for everything. In my opinion, the generic API has enough
> > >   constraints for this to work and maintain consistency between flow
> > >   rules. Note this is currently how most PMDs implement FDIR and other
> > >   filter types.
> > Yes, the rules are stored by HW. But considering stopping/starting the device,
> > the rules in HW will be lost. We have to store the rules in SW and re-program
> > them when restarting the device.
> 
> Assume a HW capable of keeping flow rules programmed even during a
> stop/start cycle (e.g. mlx4/mlx5 may be able to do it from DPDK point of view),
> don't you think it is more efficient to standardize on this behavior and let PMDs
> restore flow rules for HW that do not support it regardless of whether it would
> be done by RTE or the application (SW)?
I didn't know that. As some NICs already have the ability to keep the rules during a stop/start cycle, maybe it could be a trend :)

> 
> > And in the existing code, we store the filters in SW, at least on Intel NICs.
> > But I think we cannot reuse them: considering the priority and which category
> > of filter should be chosen, I think we need a whole new table for the generic
> > API. I think that's what's designed now, right?
> 
> So I understand you'd want RTE to help your PMD keep track of the flow rules it
> created?
Yes. But as you said before, it’s not a good idea for mlx4/mlx5, because their HW doesn't need SW to re-program the rules after stopping/starting. If we make it a common mechanism, it just wastes time for mlx4/mlx5.

> 
> Nothing wrong with that, all I'm saying is that it should be entirely optional. RTE
> should not automatically maintain a list. PMDs have to call RTE helpers if they
> need help to maintain a context. These helpers are not defined in this API yet
> because it is difficult to know what will be useful in advance.
> 
> > > - RTE can (and will) provide helpers to avoid most of the code redundancy,
> > >   PMDs are free to use them or manage everything by themselves.
> > >
> > > - Given that the opaque rte_flow pointer associated with a flow rule is to
> > >   be stored by the application, PMDs do not even have to keep references to
> > >   them.
> > I don't understand. More details?
> 
> In an application:
> 
>  rte_flow *foo = rte_flow_create(...);
> 
> In the above example, foo cannot be dereferenced by the application nor RTE,
> only the PMD is aware of its contents. This object can only be used with
> rte_flow*() functions.
> 
> PMDs are thus free to make this object grow as needed when adding internal
> features without breaking any kind of public API/ABI.
> 
> What I meant is, given that the application is supposed to store foo somewhere
> in order to destroy it later, the PMD does not have to keep track of that pointer
> assuming it does not need to access it later on its own for some reason.
> 
> > > - The flow rules format described in this specification (pattern / actions)
> > >   will be used by applications directly, and will be free to arrange them in
> > >   lists, trees or in any other way if they need to keep flow specifications
> > >   around for further processing.
> > Who will create the lists, trees or something else? According to the previous
> > discussion, I think the APP will program the rules one by one. So if the APP
> > organizes the rules into lists, trees..., the PMD doesn't know that.
> > And you said "Given that the opaque rte_flow pointer associated with a flow
> > rule is to be stored by the application". I'm lost here.
> 
> I guess that's because we're discussing two different things, flow rule
> specifications and flow rule objects. Let me sum it up:
> 
> - Flow rule specifications are the patterns/actions combinations provided by
>   applications to rte_flow_create(). Applications can store those as needed
>   and organize them as they wish (hash, tree, list). Neither PMDs nor RTE
>   will do it for them.
> 
> - Flow rule objects (struct rte_flow *) are generated when a flow rule is
>   created. Applications must keep these around if they want to manipulate
>   them later (i.e. destroy or query existing rules).
Thanks for this clarification. So the specifications can be different from the objects, right? The specifications are what the APP wants; the objects are what the APP really gets, as rte_flow_create() can fail. Right?

> 
> Then PMDs *may* need to keep and arrange flow rule objects internally for
> management purposes. Could be because HW requires it, detecting conflicting
> rules, managing priorities and so on. Possible reasons are not described in this
> API because these are thought as PMD-specific needs.
Got it.

> 
> > > > When the port is stopped and restarted, rte can reconfigure the
> > > > rules. Is the
> > > concern that PMD may adjust the sequence of the rules according to
> > > the priority, so every NIC has a different list of rules? But PMD
> > > can adjust them again when rte reconfiguring the rules.
> > >
> > > What about PMDs able to stop and restart ports without destroying
> > > their own flow rules? If we assume flow rules must be destroyed when
> > > stopping a port, these PMDs are needlessly penalized with slower
> > > stop/start cycles. Think about it assuming thousands of flow rules.
> > I believe the rules maintained by SW should not be destroyed, because they're
> > used to re-program the device when it starts again.
> 
> Do we agree that applications should not care? Flow rules configured before
> stopping a port must still be there after restarting it.
Yes, agree.

> 
> What we seem to not agree about is that you think RTE should be responsible
> for restoring flow rules of devices that lose them when stopped. I think doing so
> is unfair to devices for which it is not the case and not really nice to applications,
> so my opinion is that the PMD is responsible for restoring flow rules however it
> wants. It is free to use RTE helpers to keep track of them, as long as it's all
> managed internally.
What I think is that RTE could store the flow rules and recreate them after restarting, in a function like rte_dev_start(), so the APP knows nothing about it. But according to the discussion above, I think the design doesn't support it, right?
RTE doesn't store the flow rule objects, and even if it stored them, there's no way designed to re-program the objects. Also considering some HW doesn't need to be re-programmed, I think it's OK to let the PMD maintain the rules, as the re-programming is a NIC-specific requirement.

> 
> > > Thus from an application point of view, whatever happens when
> > > stopping and restarting a port should not matter. If a flow rule was
> > > present before, it must still be present afterwards. If the PMD had
> > > to destroy flow rules and re-create them, it does not actually matter if they
> differ slightly at the HW level, as long as:
> > >
> > > - Existing opaque flow rule pointers (rte_flow) are still valid to the PMD
> > >   and refer to the same rules.
> > >
> > > - The overall behavior of all rules is the same.
> > >
> > > The list of rules you think of (patterns / actions) is maintained by
> > > applications (not RTE), and only if they need them. RTE would needlessly
> duplicate this.
> > As I said before, I need more details to understand this. Maybe an example
> > is better :)
> 
> The generic format both RTE and applications might understand is the one
> described in this API (struct rte_flow_pattern and struct rte_flow_actions).
> 
> If we wanted RTE to maintain some sort of per-port state for flow rule
> specifications, it would have to be a copy of these structures arranged somehow
> (list or something else).
> 
> If we consider that PMDs need to keep a context object associated to a flow
> rule (the opaque struct rte_flow *), then RTE would most likely have to store it
> along with the flow specification.
> 
> Such a list may not be useful to applications (list lookups take time), so they
> would implement their own redundant method. They might also require extra
> room to attach some application context to flow rules. A generic list cannot plan
> for it.
> 
> Applications know what they want to do with flow rules and are responsible for
> managing them efficiently with RTE out of the way.
> 
> I'm not sure if this answered your question, if not, please describe a scenario
> where a RTE-managed list of flow rules would be mandatory.
Got your point and agree :)

> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-20 16:32                 ` Chandran, Sugesh
@ 2016-07-20 17:10                   ` Adrien Mazarguil
  2016-07-21 11:06                     ` Chandran, Sugesh
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-20 17:10 UTC (permalink / raw)
  To: Chandran, Sugesh
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

Hi Sugesh,

Please see below.

On Wed, Jul 20, 2016 at 04:32:50PM +0000, Chandran, Sugesh wrote:
[...]
> > > How about a hardware flow flag in the packet descriptor that is set when a
> > > packet hits any hardware rule? This way software isn't blocked by a
> > > hardware rule. Even though there is an additional overhead of validating
> > > this flag, the software datapath can identify hardware-processed packets
> > > easily.
> > > This way the packets traverse the software fallback path until the rule
> > > configuration is complete. This flag avoids setting an ID action for every
> > > hardware flow being configured.
> > 
> > That makes sense. I see it as a sort of single bit ID but it could be
> > implemented through a different action for less capable devices. PMDs that
> > support 32 bit IDs could reuse the same code for both features.
> > 
> > I understand you'd prefer having this feature always present, however we
> > already know that not all PMDs/devices support it, and like everything else
> > this is a kind of offload that needs to be explicitly requested by the
> > application as it may not be needed.
> > 
> > If we go with the separate action, then perhaps it would make sense to
> > rename "ID" to "MARK" to make things clearer:
> > 
> >  RTE_FLOW_ACTION_TYPE_FLAG /* Flag packets processed by flow rule. */
> > 
> >  RTE_FLOW_ACTION_TYPE_MARK /* Attach a 32 bit value to a packet. */
> > 
> > I guess the result of the FLAG action would be something in ol_flag.
> > 
> [Sugesh] This looks fine for me.

Great, I will update the specification accordingly.
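
For illustration, receive-side usage might look as follows (the flag name
and the helpers are hypothetical since nothing is defined yet):

 /* PKT_RX_FLOW_FLAG would be set by the PMD on packets that hit a flow
  * rule carrying RTE_FLOW_ACTION_TYPE_FLAG. */
 if (m->ol_flags & PKT_RX_FLOW_FLAG)
     handle_offloaded(m);   /* already processed by a HW rule */
 else
     handle_in_software(m); /* e.g. rule still being configured */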

> > Thoughts?
> > 
> [Sugesh] Two more queries that I missed in my earlier comments:
> Support for PTYPE: Intel NICs can report the packet type in the mbuf.
> This can be used by software for packet processing. Is the generic API
> capable of handling that as well?

Yes, however no PTYPE action has been defined for this (yet). It is only a
matter of adding one.

Currently packet type recognition is enabled per port using a separate API,
so correct me if I'm wrong but I am not aware of any adapter with the
ability to enable it per flow rule, so I do not think such an action needs
to be defined from the start. We may add it later.

> RSS hashing support: Just to confirm, the RSS flow action allows the
> application to decide which header fields produce the hash. This gives
> programmability on load sharing across different queues. With this, the
> application can program the NIC to calculate the RSS hash using MAC only,
> MAC + IP, or IP only.

I'd say yes but from your summary, I'm not sure we share the same idea of
what the RSS action is supposed to do, so here is mine.

Like all flow rules, the pattern part of the RSS action only filters the
packets on which the action will be performed.

The rss_conf parameter (struct rte_eth_rss_conf) only provides a key and a
RSS hash function to use (ETH_RSS_IPV4, ETH_RSS_NONFRAG_IPV6_UDP, etc).

Nothing prevents the RSS hash function from being applied to protocol
headers which are not necessarily present in the flow rule pattern. These
are two independent things, e.g. you could have a pattern matching IPv4
packets yet perform RSS hashing only on UDP headers.

Finally, the RSS action configuration only affects packets coming from this
flow rule. It is not performed on the device globally so packets which are
not matched are not affected by RSS processing. As a result it might not be
possible to configure two flow rules specifying incompatible RSS actions
simultaneously if the underlying device supports only a single global RSS
context.
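
To make this concrete, a hedged sketch of the configuration part (it reuses
the existing struct rte_eth_rss_conf; the surrounding pattern/action setup
is elided):

 #include <rte_ethdev.h>

 /* RSS action configuration for a rule whose pattern matches IPv4 while
  * hashing is performed on IPv4/UDP fields; a NULL key lets the PMD use
  * its default RSS key. */
 struct rte_eth_rss_conf rss_conf = {
     .rss_key = NULL,
     .rss_hf = ETH_RSS_NONFRAG_IPV4_UDP,
 };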

Are we on the same page?

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-18 15:00               ` Adrien Mazarguil
@ 2016-07-20 16:32                 ` Chandran, Sugesh
  2016-07-20 17:10                   ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-20 16:32 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

Hi Adrien,

Sorry for the late reply.
Snipped out the parts we agreed on.

Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Monday, July 18, 2016 4:00 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>;
> Chilikin, Andrey <andrey.chilikin@intel.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> On Mon, Jul 18, 2016 at 01:26:09PM +0000, Chandran, Sugesh wrote:
> > Hi Adrien,
> > Thank you for getting back on this.
> > Please find my comments below.
> 
> Hi Sugesh,
> 
> Same for me, removed again the parts we agree on.
> 
> [...]
> > > For your above example, the application cannot assume a rule is
> > > added/deleted as long as the PMD has not completed the related
> > > operation, which means keeping the SW rule/fallback in place in the
> > > meantime. Should handle security concerns as long as after removing
> > > a rule, packets end up in a default queue entirely processed by SW.
> > > Obviously this may worsen response time.
> > >
> > > The ID action can help with this. By knowing which rule a received
> > > packet is associated with, processing can be temporarily offloaded
> > > by another thread without much complexity.
> > [Sugesh] Setting an ID for every flow may not be viable, especially when the
> > size of the ID is small (only 8 bits). I am not sure if this is a valid case though.
> 
> Agreed, I'm not saying this solution works for all devices, particularly those
> that do not support ID at all.
> 
> > How about a hardware flow flag in the packet descriptor that is set when a
> > packet hits any hardware rule? This way software isn't blocked by a
> > hardware rule. Even though there is an additional overhead of validating
> > this flag, the software datapath can identify hardware-processed packets
> > easily.
> > This way the packets traverse the software fallback path until the rule
> > configuration is complete. This flag avoids setting an ID action for every
> > hardware flow being configured.
> 
> That makes sense. I see it as a sort of single bit ID but it could be
> implemented through a different action for less capable devices. PMDs that
> support 32 bit IDs could reuse the same code for both features.
> 
> I understand you'd prefer having this feature always present, however we
> already know that not all PMDs/devices support it, and like everything else
> this is a kind of offload that needs to be explicitly requested by the
> application as it may not be needed.
> 
> If we go with the separate action, then perhaps it would make sense to
> rename "ID" to "MARK" to make things clearer:
> 
>  RTE_FLOW_ACTION_TYPE_FLAG /* Flag packets processed by flow rule. */
> 
>  RTE_FLOW_ACTION_TYPE_MARK /* Attach a 32 bit value to a packet. */
> 
> I guess the result of the FLAG action would be something in ol_flag.
> 
[Sugesh] This looks fine for me.
> Thoughts?
> 
[Sugesh] Two more queries that I missed in my earlier comments:
Support for PTYPE: Intel NICs can report the packet type in the mbuf.
This can be used by software for packet processing. Is the generic API
capable of handling that as well?
RSS hashing support: Just to confirm, the RSS flow action allows the
application to decide which header fields produce the hash. This gives
programmability on load sharing across different queues. With this, the
application can program the NIC to calculate the RSS hash using MAC only,
MAC + IP, or IP only.


> > > I think applications have to implement SW fallbacks all the time, as
> > > even some sort of guarantee on the flow rule processing time may not
> > > be enough to avoid misdirected packets and related security issues.
> > [Sugesh] The software fallback will always be there. However I am a little
> > bit confused about how software is going to identify the packets that have
> > already been hardware-processed. I feel we need some notification in the
> > packet itself when a hardware rule hits. ID/flag/any other options?
> 
> Yeah I think so too, as long as it is optional because we cannot assume all
> PMDs will support it.
> 
> > > Let's wait for applications to start using this API and then
> > > consider an extra set of asynchronous / real-time functions when the
> > > need arises. It should not impact the way rules are specified
> > [Sugesh] Sure. I think the rule definition may not be impacted by this.
> 
> Thanks for your comments.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-20  2:16         ` Lu, Wenzhuo
@ 2016-07-20 10:41           ` Adrien Mazarguil
  2016-07-21  3:18             ` Lu, Wenzhuo
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-20 10:41 UTC (permalink / raw)
  To: Lu, Wenzhuo
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Wenzhuo,

On Wed, Jul 20, 2016 at 02:16:51AM +0000, Lu, Wenzhuo wrote:
[...]
> > So, today an application cannot combine N-tuple and FDIR flow rules and get a
> > reliable outcome, unless it is designed for specific devices with a known
> > behavior.
> > 
> > > What's the right behavior of the PMD if the APP wants to create a flow director
> > > rule with a priority higher than or equal to an existing n-tuple rule? Should
> > > the PMD return failure?
> > 
> > First remember applications only deal with the generic API, PMDs are
> > responsible for choosing the most appropriate HW implementation to use
> > according to the requested flow rules (FDIR, N-tuple or anything else).
> > 
> > For the specific case of FDIR vs N-tuple, if the underlying HW supports both I do
> > not see why the PMD would create a N-tuple rule. Doesn't FDIR support
> > everything N-tuple can do and much more?
> Talking about the filters, FDIR can cover N-tuple. I think that's why i40e only supports FDIR but not N-tuple. But N-tuple has its own highlight: as we know, at least on Intel NICs, FDIR only supports a per-device mask, but N-tuple can support per-rule masks.
> As every pattern has both a spec and a mask, we cannot guarantee the masks are the same. I think ixgbe will try to use N-tuple first if it can, because even if the masks are different, we can support them all.

OK, makes sense. In that case existing rules may indeed prevent subsequent
ones from getting created if their priority is wrong. I do not think there
is a way around that if the application needs this exact ordering.

> > Assuming such a thing happened anyway, that the PMD had to create a rule
> > using a high priority filter type and that the application requests the creation of a
> > rule that can only be done using a lower priority filter type, but also requested a
> > higher priority for that rule, then yes, it should obviously fail.
> > 
> > That is, unless the PMD can perform some kind of workaround to have both.
> > 
> > > If so, do we need more fail reasons? According to this RFC, I think we need
> > return " EEXIST: collision with an existing rule. ", but it's not very clear, APP
> > doesn't know the problem is priority, maybe more detailed reason is helpful.
> > 
> > Possibly, I've defined a basic set of errors, there are quite a number of errno
> > values to choose from. However I think we should not define too many values.
> > In my opinion the basic set covers every possible failure:
> > 
> > - EINVAL: invalid format, rule is broken or cannot be understood by the PMD
> >   anyhow.
> > 
> > - ENOTSUP: pattern/actions look fine but something in the requested rule is
> >   not supported and thus cannot be applied.
> > 
> > - EEXIST: pattern/actions are fine and could have been applied if only some
> >   other rule did not prevent the PMD to do it (I see it as the closest thing
> >   to "ETOOBAD" which unfortunately does not exist).
> > 
> > - ENOMEM: like EEXIST, except it is due to the lack of resources not because
> >   of another rule. I wasn't sure which of ENOMEM or ENOSPC was better but
> >   settled on ENOMEM as it is well known. Still open to debate.
> > 
> > Errno values are only useful to get a rough idea of the reason, and another
> > mechanism is needed to pinpoint the exact problem for debugging/reporting
> > purposes, something like:
> > 
> >  enum rte_flow_error_type {
> >      RTE_FLOW_ERROR_TYPE_NONE,
> >      RTE_FLOW_ERROR_TYPE_UNKNOWN,
> >      RTE_FLOW_ERROR_TYPE_PRIORITY,
> >      RTE_FLOW_ERROR_TYPE_PATTERN,
> >      RTE_FLOW_ERROR_TYPE_ACTION,
> >  };
> > 
> >  struct rte_flow_error {
> >      enum rte_flow_error_type type;
> >      void *offset; /* Points to the exact pattern item or action. */
> >      const char *message;
> >  };
> When we are using a CLI and it fails, normally it will let us know which parameter is not appropriate. So, I think it’s a good idea to have this error structure :)

Agreed.

> > Then either provide an optional struct rte_flow_error pointer to
> > rte_flow_validate(), or a separate function (rte_flow_analyze()?), since
> > processing this may be quite expensive and applications may not care about the
> > exact reason.
> I agree the processing may be too expensive. Maybe we can say it's optional to return error details. And that's a good question, what the APP should do if creating the rule fails. I believe normally it will choose to handle the rule by itself. But I think it's not bad to feed back more. And even if the APP wants to adjust the rules, it cannot do so for lack of info.

All right then, I'll add it to the specification.

 int
 rte_flow_validate(uint8_t port_id,
                   const struct rte_flow_pattern *pattern,
                   const struct rte_flow_actions *actions,
                   struct rte_flow_error *error);

With error possibly NULL if the application does not care. Is it fine for
you?
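
For example (a sketch assuming a negative return value on failure, with
port_id, pattern and actions built beforehand):

 struct rte_flow_error error = { .type = RTE_FLOW_ERROR_TYPE_NONE };

 if (rte_flow_validate(port_id, &pattern, &actions, &error) < 0)
     /* error.type tells which part failed, error.offset points to the
      * exact pattern item or action, error.message may elaborate */
     printf("rule cannot be applied: %s\n",
            error.message ? error.message : "unknown reason");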

[...]
> > > > > > - PMDs, not applications, are responsible for maintaining flow rules
> > > > > >   configuration when stopping and restarting a port or performing other
> > > > > >   actions which may affect them. They can only be destroyed explicitly.
> > > > > Don’t understand " They can only be destroyed explicitly."
> > > >
> > > > This part says that as long as an application has not called
> > > > rte_flow_destroy() on a flow rule, it never disappears, whatever
> > > > happens to the port (stopped, restarted). The application is not
> > > > responsible for re-creating rules after that.
> > > >
> > > > Note that according to the specification, this may translate to not
> > > > being able to stop a port as long as a flow rule is present,
> > > > depending on how nice the PMD intends to be with applications.
> > > > Implementation can be done in small steps with minimal amount of code on
> > the PMD side.
> > > Does it mean the PMD should store and maintain all the rules? Why not let RTE do
> > > that? I think if the PMD maintains all the rules, every kind of NIC needs its own
> > > copy of code for the rules. But if RTE does that, only one copy of code needs to
> > > be maintained, right?
> > 
> > I've considered having rules stored in a common format understood at the RTE
> > level and not specific to each PMD and decided that the opaque rte_flow pointer
> > was a better choice for the following reasons:
> > 
> > - Even though flow rules management is done in the control path, processing
> >   must be as fast as possible. Letting PMDs store flow rules using their own
> >   internal representation gives them the chance to achieve better
> >   performance.
> I don't quite understand. I think we're talking about maintaining the rules in SW. I don't think anything needs to be optimized according to specific NICs. If we need to optimize the code, I think we need to consider the CPU, OS ... and some common means. Am I wrong?

Perhaps we were talking about different things, here I was only explaining
why rte_flow (the result of creating a flow rule) should be opaque and fully
managed by the PMD. More on the SW side of things below.

> > - An opaque context managed by PMDs would probably have to be stored
> >   somewhere as well anyway.
> > 
> > - PMDs may not need to allocate/store anything at all if they exclusively
> >   rely on HW state for everything. In my opinion, the generic API has enough
> >   constraints for this to work and maintain consistency between flow
> >   rules. Note this is currently how most PMDs implement FDIR and other
> >   filter types.
> Yes, the rules are stored by HW. But considering stopping/starting the device, the rules in HW will be lost. We have to store the rules in SW and re-program them when restarting the device.

Assume a HW capable of keeping flow rules programmed even during a
stop/start cycle (e.g. mlx4/mlx5 may be able to do it from DPDK point of
view), don't you think it is more efficient to standardize on this behavior
and let PMDs restore flow rules for HW that do not support it regardless of
whether it would be done by RTE or the application (SW)?

> And in the existing code, we store the filters in SW, at least on Intel NICs. But I think we cannot reuse them: considering the priority and which category of filter should be chosen, I think we need a whole new table for the generic API. I think that's what's designed now, right?

So I understand you'd want RTE to help your PMD keep track of the flow rules
it created?

Nothing wrong with that, all I'm saying is that it should be entirely
optional. RTE should not automatically maintain a list. PMDs have to call
RTE helpers if they need help to maintain a context. These helpers are not
defined in this API yet because it is difficult to know what will be useful
in advance.

> > - RTE can (and will) provide helpers to avoid most of the code redundancy,
> >   PMDs are free to use them or manage everything by themselves.
> > 
> > - Given that the opaque rte_flow pointer associated with a flow rule is to
> >   be stored by the application, PMDs do not even have to keep references to
> >   them.
> I don't understand. Could you give more details?

In an application:

 rte_flow *foo = rte_flow_create(...);

In the above example, foo cannot be dereferenced by the application nor RTE,
only the PMD is aware of its contents. This object can only be used with
rte_flow*() functions.

PMDs are thus free to make this object grow as needed when adding internal
features without breaking any kind of public API/ABI.

What I meant is, given that the application is supposed to store foo
somewhere in order to destroy it later, the PMD does not have to keep track
of that pointer assuming it does not need to access it later on its own for
some reason.
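
To expand that into a complete create/destroy cycle, here is a minimal
sketch of the intended application-side usage (prototypes as drafted in
this RFC and still subject to change; error handling mostly omitted):

 struct rte_flow *flow;

 /* The application is the only one keeping the opaque handle around. */
 flow = rte_flow_create(port_id, &pattern, &actions);
 if (flow == NULL)
     return; /* rule could not be created */

 /* ...the PMD does not need this pointer back until the application
  * decides to act on the rule, e.g. to destroy it... */

 rte_flow_destroy(port_id, flow); /* the rule only disappears now */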

> > - The flow rules format described in this specification (pattern / actions)
> >   will be used by applications directly, and they will be free to arrange
> >   rules in lists, trees or in any other way if they need to keep flow
> >   specifications around for further processing.
> Who will create the lists, trees or whatever else? According to the previous discussion, I think the APP will program the rules one by one. So if the APP organizes the rules into lists or trees, the PMD doesn't know about that.
> And you said "the opaque rte_flow pointer associated with a flow rule is to be stored by the application". I'm lost here.

I guess that's because we're discussing two different things, flow rule
specifications and flow rule objects. Let me sum it up:

- Flow rule specifications are the patterns/actions combinations provided by
  applications to rte_flow_create(). Applications can store those as needed
  and organize them as they wish (hash, tree, list). Neither PMDs nor RTE
  will do it for them.

- Flow rule objects (struct rte_flow *) are generated when a flow rule is
  created. Applications must keep these around if they want to manipulate
  them later (i.e. destroy or query existing rules).

Then PMDs *may* need to keep and arrange flow rule objects internally for
management purposes. This could be because HW requires it, or to detect
conflicting rules, manage priorities and so on. Possible reasons are not
described in this API because they are considered PMD-specific needs.
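
To make the distinction concrete, an application wanting to keep both
around might use a bookkeeping structure along these lines (purely
hypothetical layout, nothing here is mandated by the API):

 #include <sys/queue.h>

 /* Hypothetical application-side entry. Specifications and flow rule
  * objects are distinct; the application owns both and may arrange
  * entries in a list as below, or a tree, hash table and so on. */
 struct app_flow_entry {
     struct rte_flow_pattern *pattern; /* specification, kept for re-use */
     struct rte_flow_actions *actions; /* specification, kept for re-use */
     struct rte_flow *flow;            /* object from rte_flow_create() */
     void *ctx;                        /* application-specific context */
     LIST_ENTRY(app_flow_entry) chain;
 };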

> > > When the port is stopped and restarted, RTE can reconfigure the rules.
> > > Is the concern that the PMD may adjust the sequence of the rules
> > > according to their priority, so every NIC has a different list of
> > > rules? But the PMD can adjust them again when RTE reconfigures the rules.
> > 
> > What about PMDs able to stop and restart ports without destroying their own
> > flow rules? If we assume flow rules must be destroyed when stopping a port,
> > these PMDs are needlessly penalized with slower stop/start cycles. Think about
> > it assuming thousands of flow rules.
> I believe the rules maintained in SW should not be destroyed, because they are used to re-program the device when it starts again.

Do we agree that applications should not care? Flow rules configured before
stopping a port must still be there after restarting it.

What we seem to not agree about is that you think RTE should be responsible
for restoring flow rules of devices that lose them when stopped. I think
doing so is unfair to devices for which it is not the case and not really
nice to applications, so my opinion is that the PMD is responsible for
restoring flow rules however it wants. It is free to use RTE helpers to keep
track of them, as long as it's all managed internally.
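
As a sketch of what this could look like inside a PMD whose HW loses rules
across a stop/start cycle (all names below are hypothetical, none of this
is part of the API):

 #include <sys/queue.h>

 /* Hypothetical PMD internals, for illustration only. */
 struct pmd_hw_rule { int resource; /* device-specific representation */ };

 struct pmd_flow { /* what the opaque struct rte_flow * points to here */
     TAILQ_ENTRY(pmd_flow) chain;
     struct pmd_hw_rule hw;
 };

 struct pmd_priv {
     TAILQ_HEAD(, pmd_flow) flows; /* every rule created so far */
 };

 int pmd_hw_program_rule(struct pmd_priv *priv, struct pmd_hw_rule *hw);

 /* Called on port start: replay all rules. Applications never see this,
  * their opaque rte_flow pointers remain valid across the stop/start. */
 static int
 pmd_replay_flows(struct pmd_priv *priv)
 {
     struct pmd_flow *flow;

     TAILQ_FOREACH(flow, &priv->flows, chain)
         if (pmd_hw_program_rule(priv, &flow->hw) < 0)
             return -1;
     return 0;
 }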

> > Thus from an application point of view, whatever happens when stopping and
> > restarting a port should not matter. If a flow rule was present before, it must
> > still be present afterwards. If the PMD had to destroy flow rules and re-create
> > them, it does not actually matter if they differ slightly at the HW level, as long as:
> > 
> > - Existing opaque flow rule pointers (rte_flow) are still valid to the PMD
> >   and refer to the same rules.
> > 
> > - The overall behavior of all rules is the same.
> > 
> > The list of rules you think of (patterns / actions) is maintained by applications
> > (not RTE), and only if they need them. RTE would needlessly duplicate this.
> As said before, I need more details to understand this. Maybe an example would be better :)

The generic format both RTE and applications might understand is the one
described in this API (struct rte_flow_pattern and struct
rte_flow_actions).

If we wanted RTE to maintain some sort of per-port state for flow rule
specifications, it would have to be a copy of these structures arranged
somehow (list or something else).

If we consider that PMDs need to keep a context object associated to a flow
rule (the opaque struct rte_flow *), then RTE would most likely have to
store it along with the flow specification.

Such a list may not be useful to applications (list lookups take time), so
they would implement their own redundant method. They might also require
extra room to attach some application context to flow rules. A generic list
cannot plan for it.

Applications know what they want to do with flow rules and are responsible
for managing them efficiently with RTE out of the way.

I'm not sure if this answered your question; if not, please describe a
scenario where an RTE-managed list of flow rules would be mandatory.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-19 13:12       ` Adrien Mazarguil
@ 2016-07-20  2:16         ` Lu, Wenzhuo
  2016-07-20 10:41           ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Lu, Wenzhuo @ 2016-07-20  2:16 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Adrien,


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Tuesday, July 19, 2016 9:12 PM
> To: Lu, Wenzhuo
> Cc: dev@dpdk.org; Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody;
> Ajit Khaparde; Rahul Lakkireddy; Jan Medala; John Daley; Chen, Jing D; Ananyev,
> Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> Guarch, Pablo; Olga Shern
> Subject: Re: [RFC] Generic flow director/filtering/classification API
> 
> On Tue, Jul 19, 2016 at 08:11:48AM +0000, Lu, Wenzhuo wrote:
> > Hi Adrien,
> > Thanks for your clarification. Most of my questions are answered, but a
> > few things may still need to be discussed; please see my comments below.
> 
> Hi Wenzhuo,
> 
> Please see below.
> 
> [...]
> > > > > Requirements for a new API:
> > > > >
> > > > > - Flexible and extensible without causing API/ABI problems for existing
> > > > >   applications.
> > > > > - Should be unambiguous and easy to use.
> > > > > - Support existing filtering features and actions listed in `Filter types`_.
> > > > > - Support packet alteration.
> > > > > - In case of overlapping filters, their priority should be well documented.
> > > > Does that mean we don't guarantee the consistent of priority? The
> > > > priority can
> > > be different on different NICs. So the behavior of the actions  can be
> different.
> > > Right?
> > >
> > > No, the intent is precisely to define what happens in order to get a
> > > consistent result across different devices, and document cases with
> undefined behavior.
> > > There must be no room left for interpretation.
> > >
> > > For example, the API must describe what happens when two overlapping
> > > filters (e.g. one matching an Ethernet header, another one matching
> > > an IP header) match a given packet at a given priority level.
> > >
> > > It is documented in section 4.1.1 (priorities) as "undefined behavior".
> > > Applications remain free to do it and deal with consequences, at
> > > least they know they cannot expect a consistent outcome, unless they
> > > use different priority levels for both rules, see also 4.4.5 (flow rules priority).
> > >
> > > > Seems the users still need to aware the some details of the HW? Do
> > > > we need
> > > to add the negotiation for the priority?
> > >
> > > Priorities as defined in this document may not be directly mappable
> > > to HW capabilities (e.g. HW does not support enough priorities, or
> > > that some corner case make them not work as described), in which
> > > case the PMD may choose to simulate priorities (again 4.4.5), as
> > > long as the end result follows the specification.
> > >
> > > So users must not be aware of some HW details, the PMD does and must
> > > perform the needed workarounds to suit their expectations. Users may
> > > only be impacted by errors while attempting to create rules that are
> > > either unsupported or would cause them (or existing rules) to diverge from
> the spec.
> > > The problem is that sometimes the priority of the filters is fixed by
> > > the HW implementation. For example, on ixgbe, n-tuple has a higher
> > > priority than flow director.
> 
> As a side note I did not know that N-tuple had a higher priority than flow
> director on ixgbe, priorities among filter types do not seem to be documented at
> all in DPDK. This is one of the reasons I think we need a generic API to handle
> flow configuration.
Totally agree with you. We haven't documented this info well enough. And even if we do, users have to study the details of every NIC, which can still make the filters very hard to use. I believe a generic API is very helpful here :)

> 
> 
> So, today an application cannot combine N-tuple and FDIR flow rules and get a
> reliable outcome, unless it is designed for specific devices with a known
> behavior.
> 
> > What's the right behavior for the PMD if the APP wants to create a flow
> > director rule with a higher or even equal priority than an existing
> > n-tuple rule? Should the PMD return failure?
> 
> First remember applications only deal with the generic API, PMDs are
> responsible for choosing the most appropriate HW implementation to use
> according to the requested flow rules (FDIR, N-tuple or anything else).
> 
> For the specific case of FDIR vs N-tuple, if the underlying HW supports both I do
> not see why the PMD would create a N-tuple rule. Doesn't FDIR support
> everything N-tuple can do and much more?
Talking about the filters, FDIR can cover n-tuple. I think that's why i40e only supports FDIR and not n-tuple. But n-tuple has its own strength: as we know, at least on Intel NICs, FDIR only supports a per-device mask, while n-tuple supports a per-rule mask.
Since every pattern has both a spec and a mask, we cannot guarantee the masks are the same. I think ixgbe will try to use n-tuple first if it can, because even when the masks differ, we can support them all.

> 
> Assuming such a thing happened anyway, that the PMD had to create a rule
> using a high priority filter type and that the application requests the creation of a
> rule that can only be done using a lower priority filter type, but also requested a
> higher priority for that rule, then yes, it should obviously fail.
> 
> That is, unless the PMD can perform some kind of workaround to have both.
> 
> > If so, do we need more failure reasons? According to this RFC, I think we
> > need to return "EEXIST: collision with an existing rule.", but that's not
> > very clear; the APP doesn't know the problem is the priority. Maybe a more
> > detailed reason would be helpful.
> 
> Possibly, I've defined a basic set of errors, there are quite a number of errno
> values to choose from. However I think we should not define too many values.
> In my opinion the basic set covers every possible failure:
> 
> - EINVAL: invalid format, rule is broken or cannot be understood by the PMD
>   anyhow.
> 
> - ENOTSUP: pattern/actions look fine but something in the requested rule is
>   not supported and thus cannot be applied.
> 
> - EEXIST: pattern/actions are fine and could have been applied if only some
>   other rule did not prevent the PMD from doing it (I see it as the
>   closest thing to "ETOOBAD" which unfortunately does not exist).
> 
> - ENOMEM: like EEXIST, except it is due to the lack of resources not because
>   of another rule. I wasn't sure which of ENOMEM or ENOSPC was better but
>   settled on ENOMEM as it is well known. Still open to debate.
> 
> Errno values are only useful to get a rough idea of the reason, and another
> mechanism is needed to pinpoint the exact problem for debugging/reporting
> purposes, something like:
> 
>  enum rte_flow_error_type {
>      RTE_FLOW_ERROR_TYPE_NONE,
>      RTE_FLOW_ERROR_TYPE_UNKNOWN,
>      RTE_FLOW_ERROR_TYPE_PRIORITY,
>      RTE_FLOW_ERROR_TYPE_PATTERN,
>      RTE_FLOW_ERROR_TYPE_ACTION,
>  };
> 
>  struct rte_flow_error {
>      enum rte_flow_error_type type;
>      void *offset; /* Points to the exact pattern item or action. */
>      const char *message;
>  };
When we use a CLI and a command fails, it normally tells us which parameter is inappropriate. So I think it's a good idea to have this error structure :)

> 
> Then either provide an optional struct rte_flow_error pointer to
> rte_flow_validate(), or a separate function (rte_flow_analyze()?), since
> processing this may be quite expensive and applications may not care about the
> exact reason.
Agreed, the processing may be too expensive. Maybe we can say that returning error details is optional. And it's a good question what the APP should do if creating a rule fails. I believe it will normally choose to handle the rule by itself, but I think feeding back more information does no harm. And even if the APP wanted to adjust its rules, that would not be an option for lack of info.

> 
> What do you suggest?
> 
> > > > > Behavior
> > > > > --------
> > > > >
> > > > > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> > > > >   returned).
> > > > >
> > > > > - There is no provision for reentrancy/multi-thread safety, although
> nothing
> > > > >   should prevent different devices from being configured at the same
> > > > >   time. PMDs may protect their control path functions accordingly.
> > > > >
> > > > > - Stopping the data path (TX/RX) should not be necessary when
> > > > > managing
> > > flow
> > > > >   rules. If this cannot be achieved naturally or with workarounds (such as
> > > > >   temporarily replacing the burst function pointers), an appropriate error
> > > > >   code must be returned (``EBUSY``).
> > > > PMD cannot stop the data path without adding a lock. So I think if
> > > > some rules cannot be applied without stopping RX/TX, the PMD has to
> > > > return failure. Or let the APP stop the data path.
> > >
> > > Agreed, that is the intent. If the PMD cannot touch flow rules for
> > > some reason even after trying really hard, then it just returns EBUSY.
> > >
> > > Perhaps we should write down that applications may get a different
> > > outcome after stopping the data path if they get EBUSY?
> > Agreed, it's better to describe the APP side in more detail. BTW, I
> > checked the behavior of ixgbe/igb; I think we can add/delete filters at
> > runtime. Hopefully we'll not hit too many EBUSY problems on other NICs
> > :)
> 
> OK, I will add it.
> 
> > > > > - PMDs, not applications, are responsible for maintaining flow rules
> > > > >   configuration when stopping and restarting a port or performing other
> > > > >   actions which may affect them. They can only be destroyed explicitly.
> > > > Don’t understand " They can only be destroyed explicitly."
> > >
> > > This part says that as long as an application has not called
> > > rte_flow_destroy() on a flow rule, it never disappears, whatever
> > > happens to the port (stopped, restarted). The application is not
> > > responsible for re-creating rules after that.
> > >
> > > Note that according to the specification, this may translate to not
> > > being able to stop a port as long as a flow rule is present,
> > > depending on how nice the PMD intends to be with applications.
> > > Implementation can be done in small steps with minimal amount of code on
> the PMD side.
> > Does it mean the PMD should store and maintain all the rules? Why not let
> > RTE do that? If every PMD maintains all the rules, every kind of NIC
> > needs its own copy of the rule-management code. But if RTE does it, only
> > one copy of the code needs to be maintained, right?
> 
> I've considered having rules stored in a common format understood at the RTE
> level and not specific to each PMD and decided that the opaque rte_flow pointer
> was a better choice for the following reasons:
> 
> - Even though flow rules management is done in the control path, processing
>   must be as fast as possible. Letting PMDs store flow rules using their own
>   internal representation gives them the chance to achieve better
>   performance.
I don't quite understand. I think we're talking about maintaining the rules in SW. I don't think anything needs to be optimized for specific NICs; if we need to optimize the code, I think we need to consider the CPU, the OS... and other common factors. Am I wrong?

> 
> - An opaque context managed by PMDs would probably have to be stored
>   somewhere as well anyway.
> 
> - PMDs may not need to allocate/store anything at all if they exclusively
>   rely on HW state for everything. In my opinion, the generic API has enough
>   constraints for this to work and maintain consistency between flow
>   rules. Note this is currently how most PMDs implement FDIR and other
>   filter types.
Yes, the rules are stored in HW. But when the device is stopped and started, the rules in HW will be lost. We have to store the rules in SW and re-program them when restarting the device.
And in the existing code, we do store the filters in SW, at least on Intel NICs. But I think we cannot reuse that: considering priorities and which category of filter should be chosen, I think we need a whole new table for the generic API. That's what is being designed now, right?

> 
> - RTE can (and will) provide helpers to avoid most of the code redundancy,
>   PMDs are free to use them or manage everything by themselves.
> 
> - Given that the opaque rte_flow pointer associated with a flow rule is to
>   be stored by the application, PMDs do not even have to keep references to
>   them.
I don't understand. Could you give more details?

> 
> - The flow rules format described in this specification (pattern / actions)
>   will be used by applications directly, and they will be free to arrange
>   rules in lists, trees or in any other way if they need to keep flow
>   specifications around for further processing.
Who will create the lists, trees or whatever else? According to the previous discussion, I think the APP will program the rules one by one. So if the APP organizes the rules into lists or trees, the PMD doesn't know about that.
And you said "the opaque rte_flow pointer associated with a flow rule is to be stored by the application". I'm lost here.

> 
> > When the port is stopped and restarted, RTE can reconfigure the rules. Is
> > the concern that the PMD may adjust the sequence of the rules according
> > to their priority, so every NIC has a different list of rules? But the
> > PMD can adjust them again when RTE reconfigures the rules.
> 
> What about PMDs able to stop and restart ports without destroying their own
> flow rules? If we assume flow rules must be destroyed when stopping a port,
> these PMDs are needlessly penalized with slower stop/start cycles. Think about
> it assuming thousands of flow rules.
I believe the rules maintained in SW should not be destroyed, because they are used to re-program the device when it starts again.

> 
> Thus from an application point of view, whatever happens when stopping and
> restarting a port should not matter. If a flow rule was present before, it must
> still be present afterwards. If the PMD had to destroy flow rules and re-create
> them, it does not actually matter if they differ slightly at the HW level, as long as:
> 
> - Existing opaque flow rule pointers (rte_flow) are still valid to the PMD
>   and refer to the same rules.
> 
> - The overall behavior of all rules is the same.
> 
> The list of rules you think of (patterns / actions) is maintained by applications
> (not RTE), and only if they need them. RTE would needlessly duplicate this.
As said before, I need more details to understand this. Maybe an example would be better :)

> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-19  8:11     ` Lu, Wenzhuo
@ 2016-07-19 13:12       ` Adrien Mazarguil
  2016-07-20  2:16         ` Lu, Wenzhuo
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-19 13:12 UTC (permalink / raw)
  To: Lu, Wenzhuo
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

On Tue, Jul 19, 2016 at 08:11:48AM +0000, Lu, Wenzhuo wrote:
> Hi Adrien,
> Thanks for your clarification. Most of my questions are answered, but a few things may still need to be discussed; please see my comments below.

Hi Wenzhuo,

Please see below.

[...]
> > > > Requirements for a new API:
> > > >
> > > > - Flexible and extensible without causing API/ABI problems for existing
> > > >   applications.
> > > > - Should be unambiguous and easy to use.
> > > > - Support existing filtering features and actions listed in `Filter types`_.
> > > > - Support packet alteration.
> > > > - In case of overlapping filters, their priority should be well documented.
> > > Does that mean we don't guarantee the consistent of priority? The priority can
> > be different on different NICs. So the behavior of the actions  can be different.
> > Right?
> > 
> > No, the intent is precisely to define what happens in order to get a consistent
> > result across different devices, and document cases with undefined behavior.
> > There must be no room left for interpretation.
> > 
> > For example, the API must describe what happens when two overlapping filters
> > (e.g. one matching an Ethernet header, another one matching an IP header)
> > match a given packet at a given priority level.
> > 
> > It is documented in section 4.1.1 (priorities) as "undefined behavior".
> > Applications remain free to do it and deal with consequences, at least they know
> > they cannot expect a consistent outcome, unless they use different priority
> > levels for both rules, see also 4.4.5 (flow rules priority).
> > 
> > > Seems the users still need to aware the some details of the HW? Do we need
> > to add the negotiation for the priority?
> > 
> > Priorities as defined in this document may not be directly mappable to HW
> > capabilities (e.g. HW does not support enough priorities, or that some corner
> > case make them not work as described), in which case the PMD may choose to
> > simulate priorities (again 4.4.5), as long as the end result follows the
> > specification.
> > 
> > So users must not be aware of some HW details, the PMD does and must
> > perform the needed workarounds to suit their expectations. Users may only be
> > impacted by errors while attempting to create rules that are either unsupported
> > or would cause them (or existing rules) to diverge from the spec.
> The problem is that sometimes the priority of the filters is fixed by the
> HW implementation. For example, on ixgbe, n-tuple has a higher priority
> than flow director.

As a side note I did not know that N-tuple had a higher priority than flow
director on ixgbe, priorities among filter types do not seem to be
documented at all in DPDK. This is one of the reasons I think we need a
generic API to handle flow configuration.

So, today an application cannot combine N-tuple and FDIR flow rules and get
a reliable outcome, unless it is designed for specific devices with a known
behavior.

> What's the right behavior for the PMD if the APP wants to create a flow director rule with a higher or even equal priority than an existing n-tuple rule? Should the PMD return failure?

First remember applications only deal with the generic API, PMDs are
responsible for choosing the most appropriate HW implementation to use
according to the requested flow rules (FDIR, N-tuple or anything else).

For the specific case of FDIR vs N-tuple, if the underlying HW supports both
I do not see why the PMD would create a N-tuple rule. Doesn't FDIR support
everything N-tuple can do and much more?

Assuming such a thing happened anyway, that the PMD had to create a rule
using a high priority filter type and that the application requests the
creation of a rule that can only be done using a lower priority filter type,
but also requested a higher priority for that rule, then yes, it should
obviously fail.

That is, unless the PMD can perform some kind of workaround to have both.

> If so, do we need more failure reasons? According to this RFC, I think we need to return "EEXIST: collision with an existing rule.", but that's not very clear; the APP doesn't know the problem is the priority. Maybe a more detailed reason would be helpful.

Possibly, I've defined a basic set of errors, there are quite a number of
errno values to choose from. However I think we should not define too many
values. In my opinion the basic set covers every possible failure:

- EINVAL: invalid format, rule is broken or cannot be understood by the PMD
  anyhow.

- ENOTSUP: pattern/actions look fine but something in the requested rule is
  not supported and thus cannot be applied.

- EEXIST: pattern/actions are fine and could have been applied if only some
  other rule did not prevent the PMD from doing it (I see it as the closest
  thing to "ETOOBAD" which unfortunately does not exist).

- ENOMEM: like EEXIST, except it is due to the lack of resources not because
  of another rule. I wasn't sure which of ENOMEM or ENOSPC was better but
  settled on ENOMEM as it is well known. Still open to debate.

Errno values are only useful to get a rough idea of the reason, and another
mechanism is needed to pinpoint the exact problem for debugging/reporting
purposes, something like:

 enum rte_flow_error_type {
     RTE_FLOW_ERROR_TYPE_NONE,
     RTE_FLOW_ERROR_TYPE_UNKNOWN,
     RTE_FLOW_ERROR_TYPE_PRIORITY,
     RTE_FLOW_ERROR_TYPE_PATTERN,
     RTE_FLOW_ERROR_TYPE_ACTION,
 };

 struct rte_flow_error {
     enum rte_flow_error_type type;
     void *offset; /* Points to the exact pattern item or action. */
     const char *message;
 };

Then either provide an optional struct rte_flow_error pointer to
rte_flow_validate(), or a separate function (rte_flow_analyze()?), since
processing this may be quite expensive and applications may not care about
the exact reason.
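
For the sake of discussion, usage could then look like the sketch below,
assuming rte_flow_validate() takes the extra error argument and returns a
negative errno value on failure (both points remain to be settled):

 struct rte_flow_error error = { .type = RTE_FLOW_ERROR_TYPE_NONE };
 int ret = rte_flow_validate(port_id, &pattern, &actions, &error);

 if (ret < 0) {
     switch (-ret) {
     case ENOTSUP: /* rule understood but not supported by this device */
         break;
     case EEXIST: /* another rule prevents this one */
         break;
     default: /* EINVAL, ENOMEM, ... */
         break;
     }
     /* Optional detailed report for debugging purposes. */
     if (error.type == RTE_FLOW_ERROR_TYPE_PATTERN)
         printf("bad pattern item at %p (%s)\n", error.offset,
                error.message ? error.message : "no details");
 }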

What do you suggest?

> > > > Behavior
> > > > --------
> > > >
> > > > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> > > >   returned).
> > > >
> > > > - There is no provision for reentrancy/multi-thread safety, although nothing
> > > >   should prevent different devices from being configured at the same
> > > >   time. PMDs may protect their control path functions accordingly.
> > > >
> > > > - Stopping the data path (TX/RX) should not be necessary when managing
> > flow
> > > >   rules. If this cannot be achieved naturally or with workarounds (such as
> > > >   temporarily replacing the burst function pointers), an appropriate error
> > > >   code must be returned (``EBUSY``).
> > > PMD cannot stop the data path without adding a lock. So I think if some
> > > rules cannot be applied without stopping RX/TX, the PMD has to return
> > > failure. Or let the APP stop the data path.
> > 
> > Agreed, that is the intent. If the PMD cannot touch flow rules for some reason
> > even after trying really hard, then it just returns EBUSY.
> > 
> > Perhaps we should write down that applications may get a different outcome
> > after stopping the data path if they get EBUSY?
> Agreed, it's better to describe the APP side in more detail. BTW, I checked the behavior of ixgbe/igb; I think we can add/delete filters at runtime. Hopefully we'll not hit too many EBUSY problems on other NICs :)

OK, I will add it.

> > > > - PMDs, not applications, are responsible for maintaining flow rules
> > > >   configuration when stopping and restarting a port or performing other
> > > >   actions which may affect them. They can only be destroyed explicitly.
> > > Don’t understand " They can only be destroyed explicitly."
> > 
> > This part says that as long as an application has not called
> > rte_flow_destroy() on a flow rule, it never disappears, whatever happens to the
> > port (stopped, restarted). The application is not responsible for re-creating rules
> > after that.
> > 
> > Note that according to the specification, this may translate to not being able to
> > stop a port as long as a flow rule is present, depending on how nice the PMD
> > intends to be with applications. Implementation can be done in small steps with
> > minimal amount of code on the PMD side.
> Does it mean the PMD should store and maintain all the rules? Why not let RTE do that? If every PMD maintains all the rules, every kind of NIC needs its own copy of the rule-management code. But if RTE does it, only one copy of the code needs to be maintained, right?

I've considered having rules stored in a common format understood at the RTE
level and not specific to each PMD and decided that the opaque rte_flow
pointer was a better choice for the following reasons: 

- Even though flow rules management is done in the control path, processing
  must be as fast as possible. Letting PMDs store flow rules using their own
  internal representation gives them the chance to achieve better
  performance.

- An opaque context managed by PMDs would probably have to be stored
  somewhere as well anyway.

- PMDs may not need to allocate/store anything at all if they exclusively
  rely on HW state for everything. In my opinion, the generic API has enough
  constraints for this to work and maintain consistency between flow
  rules. Note this is currently how most PMDs implement FDIR and other
  filter types.

- RTE can (and will) provide helpers to avoid most of the code redundancy,
  PMDs are free to use them or manage everything by themselves.

- Given that the opaque rte_flow pointer associated with a flow rule is to
  be stored by the application, PMDs do not even have to keep references to
  them.

- The flow rules format described in this specification (pattern / actions)
  will be used by applications directly, and they will be free to arrange
  rules in lists, trees or in any other way if they need to keep flow
  specifications around for further processing.

> When the port is stopped and restarted, RTE can reconfigure the rules. Is the concern that the PMD may adjust the sequence of the rules according to their priority, so every NIC has a different list of rules? But the PMD can adjust them again when RTE reconfigures the rules.

What about PMDs able to stop and restart ports without destroying their own
flow rules? If we assume flow rules must be destroyed when stopping a port,
these PMDs are needlessly penalized with slower stop/start cycles. Think
about it assuming thousands of flow rules.

Thus from an application point of view, whatever happens when stopping and
restarting a port should not matter. If a flow rule was present before, it
must still be present afterwards. If the PMD had to destroy flow rules and
re-create them, it does not actually matter if they differ slightly at the
HW level, as long as:

- Existing opaque flow rule pointers (rte_flow) are still valid to the PMD
  and refer to the same rules.

- The overall behavior of all rules is the same.

The list of rules you think of (patterns / actions) is maintained by
applications (not RTE), and only if they need them. RTE would needlessly
duplicate this.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-07 10:26   ` Adrien Mazarguil
@ 2016-07-19  8:11     ` Lu, Wenzhuo
  2016-07-19 13:12       ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Lu, Wenzhuo @ 2016-07-19  8:11 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Adrien,
Thanks for your clarification. Most of my questions are answered, but a few things may still need to be discussed; please see my comments below.


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Thursday, July 7, 2016 6:27 PM
> To: Lu, Wenzhuo
> Cc: dev@dpdk.org; Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody;
> Ajit Khaparde; Rahul Lakkireddy; Jan Medala; John Daley; Chen, Jing D; Ananyev,
> Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> Guarch, Pablo; Olga Shern
> Subject: Re: [RFC] Generic flow director/filtering/classification API
> 
> Hi Lu Wenzhuo,
> 
> Thanks for your feedback, I'm replying below as well.
> 
> On Thu, Jul 07, 2016 at 07:14:18AM +0000, Lu, Wenzhuo wrote:
> > Hi Adrien,
> > I have some questions, please see inline, thanks.
> >
> > > -----Original Message-----
> > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > Sent: Wednesday, July 6, 2016 2:17 AM
> > > To: dev@dpdk.org
> > > Cc: Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody; Ajit
> > > Khaparde; Rahul Lakkireddy; Lu, Wenzhuo; Jan Medala; John Daley;
> > > Chen, Jing D; Ananyev, Konstantin; Matej Vido; Alejandro Lucero;
> > > Sony Chacko; Jerin Jacob; De Lara Guarch, Pablo; Olga Shern
> > > Subject: [RFC] Generic flow director/filtering/classification API
> > >
> > >
> > > Requirements for a new API:
> > >
> > > - Flexible and extensible without causing API/ABI problems for existing
> > >   applications.
> > > - Should be unambiguous and easy to use.
> > > - Support existing filtering features and actions listed in `Filter types`_.
> > > - Support packet alteration.
> > > - In case of overlapping filters, their priority should be well documented.
> > Does that mean we don't guarantee the consistent of priority? The priority can
> be different on different NICs. So the behavior of the actions  can be different.
> Right?
> 
> No, the intent is precisely to define what happens in order to get a consistent
> result across different devices, and document cases with undefined behavior.
> There must be no room left for interpretation.
> 
> For example, the API must describe what happens when two overlapping filters
> (e.g. one matching an Ethernet header, another one matching an IP header)
> match a given packet at a given priority level.
> 
> It is documented in section 4.1.1 (priorities) as "undefined behavior".
> Applications remain free to do it and deal with consequences, at least they know
> they cannot expect a consistent outcome, unless they use different priority
> levels for both rules, see also 4.4.5 (flow rules priority).
> 
> > Seems the users still need to aware the some details of the HW? Do we need
> to add the negotiation for the priority?
> 
> Priorities as defined in this document may not be directly mappable to HW
> capabilities (e.g. HW does not support enough priorities, or that some corner
> case make them not work as described), in which case the PMD may choose to
> simulate priorities (again 4.4.5), as long as the end result follows the
> specification.
> 
> So users must not be aware of some HW details, the PMD does and must
> perform the needed workarounds to suit their expectations. Users may only be
> impacted by errors while attempting to create rules that are either unsupported
> or would cause them (or existing rules) to diverge from the spec.
The problem is that sometimes the priority of the filters is fixed by the HW implementation. For example, on ixgbe, n-tuple has a higher priority than flow director. What's the right behavior for the PMD if the APP wants to create a flow director rule with a higher or even equal priority than an existing n-tuple rule? Should the PMD return failure?
If so, do we need more failure reasons? According to this RFC, I think we need to return "EEXIST: collision with an existing rule.", but that's not very clear; the APP doesn't know the problem is the priority. Maybe a more detailed reason would be helpful.


> > > Behavior
> > > --------
> > >
> > > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> > >   returned).
> > >
> > > - There is no provision for reentrancy/multi-thread safety, although nothing
> > >   should prevent different devices from being configured at the same
> > >   time. PMDs may protect their control path functions accordingly.
> > >
> > > - Stopping the data path (TX/RX) should not be necessary when managing
> flow
> > >   rules. If this cannot be achieved naturally or with workarounds (such as
> > >   temporarily replacing the burst function pointers), an appropriate error
> > >   code must be returned (``EBUSY``).
> > PMD cannot stop the data path without adding a lock. So I think if some
> > rules cannot be applied without stopping RX/TX, the PMD has to return
> > failure. Or let the APP stop the data path.
> 
> Agreed, that is the intent. If the PMD cannot touch flow rules for some reason
> even after trying really hard, then it just returns EBUSY.
> 
> Perhaps we should write down that applications may get a different outcome
> after stopping the data path if they get EBUSY?
Agreed, it's better to describe the APP side in more detail. BTW, I checked the behavior of ixgbe/igb; I think we can add/delete filters at runtime. Hopefully we'll not hit too many EBUSY problems on other NICs :)

> 
> > > - PMDs, not applications, are responsible for maintaining flow rules
> > >   configuration when stopping and restarting a port or performing other
> > >   actions which may affect them. They can only be destroyed explicitly.
> > Don’t understand " They can only be destroyed explicitly."
> 
> This part says that as long as an application has not called
> rte_flow_destroy() on a flow rule, it never disappears, whatever happens to the
> port (stopped, restarted). The application is not responsible for re-creating rules
> after that.
> 
> Note that according to the specification, this may translate to not being able to
> stop a port as long as a flow rule is present, depending on how nice the PMD
> intends to be with applications. Implementation can be done in small steps with
> minimal amount of code on the PMD side.
Does it mean the PMD should store and maintain all the rules? Why not let RTE do that? If every PMD maintains all the rules, every kind of NIC needs its own copy of the rule-management code. But if RTE does it, only one copy of the code needs to be maintained, right?
When the port is stopped and restarted, RTE can reconfigure the rules. Is the concern that the PMD may adjust the sequence of the rules according to their priority, so every NIC has a different list of rules? But the PMD can adjust them again when RTE reconfigures the rules.

> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-18 13:26             ` Chandran, Sugesh
@ 2016-07-18 15:00               ` Adrien Mazarguil
  2016-07-20 16:32                 ` Chandran, Sugesh
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-18 15:00 UTC (permalink / raw)
  To: Chandran, Sugesh
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

On Mon, Jul 18, 2016 at 01:26:09PM +0000, Chandran, Sugesh wrote:
> Hi Adrien,
> Thank you for getting back on this.
> Please find my comments below.

Hi Sugesh,

Same for me, removed again the parts we agree on.

[...]
> > > > > > > [Sugesh] Another concern is the cost and time of installing
> > > > > > > these rules in the hardware. Can we make these APIs time
> > > > > > > bound(or at least an option
> > > > > > to
> > > > > > > set the time limit to execute these APIs), so that Application
> > > > > > > doesn’t have to wait so long when installing and deleting
> > > > > > > flows
> > > > > > with
> > > > > > > slow hardware/NIC. What do you think? Most of the datapath
> > > > > > > flow
> > > > > > installations are
> > > > > > > dynamic and triggered only when there is an ingress traffic.
> > > > > > > Delay in flow insertion/deletion have unpredictable
> > > > > > consequences.
> > > > > >
> > > > > > This API is (currently) aimed at the control path only, and must
> > > > > > indeed be assumed to be slow. Creating millions of rules may take
> > > > > > quite long as it may involve syscalls and other time-consuming
> > > > > > synchronization things on the PMD side.
> > > > > >
> > > > > > So currently there is no plan to have rules added from the data
> > > > > > path with time constraints. I think it would be implemented
> > > > > > through a different set of functions anyway.
> > > > > >
> > > > > > I do not think adding time limits is practical, even specifying
> > > > > > in the API that creating a single flow rule must take less than
> > > > > > a maximum number of seconds in order to be effective is too much
> > > > > > of a constraint (applications that create all flows during init
> > > > > > may not care after
> > > > all).
> > > > > >
> > > > > > You should consider in any case that modifying flow rules will
> > > > > > always be slower than receiving packets, there is no way around
> > > > > > that. Applications have to live with it and provide a software
> > > > > > fallback for incoming packets while managing flow rules.
> > > > > >
> > > > > > Moreover, think about what happens when you hit the maximum
> > > > number
> > > > > > of flow rules and cannot create any more. Applications need to
> > > > > > implement some kind of fallback in their data path.
> > > > > >
> > > > > > Offloading flows in HW is also only useful if they live much
> > > > > > longer than the time taken to create and delete them. Perhaps
> > > > > > applications may choose to do so after detecting long lived
> > > > > > flows such as TCP sessions.
> > > > > >
> > > > > > You may have one separate control thread dedicated to manage
> > > > > > flows and keep your normal control thread unaffected by delays.
> > > > > > Several threads can even be dedicated, one per device.
> > > > [Sugesh] I agree that the flow insertion cannot be as fast as the
> > > > packet receiving rate. From the application point of view, the
> > > > problem arises when hardware flow insertion takes longer than
> > > > software flow insertion. At least the application has to know the
> > > > cost of inserting/deleting a rule in hardware beforehand; otherwise,
> > > > how can the application choose the right flow candidates for
> > > > hardware? My point here is that the application expects deterministic
> > > > behavior from a classifier while inserting and deleting rules.
> > > >
> > > > Understood, however it will be difficult to estimate, particularly
> > > > if a PMD must rearrange flow rules to make room for a new one due to
> > > > priority levels collision or some other HW-related reason. I mean,
> > > > spent time cannot be assumed to be constant, even PMDs cannot know
> > > > in advance because it also depends on the performance of the host CPU.
> > > >
> > > > Such applications may find it easier to measure elapsed time for the
> > > > rules they create, make statistics and extrapolate from this
> > > > information for future rules. I do not think the PMD can help much here.
> > > [Sugesh] From an application point of view this can be an issue.
> > > There is even a security concern when we program a short-lived flow.
> > > Let's consider this case:
> > >
> > > 1) The control plane programs the hardware with a queue termination flow.
> > > 2) The software dataplane is programmed to treat the packets from the
> > > specific queue accordingly.
> > > 3) The flow is removed from the hardware (let's consider this a long
> > > wait process...), or the hardware even takes more time to report the
> > > status than to remove the flow physically. Now the packets in the queue
> > > are no longer considered as matched/flow hits, because the software
> > > dataplane update has yet to happen.
> > > We need a way to sync between the software datapath and the classifier
> > > APIs even though they are programmed from different control threads.
> > >
> > > Are we saying these APIs are only meant for user defined static flows??
> > 
> > No, that is definitely not the intent. These are good points.
> > 
> > With the specified API, applications may have to adapt their logic and take
> > extra precautions in order to remain on the safe side at all times.
> > 
> > For your above example, the application cannot assume a rule is
> > added/deleted as long as the PMD has not completed the related operation,
> > which means keeping the SW rule/fallback in place in the meantime. Should
> > handle security concerns as long as after removing a rule, packets end up in a
> > default queue entirely processed by SW. Obviously this may worsen
> > response time.
> > 
> > The ID action can help with this. By knowing which rule a received packet is
> > associated with, processing can be temporarily offloaded by another thread
> > without much complexity.
> [Sugesh] Setting an ID for every flow may not be viable, especially when the
> ID is small (just 8 bits). I am not sure whether this is a valid case though.

Agreed, I'm not saying this solution works for all devices, particularly
those that do not support ID at all.

> How about a hardware flow flag in the packet descriptor that is set when a
> packet hits any hardware rule? This way software is not blocked by a
> hardware rule. Even though there is an additional overhead in validating
> this flag, the software datapath can easily identify hardware-processed
> packets. Packets would traverse the software fallback path until the rule
> configuration is complete. This flag avoids setting an ID action for every
> hardware flow being configured.

That makes sense. I see it as a sort of single bit ID but it could be
implemented through a different action for less capable devices. PMDs that
support 32 bit IDs could reuse the same code for both features.

I understand you'd prefer having this feature always present, however we
already know that not all PMDs/devices support it, and like everything else
this is a kind of offload that needs to be explicitly requested by the
application as it may not be needed.

If we go with the separate action, then perhaps it would make sense to
rename "ID" to "MARK" to make things clearer:

 RTE_FLOW_ACTION_TYPE_FLAG /* Flag packets processed by flow rule. */

 RTE_FLOW_ACTION_TYPE_MARK /* Attach a 32 bit value to a packet. */

I guess the result of the FLAG action would be something in ol_flags.
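
RX processing could then look like the sketch below. PKT_RX_FLOW_FLAG,
PKT_RX_FLOW_MARK, the location of the mark value and the helper functions
are all placeholders, none of them exists today:

 uint16_t i;
 uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, RTE_DIM(pkts));

 for (i = 0; i != nb_rx; ++i) {
     struct rte_mbuf *m = pkts[i];

     if (m->ol_flags & PKT_RX_FLOW_MARK) /* MARK action */
         dispatch_by_mark(m, m->hash.fdir.hi); /* 32 bit rule value */
     else if (m->ol_flags & PKT_RX_FLOW_FLAG) /* FLAG action */
         fast_path(m);
     else
         sw_fallback(m); /* not (yet) handled by any HW rule */
 }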

Thoughts?

> > I think applications have to implement SW fallbacks all the time, as even
> > some sort of guarantee on the flow rule processing time may not be enough
> > to avoid misdirected packets and related security issues.
> [Sugesh] The software fallback will always be there. However, I am a little
> confused about how software is going to identify packets that have already
> been hardware-processed. I feel we need some notification in the packet
> itself when a hardware rule hits. ID/flag/any other options?

Yeah I think so too, as long as it is optional because we cannot assume all
PMDs will support it.

> > Let's wait for applications to start using this API and then consider an extra
> > set of asynchronous / real-time functions when the need arises. It should not
> > impact the way rules are specified
> [Sugesh] Sure. I think the rule definition will not be impacted by this.

Thanks for your comments.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-15 15:04           ` Adrien Mazarguil
@ 2016-07-18 13:26             ` Chandran, Sugesh
  2016-07-18 15:00               ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-18 13:26 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

Hi Adrien,
Thank you for getting back on this.
Please find my comments below.

Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Friday, July 15, 2016 4:04 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>;
> Chilikin, Andrey <andrey.chilikin@intel.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> On Fri, Jul 15, 2016 at 09:23:26AM +0000, Chandran, Sugesh wrote:
> > Thank you Adrien,
> > Please find below some more comments/inputs
> >
> > Let me know your thoughts on this.
> 
> Thanks, stripping again non relevant parts.
> 
> [...]
> > > > > > [Sugesh] Is it a limitation to use only 32 bit ID? Is it
> > > > > > possible to have a
> > > > > > 64 bit ID? So that application can use the control plane flow
> > > > > > pointer Itself as an ID. Does it make sense?
> > > > >
> > > > > I've specified a 32 bit ID for now because this is what FDIR
> > > > > supports and also what existing devices can report today AFAIK
> > > > > (i40e and
> > > mlx5).
> > > > >
> > > > > We could use 64 bit for future-proofness in a separate action like
> "ID64"
> > > > > when at least one device supports it.
> > > > >
> > > > > To PMD maintainers: please comment if you know devices that
> > > > > support tagging matching packets with more than 32 bits of
> > > > > user-provided data!
> > > > [Sugesh] I guess the flow director ID is 64 bit; the XL710 datasheet
> > > > says so. And in the 'rte_mbuf' structure the 64 bit FDIR ID is shared
> > > > with the RSS hash. This may be a software driver limitation that
> > > > exposes only 32 bits, possibly because of cache alignment issues?
> > > > Since the hardware can support 64 bits, I feel it makes sense to
> > > > support 64 bits as well.
> > >
> > > I agree we need 64 bit support, but then we also need a solution for
> > > devices that support only 32 bit. Possible methods I can think of:
> > >
> > > - A separate "ID64" action (or a "ID32" one, perhaps with a better name).
> > >
> > > - A single ID action with an unlimited number of bytes to return with
> > >   packets (would actually be a string). PMDs can then refuse to create
> flow
> > >   rules requesting an unsupported number of bytes. Devices
> > > supporting fewer
> > >   than 32 bits are also included this way without the need for yet another
> > >   action.
> > >
> > > Thoughts?
> > [Sugesh] I feel the single ID approach is much better. But I would say
> > a fixed-size ID is easier to handle at the upper layers. Say the PMD
> > returns a 64 bit ID in which the MSBs are masked out, based on how many
> > bits the hardware can support. The PMD can refuse an unsupported number
> > of bytes when requested. So the size of the ID is going to be a
> > parameter when programming the flow. What do you think?
> 
> What you suggest if I am not mistaken is:
> 
>  struct rte_flow_action_id {
>      uint64_t id;
>      uint64_t mask; /* either a bit-mask or a prefix/suffix length? */  };
> 
> I think in this case a mask is more versatile than a prefix/suffix length as the
> value itself comes in an unknown endian (from PMD's POV). It also allows
> specific bits to be taken into account, like when HW only supports 32 bit, with
> some black magic the full original 64 bit value can be restored as long as the
> application only cares about at most 32 bits anywhere.
> 
> However I do not think many applications "won't care" about specific bits in a
> given value and having to provide a properly crafted mask will be a hassle,
> they will just fill it with ones and hope for the best. As a result they won't
> take advantage of this feature or will stick to 32 bits all the time, or whatever
> happens to be the least common denominator.
> 
> My previous suggestion was:
> 
>  struct rte_flow_action_id {
>      uint8_t size; /* number of bytes in id[] */
>      uint8_t id[];
>  };
> 
> It does not solve the issue if an application requests more bytes than
> supported, however as a string, there is no endianness ambiguity and these
> bytes are copied as-is to the related mbuf field as if done through memcpy()
> possibly with some padding to fill the entire 64 bit field (copied bytes thus
> starting from MSB for big-endian machines, LSB for little-endian ones). The
> value itself remains opaque to the PMD.
> 
> One issue is the flexible array approach makes static initialization more
> complicated. Maybe it is not worth the trouble since according to Andrey,
> even X710 reports at most 32 bits of user data.
> 
> So what should we do? Fixed 32 bits ID for now to keep things simple, then
> another action for 64 bits later when necessary?
[Sugesh] I agree with you. We could keep things simple by having a 32 bit ID for now.
I mixed up the size of the ID with the flexible payload size, sorry about that.
In the future, we could add an action for 64 bits if necessary.
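
The fixed 32 bit version would then reduce to something like this (sketch
only, the final action name and where the value ends up in the mbuf are
still open):

 /* Sketch of the fixed 32 bit ID action discussed above. */
 struct rte_flow_action_id {
     uint32_t id; /* opaque value copied as-is to matching packets */
 };

A separate 64 bit action could still be added later if a device turns out
to support it.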

> 
> > > [...]
> > > > > > [Sugesh] Another concern is the cost and time of installing
> > > > > > these rules in the hardware. Can we make these APIs time
> > > > > > bound(or at least an option
> > > > > to
> > > > > > set the time limit to execute these APIs), so that Application
> > > > > > doesn’t have to wait so long when installing and deleting
> > > > > > flows
> > > > > with
> > > > > > slow hardware/NIC. What do you think? Most of the datapath
> > > > > > flow
> > > > > installations are
> > > > > > dynamic and triggered only when there is an ingress traffic.
> > > > > > Delay in flow insertion/deletion have unpredictable
> > > > > consequences.
> > > > >
> > > > > This API is (currently) aimed at the control path only, and must
> > > > > indeed be assumed to be slow. Creating millions of rules may take
> > > > > quite long as it may involve syscalls and other time-consuming
> > > > > synchronization things on the PMD side.
> > > > >
> > > > > So currently there is no plan to have rules added from the data
> > > > > path with time constraints. I think it would be implemented
> > > > > through a different set of functions anyway.
> > > > >
> > > > > I do not think adding time limits is practical, even specifying
> > > > > in the API that creating a single flow rule must take less than
> > > > > a maximum number of seconds in order to be effective is too much
> > > > > of a constraint (applications that create all flows during init
> > > > > may not care after
> > > all).
> > > > >
> > > > > You should consider in any case that modifying flow rules will
> > > > > always be slower than receiving packets, there is no way around
> > > > > that. Applications have to live with it and provide a software
> > > > > fallback for incoming packets while managing flow rules.
> > > > >
> > > > > Moreover, think about what happens when you hit the maximum
> > > number
> > > > > of flow rules and cannot create any more. Applications need to
> > > > > implement some kind of fallback in their data path.
> > > > >
> > > > > Offloading flows in HW is also only useful if they live much
> > > > > longer than the time taken to create and delete them. Perhaps
> > > > > applications may choose to do so after detecting long lived
> > > > > flows such as TCP sessions.
> > > > >
> > > > > You may have one separate control thread dedicated to manage
> > > > > flows and keep your normal control thread unaffected by delays.
> > > > > Several threads can even be dedicated, one per device.
> > > > [Sugesh] I agree that the flow insertion cannot be as fast as the
> > > > packet receiving rate.  From an application point of view the problem
> > > > will be when hardware flow insertion takes longer than software
> > > > flow insertion. At least the application has to know the cost of
> > > > inserting/deleting a rule in hardware beforehand. Otherwise how
> > > > can the application choose the right flow candidate for hardware? My
> > > > point
> > > here is that the application expects a deterministic behavior from a
> > > classifier while inserting and deleting rules.
> > >
> > > Understood, however it will be difficult to estimate, particularly
> > > if a PMD must rearrange flow rules to make room for a new one due to
> > > priority levels collision or some other HW-related reason. I mean,
> > > spent time cannot be assumed to be constant, even PMDs cannot know
> > > in advance because it also depends on the performance of the host CPU.
> > >
> > > Such applications may find it easier to measure elapsed time for the
> > > rules they create, make statistics and extrapolate from this
> > > information for future rules. I do not think the PMD can help much here.
> > [Sugesh] From an application point of view this can be an issue.
> > There is even a security concern when we program a short-lived flow.
> > Let's consider this case:
> >
> > 1) The control plane programs the hardware with a queue termination flow.
> > 2) The software dataplane is programmed to treat the packets from the
> > specific queue accordingly.
> > 3) The flow is removed from the hardware (let's consider this a long wait
> > process), or the hardware may even take more time to report the status
> > than to remove the flow physically. Now the packets in the queue are no
> > longer considered as matched/flow hit, because the software dataplane
> > update is yet to happen.
> > We need a way to sync between the software datapath and the classifier
> > APIs even though they are both programmed from a different control
> > thread.
> >
> > Are we saying these APIs are only meant for user-defined static flows??
> 
> No, that is definitely not the intent. These are good points.
> 
> With the specified API, applications may have to adapt their logic and take
> extra precautions in order to remain on the safe side at all times.
> 
> For your above example, the application cannot assume a rule is
> added/deleted as long as the PMD has not completed the related operation,
> which means keeping the SW rule/fallback in place in the meantime. This should
> handle security concerns as long as, after removing a rule, packets end up in a
> default queue entirely processed by SW. Obviously this may worsen
> response time.
> 
> The ID action can help with this. By knowing which rule a received packet is
> associated with, processing can be temporarily offloaded by another thread
> without much complexity.
[Sugesh] Setting an ID for every flow may not be viable, especially when the
size of the ID is small (just 8 bits, say). I am not sure if this is a valid
case though.

How about a hardware flow flag in the packet descriptor that is set when a
packet hits any hardware rule? This way the software is not blocked by a
hardware rule. Even though there is the additional overhead of validating this
flag, the software datapath can easily identify hardware-processed packets.
The packets then traverse the software fallback path until the rule
configuration is complete. This flag avoids setting an ID action for every
hardware flow being configured.
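
As a rough sketch of what the datapath check could look like, assuming the
existing PKT_RX_FDIR mbuf flag (or a new flag like it) is used as the
"hardware rule hit" indicator, with made-up helper names:

 #include <rte_mbuf.h>

 static inline void
 handle_rx_packet(struct rte_mbuf *m)
 {
     if (m->ol_flags & PKT_RX_FDIR) {
         /* Matched a hardware rule; skip the software classifier. */
         process_hw_matched(m);   /* hypothetical helper */
     } else {
         /* Rule not (yet) effective in hardware; software fallback. */
         process_sw_fallback(m);  /* hypothetical helper */
     }
 }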

> 
> I think applications have to implement SW fallbacks all the time, as even
> some sort of guarantee on the flow rule processing time may not be enough
> to avoid misdirected packets and related security issues.
[Sugesh] The software fallback will always be there. However, I am a little bit
confused about how the software is going to identify packets that have already been
hardware processed. I feel we need some notification in the packet itself when a
hardware rule hits. An ID/flag/any other options?
> 
> Let's wait for applications to start using this API and then consider an extra
> set of asynchronous / real-time functions when the need arises. It should not
> impact the way rules are specified.
[Sugesh] Sure. I think the rule definition will not be impacted by this.
> 
> > > > > > [Sugesh] Another query is on the synchronization part. What if
> > > > > > same rules
> > > > > are
> > > > > > handled from different threads? Is application responsible for
> > > > > > handling the
> > > > > concurrent
> > > > > > hardware programming?
> > > > >
> > > > > Like most (if not all) DPDK APIs, applications are responsible
> > > > > for managing locking issues as described in 4.3 (Behavior). Since
> > > > > this is a control path API and applications usually have a
> > > > > single control thread, locking should not be necessary in most cases.
> > > > >
> > > > > Regarding my above comment about using several control threads
> > > > > to manage different devices, section 4.3 says:
> > > > >
> > > > >  "There is no provision for reentrancy/multi-thread safety,
> > > > > although nothing  should prevent different devices from being
> > > > > configured at the same  time. PMDs may protect their control
> > > > > path functions
> > > accordingly."
> > > > >
> > > > > I'd like to emphasize it is not "per port" but "per device",
> > > > > since in a few cases a configurable resource is shared by several ports.
> > > > > It may be difficult for applications to determine which ports
> > > > > are shared by a given device but this falls outside the scope of this
> API.
> > > > >
> > > > > Do you think adding the guarantee that it is always safe to
> > > > > configure two different ports simultaneously without locking
> > > > > from the application side is necessary? In which case the PMD
> > > > > would be responsible for locking shared resources.
> > > > [Sugesh] This would be a little bit complicated when some of the ports
> > > > are not under DPDK itself (what if one port is managed by the kernel),
> > > > or ports are tied to different applications. Locking in the PMD helps
> > > > when the ports are accessed by multiple DPDK applications. However
> > > > what if the port itself
> > > is not under DPDK?
> > >
> > > Well, either we do not care about what happens outside of the DPDK
> > > context, or PMDs must find a way to satisfy everyone. I'm not a fan
> > > of locking either but it would be nice if flow rules configuration
> > > could be attempted on different ports simultaneously without the
> > > risk of wrecking anything, so that applications do not need to care.
> > >
> > > Possible cases for a dual port device with global flow rule settings
> > > affecting both ports:
> > >
> > > 1) ports 1 & 2 are managed by DPDK: this is the easy case, a rule that
> needs
> > >    to alter a global setting necessary for an existing rule on any port is
> > >    not allowed (EEXIST). PMD must maintain a device context common to
> both
> > >    ports in order for this to work. This context is either under lock, or
> > >    the first port on which a flow rule is created owns all future flow
> > >    rules.
> > >
> > > 2) port 1 is managed by DPDK, port 2 by something else, the PMD is
> aware of
> > >    it and knows that port 2 may modify the global context: no flow rules
> can
> > >    be created from the DPDK application due to safety issues (EBUSY?).
> > >
> > > 3) port 1 is managed by DPDK, port 2 by something else, the PMD is
> aware of
> > >    it and knows that port 2 will not modify flow rules: PMD should not care,
> > >    no lock necessary.
> > >
> > > 4) port 1 is managed by DPDK, port 2 by something else and the PMD is
> not
> > >    aware of it: either flow rules cannot be created ever at all, or we say
> > > >    it is the user's responsibility to make sure this does not happen.
> > >
> > > Considering that most control operations performed by DPDK affect
> > > the device regardless of other applications, I think 1) is the only
> > > case that should be defined, otherwise 4), defined as user's
> responsibility.
> 
> No more comments on this part? What do you suggest?
[Sugesh] I agree with your suggestions. I feel this is the best we can offer.

> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-15 10:02           ` Chilikin, Andrey
@ 2016-07-18 13:26             ` Chandran, Sugesh
  0 siblings, 0 replies; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-18 13:26 UTC (permalink / raw)
  To: Chilikin, Andrey, 'Adrien Mazarguil'
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Hi Andrey,

Regards
_Sugesh


> -----Original Message-----
> From: Chilikin, Andrey
> Sent: Friday, July 15, 2016 11:02 AM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>; 'Adrien Mazarguil'
> <adrien.mazarguil@6wind.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>
> Subject: RE: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> Hi Sugesh,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Chandran, Sugesh
> > Sent: Friday, July 15, 2016 10:23 AM
> > To: 'Adrien Mazarguil' <adrien.mazarguil@6wind.com>
> 
> <snip>
> 
> > > > > To PMD maintainers: please comment if you know devices that
> > > > > support tagging matching packets with more than 32 bits of
> > > > > user-provided data!
> > > > [Sugesh] I guess the flow director ID is 64 bit, the XL710 datasheet says
> so.
> > > > And in the 'rte_mbuf' structure the 64 bit FDIR-ID is shared with the
> > > > RSS hash. This can be a software driver limitation that exposes
> > > > only
> > > > 32 bits. Possibly because of cache alignment issues? Since the
> > > > hardware can support 64 bits, I feel it makes sense to support 64 bits as
> > well.
> 
> XL710 supports a 32 bit FDIR ID only; I believe you are mixing it up with the
> flexible payload data, which can take 0, 4 or 8 bytes of the RX descriptor.
[Sugesh] Thank you for the correction, Andrey.
It was my mistake, I mixed it up with the flexible payload data. The FDIR ID is only 32 bits.

> 
> Regards,
> Andrey

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-15  9:23         ` Chandran, Sugesh
  2016-07-15 10:02           ` Chilikin, Andrey
@ 2016-07-15 15:04           ` Adrien Mazarguil
  2016-07-18 13:26             ` Chandran, Sugesh
  1 sibling, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-15 15:04 UTC (permalink / raw)
  To: Chandran, Sugesh
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern, Chilikin, Andrey

On Fri, Jul 15, 2016 at 09:23:26AM +0000, Chandran, Sugesh wrote:
> Thank you Adrien,
> Please find below some more comments/inputs
> 
> Let me know your thoughts on this.

Thanks, stripping again non relevant parts.

[...]
> > > > > [Sugesh] Is it a limitation to use only 32 bit ID? Is it possible
> > > > > to have a
> > > > > 64 bit ID? So that application can use the control plane flow
> > > > > pointer Itself as an ID. Does it make sense?
> > > >
> > > > I've specified a 32 bit ID for now because this is what FDIR
> > > > supports and also what existing devices can report today AFAIK (i40e and
> > mlx5).
> > > >
> > > > We could use 64 bit for future-proofing in a separate action like "ID64"
> > > > when at least one device supports it.
> > > >
> > > > To PMD maintainers: please comment if you know devices that support
> > > > tagging matching packets with more than 32 bits of user-provided
> > > > data!
> > > [Sugesh] I guess the flow director ID is 64 bit, the XL710 datasheet says so.
> > > And in the 'rte_mbuf' structure the 64 bit FDIR-ID is shared with the RSS
> > > hash. This can be a software driver limitation that exposes only 32
> > > bits. Possibly because of cache alignment issues? Since the hardware
> > > can support 64 bits, I feel it makes sense to support 64 bits as well.
> > 
> > I agree we need 64 bit support, but then we also need a solution for devices
> > that support only 32 bit. Possible methods I can think of:
> > 
> > - A separate "ID64" action (or a "ID32" one, perhaps with a better name).
> > 
> > - A single ID action with an unlimited number of bytes to return with
> >   packets (would actually be a string). PMDs can then refuse to create flow
> >   rules requesting an unsupported number of bytes. Devices supporting
> > fewer
> >   than 32 bits are also included this way without the need for yet another
> >   action.
> > 
> > Thoughts?
> [Sugesh] I feel the single ID approach is much better. But I would say a fixed
> size ID is easier to handle at the upper layers. Say the PMD returns a 64 bit ID
> in which the MSBs are masked out, based on how many bits the hardware can
> support. The PMD can refuse an unsupported number of bytes when requested. So
> the size of the ID is going to be a parameter used to program the flow.
> What do you think?

What you suggest if I am not mistaken is:

 struct rte_flow_action_id {
     uint64_t id;
     uint64_t mask; /* either a bit-mask or a prefix/suffix length? */
 };

I think in this case a mask is more versatile than a prefix/suffix length as
the value itself comes in an unknown endian (from PMD's POV). It also allows
specific bits to be taken into account, like when HW only supports 32 bit,
with some black magic the full original 64 bit value can be restored as long
as the application only cares about at most 32 bits anywhere.

However I do not think many applications "won't care" about specific bits in
a given value and having to provide a properly crafted mask will be a
hassle, they will just fill it with ones and hope for the best. As a result
they won't take advantage of this feature or will stick to 32 bits all the
time, or whatever happens to be the least common denominator.

My previous suggestion was:

 struct rte_flow_action_id {
     uint8_t size; /* number of bytes in id[] */
     uint8_t id[];
 };

It does not solve the issue if an application requests more bytes than
supported, however as a string, there is no endianness ambiguity and these
bytes are copied as-is to the related mbuf field as if done through memcpy()
possibly with some padding to fill the entire 64 bit field (copied bytes
thus starting from MSB for big-endian machines, LSB for little-endian
ones). The value itself remains opaque to the PMD.

One issue is the flexible array approach makes static initialization more
complicated. Maybe it is not worth the trouble since according to Andrey,
even X710 reports at most 32 bits of user data.
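
As an illustration of that point (sketch only):

 /* Initializing a flexible array member is not valid ISO C; GCC
  * historically accepts it as an extension, but portable code would
  * need a separately allocated buffer instead. */
 static const struct rte_flow_action_id id_action = {
     .size = 4,
     .id = { 0xde, 0xad, 0xbe, 0xef }, /* GNU extension only */
 };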

So what should we do? Fixed 32 bits ID for now to keep things simple, then
another action for 64 bits later when necessary?

> > [...]
> > > > > [Sugesh] Another concern is the cost and time of installing these
> > > > > rules in the hardware. Can we make these APIs time bound (or at
> > > > > least an option
> > > > to
> > > > > set the time limit to execute these APIs), so that Application
> > > > > doesn’t have to wait so long when installing and deleting flows
> > > > with
> > > > > slow hardware/NIC. What do you think? Most of the datapath flow
> > > > installations are
> > > > > dynamic and triggered only when there is an ingress traffic. Delays
> > > > > in flow insertion/deletion have unpredictable
> > > > consequences.
> > > >
> > > > This API is (currently) aimed at the control path only, and must
> > > > indeed be assumed to be slow. Creating millions of rules may take
> > > > quite long as it may involve syscalls and other time-consuming
> > > > synchronization things on the PMD side.
> > > >
> > > > So currently there is no plan to have rules added from the data path
> > > > with time constraints. I think it would be implemented through a
> > > > different set of functions anyway.
> > > >
> > > > I do not think adding time limits is practical, even specifying in
> > > > the API that creating a single flow rule must take less than a
> > > > maximum number of seconds in order to be effective is too much of a
> > > > constraint (applications that create all flows during init may not care after
> > all).
> > > >
> > > > You should consider in any case that modifying flow rules will
> > > > always be slower than receiving packets, there is no way around
> > > > that. Applications have to live with it and provide a software
> > > > fallback for incoming packets while managing flow rules.
> > > >
> > > > Moreover, think about what happens when you hit the maximum
> > number
> > > > of flow rules and cannot create any more. Applications need to
> > > > implement some kind of fallback in their data path.
> > > >
> > > > Offloading flows in HW is also only useful if they live much longer
> > > > than the time taken to create and delete them. Perhaps applications
> > > > may choose to do so after detecting long lived flows such as TCP
> > > > sessions.
> > > >
> > > > You may have one separate control thread dedicated to manage flows
> > > > and keep your normal control thread unaffected by delays. Several
> > > > threads can even be dedicated, one per device.
> > > [Sugesh] I agree that the flow insertion cannot be as fast as the
> > > packet receiving rate.  From an application point of view the problem
> > > will be when hardware flow insertion takes longer than software flow
> > > insertion. At least the application has to know the cost of
> > > inserting/deleting a rule in hardware beforehand. Otherwise how
> > > can the application choose the right flow candidate for hardware? My point
> > here is that the application expects a deterministic behavior from a classifier
> > while inserting and deleting rules.
> > 
> > Understood, however it will be difficult to estimate, particularly if a PMD
> > must rearrange flow rules to make room for a new one due to priority levels
> > collision or some other HW-related reason. I mean, spent time cannot be
> > assumed to be constant, even PMDs cannot know in advance because it also
> > depends on the performance of the host CPU.
> > 
> > Such applications may find it easier to measure elapsed time for the rules
> > they create, make statistics and extrapolate from this information for future
> > rules. I do not think the PMD can help much here.
> [Sugesh] From an application point of view this can be an issue.
> There is even a security concern when we program a short-lived flow. Let's
> consider this case:
> 
> 1) The control plane programs the hardware with a queue termination flow.
> 2) The software dataplane is programmed to treat the packets from the specific
> queue accordingly.
> 3) The flow is removed from the hardware (let's consider this a long wait
> process), or the hardware may even take more time to report the status than to
> remove the flow physically. Now the packets in the queue are no longer
> considered as matched/flow hit, because the software dataplane update is yet
> to happen.
> We need a way to sync between the software datapath and the classifier APIs
> even though they are both programmed from a different control thread.
> 
> Are we saying these APIs are only meant for user-defined static flows??

No, that is definitely not the intent. These are good points.

With the specified API, applications may have to adapt their logic and take
extra precautions in order to remain on the safe side at all times.

For your above example, the application cannot assume a rule is
added/deleted as long as the PMD has not completed the related operation,
which means keeping the SW rule/fallback in place in the meantime. This should
handle security concerns as long as, after removing a rule, packets end up in
a default queue entirely processed by SW. Obviously this may worsen response
time.

The ID action can help with this. By knowing which rule a received packet is
associated with, processing can be temporarily offloaded by another thread
without much complexity.
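
For instance (sketch only; the flag and field names below exist in today's
rte_mbuf, while the helper functions and lookup table are made up):

 #include <rte_mbuf.h>

 static inline void
 dispatch_rx_packet(struct rte_mbuf *m)
 {
     if ((m->ol_flags & (PKT_RX_FDIR | PKT_RX_FDIR_ID)) ==
         (PKT_RX_FDIR | PKT_RX_FDIR_ID)) {
         uint32_t id = m->hash.fdir.hi; /* ID attached by the HW rule */

         if (rule_ready[id])           /* hypothetical application table */
             process_offloaded(m, id); /* hypothetical helper */
         else
             process_sw_fallback(m);   /* hypothetical helper */
     } else {
         process_sw_fallback(m);       /* no HW match, normal path */
     }
 }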

I think applications have to implement SW fallbacks all the time, as even
some sort of guarantee on the flow rule processing time may not be enough to
avoid misdirected packets and related security issues.

Let's wait for applications to start using this API and then consider an
extra set of asynchronous / real-time functions when the need arises. It
should not impact the way rules are specified.

> > > > > [Sugesh] Another query is on the synchronization part. What if
> > > > > same rules
> > > > are
> > > > > handled from different threads? Is application responsible for
> > > > > handling the
> > > > concurrent
> > > > > hardware programming?
> > > >
> > > > Like most (if not all) DPDK APIs, applications are responsible for
> > > > managing locking issues as described in 4.3 (Behavior). Since this is
> > > > a control path API and applications usually have a single control
> > > > thread, locking should not be necessary in most cases.
> > > >
> > > > Regarding my above comment about using several control threads to
> > > > manage different devices, section 4.3 says:
> > > >
> > > >  "There is no provision for reentrancy/multi-thread safety, although
> > > > nothing  should prevent different devices from being configured at
> > > > the same  time. PMDs may protect their control path functions
> > accordingly."
> > > >
> > > > I'd like to emphasize it is not "per port" but "per device", since
> > > > in a few cases a configurable resource is shared by several ports.
> > > > It may be difficult for applications to determine which ports are
> > > > shared by a given device but this falls outside the scope of this API.
> > > >
> > > > Do you think adding the guarantee that it is always safe to
> > > > configure two different ports simultaneously without locking from
> > > > the application side is necessary? In which case the PMD would be
> > > > responsible for locking shared resources.
> > > [Sugesh] This would be a little bit complicated when some of the ports are
> > > not under DPDK itself (what if one port is managed by the kernel), or ports
> > > are tied to different applications. Locking in the PMD helps when the ports
> > > are accessed by multiple DPDK applications. However what if the port itself
> > is not under DPDK?
> > 
> > Well, either we do not care about what happens outside of the DPDK
> > context, or PMDs must find a way to satisfy everyone. I'm not a fan of locking
> > either but it would be nice if flow rules configuration could be attempted on
> > different ports simultaneously without the risk of wrecking anything, so that
> > applications do not need to care.
> > 
> > Possible cases for a dual port device with global flow rule settings affecting
> > both ports:
> > 
> > 1) ports 1 & 2 are managed by DPDK: this is the easy case, a rule that needs
> >    to alter a global setting necessary for an existing rule on any port is
> >    not allowed (EEXIST). PMD must maintain a device context common to both
> >    ports in order for this to work. This context is either under lock, or
> >    the first port on which a flow rule is created owns all future flow
> >    rules.
> > 
> > 2) port 1 is managed by DPDK, port 2 by something else, the PMD is aware of
> >    it and knows that port 2 may modify the global context: no flow rules can
> >    be created from the DPDK application due to safety issues (EBUSY?).
> > 
> > 3) port 1 is managed by DPDK, port 2 by something else, the PMD is aware of
> >    it and knows that port 2 will not modify flow rules: PMD should not care,
> >    no lock necessary.
> > 
> > 4) port 1 is managed by DPDK, port 2 by something else and the PMD is not
> >    aware of it: either flow rules cannot be created ever at all, or we say
> >    it is the user's responsibility to make sure this does not happen.
> > 
> > Considering that most control operations performed by DPDK affect the
> > device regardless of other applications, I think 1) is the only case that should
> > be defined, otherwise 4), defined as user's responsibility.

No more comments on this part? What do you suggest?

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-15  9:23         ` Chandran, Sugesh
@ 2016-07-15 10:02           ` Chilikin, Andrey
  2016-07-18 13:26             ` Chandran, Sugesh
  2016-07-15 15:04           ` Adrien Mazarguil
  1 sibling, 1 reply; 54+ messages in thread
From: Chilikin, Andrey @ 2016-07-15 10:02 UTC (permalink / raw)
  To: Chandran, Sugesh, 'Adrien Mazarguil'
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Hi Sugesh,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Chandran, Sugesh
> Sent: Friday, July 15, 2016 10:23 AM
> To: 'Adrien Mazarguil' <adrien.mazarguil@6wind.com>

<snip>

> > > > To PMD maintainers: please comment if you know devices that
> > > > support tagging matching packets with more than 32 bits of
> > > > user-provided data!
> > > [Sugesh] I guess the flow director ID is 64 bit, the XL710 datasheet says so.
> > > And in the 'rte_mbuf' structure the 64 bit FDIR-ID is shared with the
> > > RSS hash. This can be a software driver limitation that exposes only
> > > 32 bits. Possibly because of cache alignment issues? Since the
> > > hardware can support 64 bits, I feel it makes sense to support 64 bits as well.

XL710 supports a 32 bit FDIR ID only; I believe you are mixing it up with the flexible payload data, which can take 0, 4 or 8 bytes of the RX descriptor.

Regards,
Andrey

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-13 20:03       ` Adrien Mazarguil
@ 2016-07-15  9:23         ` Chandran, Sugesh
  2016-07-15 10:02           ` Chilikin, Andrey
  2016-07-15 15:04           ` Adrien Mazarguil
  0 siblings, 2 replies; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-15  9:23 UTC (permalink / raw)
  To: 'Adrien Mazarguil'
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Thank you Adrien,
Please find below some more comments/inputs

Let me know your thoughts on this.


Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, July 13, 2016 9:03 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> On Mon, Jul 11, 2016 at 10:42:36AM +0000, Chandran, Sugesh wrote:
> > Hi Adrien,
> >
> > Thank you for your response,
> > Please see my comments inline.
> 
> Hi Sugesh,
> 
> Sorry for the delay, please see my answers inline as well.
> 
> [...]
> > > > > Flow director
> > > > > -------------
> > > > >
> > > > > Flow director (FDIR) is the name of the most capable filter
> > > > > type, which covers most features offered by others. As such, it
> > > > > is the most
> > > widespread
> > > > > in PMDs that support filtering (i.e. all of them besides **e1000**).
> > > > >
> > > > > It is also the only type that allows an arbitrary 32 bits value
> > > > > provided by applications to be attached to a filter and returned
> > > > > with matching packets instead of relying on the destination queue to
> recognize flows.
> > > > >
> > > > > Unfortunately, even FDIR requires applications to be aware of
> > > > > low-level capabilities and limitations (most of which come
> > > > > directly from **ixgbe**
> > > and
> > > > > **i40e**):
> > > > >
> > > > > - Bitmasks are set globally per device (port?), not per filter.
> > > > [Sugesh] This means the application cannot define filters that match
> > > > on
> > > arbitrary different offsets?
> > > > If that’s the case, I assume the application has to program the
> > > > bitmask in
> > > advance. Otherwise how does
> > > > the API framework deduce this bitmask information from the rules??
> > > > It's
> > > not very clear to me
> > > > how the application passes down the bitmask information for
> > > > multiple filters
> > > on the same port?
> > >
> > > This is my understanding of how flow director currently works,
> > > perhaps someone more familiar with it can answer this question better
> than I could.
> > >
> > > Let me take an example, if a particular device can only handle a
> > > single IPv4 mask common to all flow rules (say only to match
> > > destination addresses), updating that mask to also match the source
> > > address affects all defined and future flow rules simultaneously.
> > >
> > > That is how FDIR currently works and I think it is wrong, as it
> > > penalizes devices that do support individual bit-masks per rule, and
> > > is a little awkward from an application point of view.
> > >
> > > What I suggest for the new API instead is the ability to specify one
> > > bit-mask per rule, and let the PMD deal with HW limitations by
> > > automatically configuring global bitmasks from the first added rule,
> > > then refusing to add subsequent rules if they specify a conflicting
> > > bit-mask. Existing rules remain unaffected that way, and
> > > applications do not have to be extra cautious.
> > >
> > [Sugesh] The issue with that approach is that the hardware simply discards
> > the rule when it is a superset of the first one, even though the hardware
> > is capable of handling it. How is it guaranteed that the first rule will set
> > the bitmask for all the subsequent rules?
> 
> Just to clarify, the API only says that new rules cannot affect existing ones
> (which I think makes sense from a user's perspective), so as long as the PMD
> does whatever is needed to make all rules work together, there should not
> be any problem with this approach.
> 
> Even if the PMD has to temporarily remove an existing rule and reconfigure
> global masks in order to add subsequent rules, it is fine as long as packets
> aren't misdirected in the meantime (they may be dropped if there is no
> other choice).
[Sugesh] I feel this is fine. Thank you for confirming.
> 
> > How about having a CLASSIFIER_TYPE for the classifier. Every port can
> > have a set of supported flow types (e.g. L3_TYPE, L4_TYPE,
> > L4_TYPE_8BYTE_FLEX,
> > L4_TYPE_16BYTE_FLEX) based on the underlying FDIR support. The application
> > can query this and set the type accordingly while initializing the
> > port. This way the first rule need not set all the bits that may be needed in
> future rules.
> 
> Again from a user's POV, I think doing so would add unwanted HW-specific
> complexity.
> 
> However this concern can be handled through a different approach. Let's say
> user creates a pattern that only specifies an IP header with a given bit-mask.
> 
> In FDIR language this translates to:
> 
> - Set global mask for IPv4 accordingly, remaining global masks all zeroed
>   (assumed default value).
> 
> - Create an IPv4 flow.
> 
> From now on, all rules specifying an IPv4 header must have this exact bitmask
> (implicitly or explicitly), otherwise they cannot be created, i.e. the global
> bitmask for IPv4 becomes immutable.
> 
> Now user creates a TCPv4 rule (as long as it uses the same IPv4 mask), to
> handle this FDIR would:
> 
> - Keep global immutable mask for IPv4 unchanged, set global TCP mask
>   according to the flow rule.
> 
> - Create a TCPv4 flow.
> 
> From this point on, like IPv4, subsequent TCP rules must have this exact
> bitmask and so on as the global bitmask becomes immutable.
> 
> Basically, only protocol bit-masks affected by existing flow rules are
> immutable, others can be changed later. Global flow masks for protocols
> become mutable again when no existing flow rule uses them.
> 
> Does it look fine for you?
[Sugesh] This looks fine to me.
> 
> [...]
> > > > > +--------------------------+
> > > > > | Copy to queue 8          |
> > > > > +==========+===============+
> > > > > | PASSTHRU |               |
> > > > > +----------+-----------+---+
> > > > > | QUEUE    | ``queue`` | 8 |
> > > > > +----------+-----------+---+
> > > > >
> > > > > ``ID``
> > > > > ^^^^^^
> > > > >
> > > > > Attaches a 32 bit value to packets.
> > > > >
> > > > > +----------------------------------------------+
> > > > > | ID                                           |
> > > > > +========+=====================================+
> > > > > | ``id`` | 32 bit value to return with packets |
> > > > > +--------+-------------------------------------+
> > > > >
> > > > [Sugesh] I assume the application has to program the flow with a
> > > > unique ID and matching packets are stamped with this ID when
> > > > reporting to the software. The uniqueness of ID is NOT guaranteed
> > > > by the API framework. Correct me if I am wrong here.
> > >
> > > You are right, if the way I wrote it is not clear enough, I'm open
> > > to suggestions to improve it.
> > [Sugesh] I guess it's fine and would like to confirm the same. Perhaps
> > it would be nice to mention that the IDs are application defined.
> 
> OK, I will make it clearer.
> 
> > > > [Sugesh] Is it a limitation to use only 32 bit ID? Is it possible
> > > > to have a
> > > > 64 bit ID? So that application can use the control plane flow
> > > > pointer Itself as an ID. Does it make sense?
> > >
> > > I've specified a 32 bit ID for now because this is what FDIR
> > > supports and also what existing devices can report today AFAIK (i40e and
> mlx5).
> > >
> > > We could use 64 bit for future-proofing in a separate action like "ID64"
> > > when at least one device supports it.
> > >
> > > To PMD maintainers: please comment if you know devices that support
> > > tagging matching packets with more than 32 bits of user-provided
> > > data!
> > [Sugesh] I guess the flow director ID is 64 bit, the XL710 datasheet says so.
> > And in the 'rte_mbuf' structure the 64 bit FDIR-ID is shared with the RSS
> > hash. This can be a software driver limitation that exposes only 32
> > bits. Possibly because of cache alignment issues? Since the hardware
> > can support 64 bits, I feel it makes sense to support 64 bits as well.
> 
> I agree we need 64 bit support, but then we also need a solution for devices
> that support only 32 bit. Possible methods I can think of:
> 
> - A separate "ID64" action (or a "ID32" one, perhaps with a better name).
> 
> - A single ID action with an unlimited number of bytes to return with
>   packets (would actually be a string). PMDs can then refuse to create flow
>   rules requesting an unsupported number of bytes. Devices supporting
> fewer
>   than 32 bits are also included this way without the need for yet another
>   action.
> 
> Thoughts?
[Sugesh] I feel the single ID approach is much better. But I would say a fixed
size ID is easier to handle at the upper layers. Say the PMD returns a 64 bit ID
in which the MSBs are masked out, based on how many bits the hardware can
support. The PMD can refuse an unsupported number of bytes when requested. So
the size of the ID is going to be a parameter used to program the flow.
What do you think?
> 
> [...]
> > > > [Sugesh] Another concern is the cost and time of installing these
> > > > rules in the hardware. Can we make these APIs time bound (or at
> > > > least an option
> > > to
> > > > set the time limit to execute these APIs), so that Application
> > > > doesn’t have to wait so long when installing and deleting flows
> > > with
> > > > slow hardware/NIC. What do you think? Most of the datapath flow
> > > installations are
> > > > dynamic and triggered only when there is an ingress traffic. Delays
> > > > in flow insertion/deletion have unpredictable
> > > consequences.
> > >
> > > This API is (currently) aimed at the control path only, and must
> > > indeed be assumed to be slow. Creating millions of rules may take
> > > quite long as it may involve syscalls and other time-consuming
> > > synchronization things on the PMD side.
> > >
> > > So currently there is no plan to have rules added from the data path
> > > with time constraints. I think it would be implemented through a
> > > different set of functions anyway.
> > >
> > > I do not think adding time limits is practical, even specifying in
> > > the API that creating a single flow rule must take less than a
> > > maximum number of seconds in order to be effective is too much of a
> > > constraint (applications that create all flows during init may not care after
> all).
> > >
> > > You should consider in any case that modifying flow rules will
> > > always be slower than receiving packets, there is no way around
> > > that. Applications have to live with it and provide a software
> > > fallback for incoming packets while managing flow rules.
> > >
> > > Moreover, think about what happens when you hit the maximum
> number
> > > of flow rules and cannot create any more. Applications need to
> > > implement some kind of fallback in their data path.
> > >
> > > Offloading flows in HW is also only useful if they live much longer
> > > than the time taken to create and delete them. Perhaps applications
> > > may choose to do so after detecting long lived flows such as TCP
> > > sessions.
> > >
> > > You may have one separate control thread dedicated to manage flows
> > > and keep your normal control thread unaffected by delays. Several
> > > threads can even be dedicated, one per device.
> > [Sugesh] I agree that the flow insertion cannot be as fast as the
> > packet receiving rate.  From an application point of view the problem
> > will be when hardware flow insertion takes longer than software flow
> > insertion. At least the application has to know the cost of
> > inserting/deleting a rule in hardware beforehand. Otherwise how
> > can the application choose the right flow candidate for hardware? My point
> here is that the application expects a deterministic behavior from a classifier
> while inserting and deleting rules.
> 
> Understood, however it will be difficult to estimate, particularly if a PMD
> must rearrange flow rules to make room for a new one due to priority levels
> collision or some other HW-related reason. I mean, spent time cannot be
> assumed to be constant, even PMDs cannot know in advance because it also
> depends on the performance of the host CPU.
> 
> Such applications may find it easier to measure elapsed time for the rules
> they create, make statistics and extrapolate from this information for future
> rules. I do not think the PMD can help much here.
[Sugesh] From an application point of view this can be an issue.
There is even a security concern when we program a short-lived flow. Let's consider this case:

1) The control plane programs the hardware with a queue termination flow.
2) The software dataplane is programmed to treat the packets from the specific queue accordingly.
3) The flow is removed from the hardware (let's consider this a long wait process),
or the hardware may even take more time to report the status than to remove the flow
physically. Now the packets in the queue are no longer considered as matched/flow hit,
because the software dataplane update is yet to happen.
We need a way to sync between the software datapath and the classifier APIs even though
they are both programmed from a different control thread.

Are we saying these APIs are only meant for user-defined static flows??


> 
> > > > [Sugesh] Another query is on the synchronization part. What if
> > > > same rules
> > > are
> > > > handled from different threads? Is application responsible for
> > > > handling the
> > > concurrent
> > > > hardware programming?
> > >
> > > Like most (if not all) DPDK APIs, applications are responsible for
> > > managing locking issues as described in 4.3 (Behavior). Since this is
> > > a control path API and applications usually have a single control
> > > thread, locking should not be necessary in most cases.
> > >
> > > Regarding my above comment about using several control threads to
> > > manage different devices, section 4.3 says:
> > >
> > >  "There is no provision for reentrancy/multi-thread safety, although
> > > nothing  should prevent different devices from being configured at
> > > the same  time. PMDs may protect their control path functions
> accordingly."
> > >
> > > I'd like to emphasize it is not "per port" but "per device", since
> > > in a few cases a configurable resource is shared by several ports.
> > > It may be difficult for applications to determine which ports are
> > > shared by a given device but this falls outside the scope of this API.
> > >
> > > Do you think adding the guarantee that it is always safe to
> > > configure two different ports simultaneously without locking from
> > > the application side is necessary? In which case the PMD would be
> > > responsible for locking shared resources.
> > [Sugesh] This would be a little bit complicated when some of the ports are
> > not under DPDK itself (what if one port is managed by the kernel), or ports
> > are tied to different applications. Locking in the PMD helps when the ports
> > are accessed by multiple DPDK applications. However what if the port itself
> is not under DPDK?
> 
> Well, either we do not care about what happens outside of the DPDK
> context, or PMDs must find a way to satisfy everyone. I'm not a fan of locking
> either but it would be nice if flow rules configuration could be attempted on
> different ports simultaneously without the risk of wrecking anything, so that
> applications do not need to care.
> 
> Possible cases for a dual port device with global flow rule settings affecting
> both ports:
> 
> 1) ports 1 & 2 are managed by DPDK: this is the easy case, a rule that needs
>    to alter a global setting necessary for an existing rule on any port is
>    not allowed (EEXIST). PMD must maintain a device context common to both
>    ports in order for this to work. This context is either under lock, or
>    the first port on which a flow rule is created owns all future flow
>    rules.
> 
> 2) port 1 is managed by DPDK, port 2 by something else, the PMD is aware of
>    it and knows that port 2 may modify the global context: no flow rules can
>    be created from the DPDK application due to safety issues (EBUSY?).
> 
> 3) port 1 is managed by DPDK, port 2 by something else, the PMD is aware of
>    it and knows that port 2 will not modify flow rules: PMD should not care,
>    no lock necessary.
> 
> 4) port 1 is managed by DPDK, port 2 by something else and the PMD is not
>    aware of it: either flow rules cannot be created ever at all, or we say
>    it is the user's responsibility to make sure this does not happen.
> 
> Considering that most control operations performed by DPDK affect the
> device regardless of other applications, I think 1) is the only case that should
> be defined, otherwise 4), defined as user's responsibility.
> 
> > > > > Destruction
> > > > > ~~~~~~~~~~~
> > > > >
> > > > > Flow rules destruction is not automatic, and a queue should not
> > > > > be
> > > released
> > > > > if any are still attached to it. Applications must take care of
> > > > > performing this step before releasing resources.
> > > > >
> > > > > ::
> > > > >
> > > > >  int
> > > > >  rte_flow_destroy(uint8_t port_id,
> > > > >                   struct rte_flow *flow);
> > > > >
> > > > >
> > > > [Sugesh] I would suggest that having a clean-up API is really useful,
> > > > as the
> > > releasing of a
> > > > queue (is it applicable to releasing a port too?) does not
> > > > guarantee
> > > automatic flow
> > > > destruction.
> > >
> > > Would something like rte_flow_flush(port_id) do the trick? I wanted
> > > to emphasize in this first draft that applications should really
> > > keep the flow pointers around in order to manage/destroy them. It is
> > > their responsibility, not PMD's.
> > [Sugesh] Thanks, I think the flush call will do.
> 
> Noted, will add it.
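
Something along these lines perhaps (sketch only, the final signature may
differ):

 int
 rte_flow_flush(uint8_t port_id);

It would destroy all flow rules associated with a port in a single call.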
> 
> > > > This way the application can initialize the port, clean up all the
> > > > existing rules and create new rules on a clean slate.
> > >
> > > No resource can be released as long as a flow rule is using it (bad
> > > things may happen otherwise), all flow rules must be destroyed
> > > first, thus none can possibly remain after initializing a port. It
> > > is assumed that PMDs do automatic clean up during init if necessary to
> ensure this.
> > [Sugesh] That will do.
> 
> I will make it more explicit as well.
> 
> [...]
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-11 10:42     ` Chandran, Sugesh
@ 2016-07-13 20:03       ` Adrien Mazarguil
  2016-07-15  9:23         ` Chandran, Sugesh
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-13 20:03 UTC (permalink / raw)
  To: Chandran, Sugesh
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

On Mon, Jul 11, 2016 at 10:42:36AM +0000, Chandran, Sugesh wrote:
> Hi Adrien,
> 
> Thank you for your response,
> Please see my comments inline.

Hi Sugesh,

Sorry for the delay, please see my answers inline as well.

[...]
> > > > Flow director
> > > > -------------
> > > >
> > > > Flow director (FDIR) is the name of the most capable filter type, which
> > > > covers most features offered by others. As such, it is the most
> > widespread
> > > > in PMDs that support filtering (i.e. all of them besides **e1000**).
> > > >
> > > > It is also the only type that allows an arbitrary 32 bits value provided by
> > > > applications to be attached to a filter and returned with matching packets
> > > > instead of relying on the destination queue to recognize flows.
> > > >
> > > > Unfortunately, even FDIR requires applications to be aware of low-level
> > > > capabilities and limitations (most of which come directly from **ixgbe**
> > and
> > > > **i40e**):
> > > >
> > > > - Bitmasks are set globally per device (port?), not per filter.
> > > [Sugesh] This means the application cannot define filters that match on
> > arbitrary different offsets?
> > > If that’s the case, I assume the application has to program the bitmask in
> > advance. Otherwise how does
> > > the API framework deduce this bitmask information from the rules?? It's
> > not very clear to me
> > > how the application passes down the bitmask information for multiple filters
> > on the same port?
> > 
> > This is my understanding of how flow director currently works, perhaps
> > someone more familiar with it can answer this question better than I could.
> > 
> > Let me take an example, if a particular device can only handle a single IPv4
> > mask common to all flow rules (say only to match destination addresses),
> > updating that mask to also match the source address affects all defined and
> > future flow rules simultaneously.
> > 
> > That is how FDIR currently works and I think it is wrong, as it penalizes
> > devices that do support individual bit-masks per rule, and is a little
> > awkward from an application point of view.
> > 
> > What I suggest for the new API instead is the ability to specify one
> > bit-mask per rule, and let the PMD deal with HW limitations by automatically
> > configuring global bitmasks from the first added rule, then refusing to add
> > subsequent rules if they specify a conflicting bit-mask. Existing rules
> > remain unaffected that way, and applications do not have to be extra
> > cautious.
> > 
> [Sugesh] The issue with that approach is that the hardware simply discards the rule
> when it is a superset of the first one, even though the hardware is capable of
> handling it. How is it guaranteed that the first rule will set the bitmask for all
> the subsequent rules?

Just to clarify, the API only says that new rules cannot affect existing
ones (which I think makes sense from a user's perspective), so as long as
the PMD does whatever is needed to make all rules work together, there
should not be any problem with this approach.

Even if the PMD has to temporarily remove an existing rule and reconfigure
global masks in order to add subsequent rules, it is fine as long as packets
aren't misdirected in the meantime (they may be dropped if there is no other
choice).

> How about having a CLASSIFIER_TYPE for the classifier. Every port can have a
> set of supported flow types (e.g. L3_TYPE, L4_TYPE, L4_TYPE_8BYTE_FLEX,
> L4_TYPE_16BYTE_FLEX) based on the underlying FDIR support. The application can query
> this and set the type accordingly while initializing the port. This way the first rule
> need not set all the bits that may be needed in future rules.

Again from a user's POV, I think doing so would add unwanted HW-specific
complexity. 

However this concern can be handled through a different approach. Let's say
user creates a pattern that only specifies an IP header with a given
bit-mask.

In FDIR language this translates to:

- Set global mask for IPv4 accordingly, remaining global masks all zeroed
  (assumed default value).

- Create an IPv4 flow.

From now on, all rules specifying an IPv4 header must have this exact
bitmask (implicitly or explicitly), otherwise they cannot be created,
i.e. the global bitmask for IPv4 becomes immutable.

Now user creates a TCPv4 rule (as long as it uses the same IPv4 mask), to
handle this FDIR would:

- Keep global immutable mask for IPv4 unchanged, set global TCP mask
  according to the flow rule.

- Create a TCPv4 flow.

From this point on, like IPv4, subsequent TCP rules must have this exact
bitmask and so on as the global bitmask becomes immutable.

Basically, only protocol bit-masks affected by existing flow rules are
immutable, others can be changed later. Global flow masks for protocols
become mutable again when no existing flow rule uses them.
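
A rough PMD-side sketch of this scheme, with made-up names and an IPv4-only
mask state for brevity:

 #include <errno.h>
 #include <stdint.h>
 #include <string.h>

 struct ipv4_mask { /* hypothetical per-protocol mask type */
     uint32_t src_addr;
     uint32_t dst_addr;
 };

 struct global_mask_state {
     struct ipv4_mask mask; /* current global mask */
     unsigned int refcnt;   /* rules using it; nonzero => immutable */
 };

 static int
 check_or_set_ipv4_mask(struct global_mask_state *st,
                        const struct ipv4_mask *req)
 {
     if (st->refcnt == 0)
         st->mask = *req; /* first rule sets the global mask */
     else if (memcmp(&st->mask, req, sizeof(*req)) != 0)
         return -EEXIST;  /* conflicting mask, refuse the rule */
     st->refcnt++;        /* mask is now (or remains) immutable */
     return 0;
 }

The mask would become mutable again once refcnt drops back to zero as rules
are destroyed.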

Does it look fine for you?

[...]
> > > > +--------------------------+
> > > > | Copy to queue 8          |
> > > > +==========+===============+
> > > > | PASSTHRU |               |
> > > > +----------+-----------+---+
> > > > | QUEUE    | ``queue`` | 8 |
> > > > +----------+-----------+---+
> > > >
> > > > ``ID``
> > > > ^^^^^^
> > > >
> > > > Attaches a 32 bit value to packets.
> > > >
> > > > +----------------------------------------------+
> > > > | ID                                           |
> > > > +========+=====================================+
> > > > | ``id`` | 32 bit value to return with packets |
> > > > +--------+-------------------------------------+
> > > >
> > > [Sugesh] I assume the application has to program the flow
> > > with a unique ID and matching packets are stamped with this ID
> > > when reporting to the software. The uniqueness of ID is NOT
> > > guaranteed by the API framework. Correct me if I am wrong here.
> > 
> > You are right, if the way I wrote it is not clear enough, I'm open to
> > suggestions to improve it.
> [Sugesh] I guess it's fine and would like to confirm the same. Perhaps
> it would be nice to mention that the IDs are application defined.

OK, I will make it clearer.

> > > [Sugesh] Is it a limitation to use only 32 bit ID? Is it possible to have a
> > > 64 bit ID? So that application can use the control plane flow pointer
> > > Itself as an ID. Does it make sense?
> > 
> > I've specified a 32 bit ID for now because this is what FDIR supports and
> > also what existing devices can report today AFAIK (i40e and mlx5).
> > 
> > We could use 64 bit for future-proofing in a separate action like "ID64"
> > when at least one device supports it.
> > 
> > To PMD maintainers: please comment if you know devices that support
> > tagging
> > matching packets with more than 32 bits of user-provided data!
> [Sugesh] I guess the flow director ID is 64 bit, the XL710 datasheet says so.
> And in the 'rte_mbuf' structure the 64 bit FDIR-ID is shared with the RSS hash. This
> can be a software driver limitation that exposes only 32 bits. Possibly because of
> cache alignment issues? Since the hardware can support 64 bits, I feel it makes
> sense to support 64 bits as well.

I agree we need 64 bit support, but then we also need a solution for devices
that support only 32 bit. Possible methods I can think of:

- A separate "ID64" action (or a "ID32" one, perhaps with a better name).

- A single ID action with an unlimited number of bytes to return with
  packets (would actually be a string). PMDs can then refuse to create flow
  rules requesting an unsupported number of bytes. Devices supporting fewer
  than 32 bits are also included this way without the need for yet another
  action.

Thoughts?

[...]
> > > [Sugesh] Another concern is the cost and time of installing these rules
> > > in the hardware. Can we make these APIs time bound (or at least an option
> > to
> > > set the time limit to execute these APIs), so that
> > > Application doesn’t have to wait so long when installing and deleting flows
> > with
> > > slow hardware/NIC. What do you think? Most of the datapath flow
> > installations are
> > > dynamic and triggered only when there is
> > > an ingress traffic. Delays in flow insertion/deletion have unpredictable
> > consequences.
> > 
> > This API is (currently) aimed at the control path only, and must indeed be
> > assumed to be slow. Creating millions of rules may take quite long as it may
> > involve syscalls and other time-consuming synchronization things on the
> > PMD
> > side.
> > 
> > So currently there is no plan to have rules added from the data path with
> > time constraints. I think it would be implemented through a different set of
> > functions anyway.
> > 
> > I do not think adding time limits is practical, even specifying in the API
> > that creating a single flow rule must take less than a maximum number of
> > seconds in order to be effective is too much of a constraint (applications
> > that create all flows during init may not care after all).
> > 
> > You should consider in any case that modifying flow rules will always be
> > slower than receiving packets, there is no way around that. Applications
> > have to live with it and provide a software fallback for incoming packets
> > while managing flow rules.
> > 
> > Moreover, think about what happens when you hit the maximum number of
> > flow
> > rules and cannot create any more. Applications need to implement some
> > kind
> > of fallback in their data path.
> > 
> > Offloading flows in HW is also only useful if they live much longer than the
> > time taken to create and delete them. Perhaps applications may choose to
> > do
> > so after detecting long lived flows such as TCP sessions.
> > 
> > You may have one separate control thread dedicated to manage flows and
> > keep your normal control thread unaffected by delays. Several threads can
> > even be dedicated, one per device.
> [Sugesh] I agree that the flow insertion cannot be as fast as the packet receiving
> rate. From an application point of view the problem arises when hardware flow
> insertion takes longer than software flow insertion. At least the application has to know
> the cost of inserting/deleting a rule in hardware beforehand; otherwise how can the application
> choose the right flow candidates for hardware? My point here is that the application expects
> a deterministic behavior from a classifier while inserting and deleting rules.

Understood, however it will be difficult to estimate, particularly if a PMD
must rearrange flow rules to make room for a new one due to priority level
collisions or some other HW-related reason. I mean, the time spent cannot be
assumed to be constant; even PMDs cannot know it in advance because it also
depends on the performance of the host CPU.

Such applications may find it easier to measure the elapsed time for the rules
they create, gather statistics and extrapolate from this information for
future rules. I do not think the PMD can help much here.
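
Something like the following is all it takes on the application side (a
minimal sketch against the draft prototype; error handling and the rule
definition itself are omitted):

  #include <time.h>

  /* Wrap rte_flow_create() to report elapsed seconds for a single
   * insertion, so the application can build its own statistics. */
  static double
  flow_create_timed(uint8_t port_id,
                    const struct rte_flow_pattern *pattern,
                    const struct rte_flow_actions *actions,
                    struct rte_flow **flow)
  {
      struct timespec t0, t1;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      *flow = rte_flow_create(port_id, pattern, actions);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      return (t1.tv_sec - t0.tv_sec) +
             (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  }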

> > > [Sugesh] Another query is on the synchronization part. What if the same
> > > rules are handled from different threads? Is the application responsible
> > > for handling the concurrent hardware programming?
> > 
> > Like most (if not all) DPDK APIs, applications are responsible for managing
> > locking issues as described in 4.3 (Behavior). Since this is a control path
> > API and applications usually have a single control thread, locking should
> > not be necessary in most cases.
> > 
> > Regarding my above comment about using several control threads to
> > manage
> > different devices, section 4.3 says:
> > 
> >  "There is no provision for reentrancy/multi-thread safety, although nothing
> >  should prevent different devices from being configured at the same
> >  time. PMDs may protect their control path functions accordingly."
> > 
> > I'd like to emphasize it is not "per port" but "per device", since in a few
> > cases a configurable resource is shared by several ports. It may be
> > difficult for applications to determine which ports are shared by a given
> > device but this falls outside the scope of this API.
> > 
> > Do you think adding the guarantee that it is always safe to configure two
> > different ports simultaneously without locking from the application side is
> > necessary? In which case the PMD would be responsible for locking shared
> > resources.
> [Sugesh] This would be a little bit complicated when some ports are not under
> DPDK itself (what if one port is managed by the kernel?) or ports are tied to
> different applications. Locking in the PMD helps when the ports are accessed by
> multiple DPDK applications. However, what if the port itself is not under DPDK?

Well, either we do not care about what happens outside of the DPDK context,
or PMDs must find a way to satisfy everyone. I'm not a fan of locking either
but it would be nice if flow rule configuration could be attempted on
different ports simultaneously without the risk of wrecking anything, so
that applications do not need to care.

Possible cases for a dual port device with global flow rule settings
affecting both ports:

1) ports 1 & 2 are managed by DPDK: this is the easy case, a rule that needs
   to alter a global setting necessary for an existing rule on any port is
   not allowed (EEXIST). PMD must maintain a device context common to both
   ports in order for this to work. This context is either under lock, or
   the first port on which a flow rule is created owns all future flow
   rules.

2) port 1 is managed by DPDK, port 2 by something else, the PMD is aware of
   it and knows that port 2 may modify the global context: no flow rules can
   be created from the DPDK application due to safety issues (EBUSY?).

3) port 1 is managed by DPDK, port 2 by something else, the PMD is aware of
   it and knows that port 2 will not modify flow rules: PMD should not care,
   no lock necessary.

4) port 1 is managed by DPDK, port 2 by something else and the PMD is not
   aware of it: either flow rules can never be created at all, or we say
   it is the user's responsibility to make sure this does not happen.

Considering that most control operations performed by DPDK affect the device
regardless of other applications, I think 1) is the only case that should be
defined, otherwise 4), defined as the user's responsibility.
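
In case 1), the PMD-side context could be as simple as the following
(a hypothetical sketch, all names are made up):

  #include <rte_spinlock.h>

  /* Device-wide flow context shared by both ports. */
  struct pmd_flow_ctx {
      rte_spinlock_t lock;  /* serializes updates to global settings */
      uint64_t global_conf; /* settings affecting both ports */
      unsigned int refcnt;  /* flow rules relying on global_conf */
  };

A rule requiring a different global_conf while refcnt is nonzero on either
port would then be refused with EEXIST as described above.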

> > > > Destruction
> > > > ~~~~~~~~~~~
> > > >
> > > > Flow rules destruction is not automatic, and a queue should not be
> > released
> > > > if any are still attached to it. Applications must take care of performing
> > > > this step before releasing resources.
> > > >
> > > > ::
> > > >
> > > >  int
> > > >  rte_flow_destroy(uint8_t port_id,
> > > >                   struct rte_flow *flow);
> > > >
> > > >
> > > [Sugesh] I would suggest having a clean-up API; it is really useful as
> > > releasing a queue (is it applicable to releasing a port too?) does not
> > > guarantee automatic flow destruction.
> > 
> > Would something like rte_flow_flush(port_id) do the trick? I wanted to
> > emphasize in this first draft that applications should really keep the flow
> > pointers around in order to manage/destroy them. It is their responsibility,
> > not PMD's.
> [Sugesh] Thanks, I think the flush call will do.

Noted, will add it.
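
Tentatively it should mirror rte_flow_destroy() (a sketch only, not part
of the draft header yet):

  /* Destroy all flow rules associated with a port. */
  int
  rte_flow_flush(uint8_t port_id);

Like the other functions, it would return 0 on success and a negative
errno value otherwise with rte_errno set.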

> > > This way the application can initialize the port,
> > > clean up all the existing rules and create new rules on a clean slate.
> > 
> > No resource can be released as long as a flow rule is using it (bad things
> > may happen otherwise), all flow rules must be destroyed first, thus none can
> > possibly remain after initializing a port. It is assumed that PMDs do
> > automatic clean up during init if necessary to ensure this.
> [Sugesh] That will do.

I will make it more explicit as well.

[...]

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-08 13:03   ` Adrien Mazarguil
@ 2016-07-11 10:42     ` Chandran, Sugesh
  2016-07-13 20:03       ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-11 10:42 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Hi Adrien,

Thank you for your response,
Please see my comments inline.

Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Friday, July 8, 2016 2:03 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>;
> Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Rasesh Mody <rasesh.mody@qlogic.com>; Ajit
> Khaparde <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> 
> Hi Sugesh,
> 
> On Thu, Jul 07, 2016 at 11:15:07PM +0000, Chandran, Sugesh wrote:
> > Hi Adrien,
> >
> > Thank you for proposing this. It would be really useful for applications
> > such as OVS-DPDK.
> > Please find my comments and questions inline below prefixed with [Sugesh].
> > Most of them are from the perspective of enabling these APIs in
> > applications such as OVS-DPDK.
> 
> Thanks, I'm replying below.
> 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien
> Mazarguil
> > > Sent: Tuesday, July 5, 2016 7:17 PM
> > > To: dev@dpdk.org
> > > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Rasesh
> > > Mody <rasesh.mody@qlogic.com>; Ajit Khaparde
> > > <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> > > <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>;
> > > Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>;
> Chen,
> > > Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> > > Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> > > <sony.chacko@qlogic.com>; Jerin Jacob
> > > <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>
> > > Subject: [dpdk-dev] [RFC] Generic flow director/filtering/classification
> API
> > >

<<<<<----Snipped out ---->>>>>
> > > Flow director
> > > -------------
> > >
> > > Flow director (FDIR) is the name of the most capable filter type, which
> > > covers most features offered by others. As such, it is the most
> widespread
> > > in PMDs that support filtering (i.e. all of them besides **e1000**).
> > >
> > > It is also the only type that allows an arbitrary 32 bits value provided by
> > > applications to be attached to a filter and returned with matching packets
> > > instead of relying on the destination queue to recognize flows.
> > >
> > > Unfortunately, even FDIR requires applications to be aware of low-level
> > > capabilities and limitations (most of which come directly from **ixgbe**
> and
> > > **i40e**):
> > >
> > > - Bitmasks are set globally per device (port?), not per filter.
> > [Sugesh] This means the application cannot define filters that match on
> > arbitrarily different offsets?
> > If that's the case, I assume the application has to program the bitmask in
> > advance. Otherwise how does the API framework deduce this bitmask
> > information from the rules? It's not very clear to me how the application
> > passes down the bitmask information for multiple filters on the same port.
> 
> This is my understanding of how flow director currently works, perhaps
> someone more familiar with it can answer this question better than I could.
> 
> Let me take an example: if a particular device can only handle a single IPv4
> mask common to all flow rules (say only to match destination addresses),
> updating that mask to also match the source address affects all defined and
> future flow rules simultaneously.
> 
> That is how FDIR currently works and I think it is wrong, as it penalizes
> devices that do support individual bit-masks per rule, and is a little
> awkward from an application point of view.
> 
> What I suggest for the new API instead is the ability to specify one
> bit-mask per rule, and let the PMD deal with HW limitations by automatically
> configuring global bitmasks from the first added rule, then refusing to add
> subsequent rules if they specify a conflicting bit-mask. Existing rules
> remain unaffected that way, and applications do not have to be extra
> cautious.
> 
[Sugesh] The issue with that approach is that the hardware simply discards the rule
when it is a superset of the first one even though the hardware is capable of
handling it. How is it guaranteed that the first rule will set the bitmask for all
subsequent rules?
How about having a CLASSIFIER_TYPE for the classifier? Every port can have a
set of supported flow types (e.g. L3_TYPE, L4_TYPE, L4_TYPE_8BYTE_FLEX,
L4_TYPE_16BYTE_FLEX) based on the underlying FDIR support. The application can query
this and set the type accordingly while initializing the port. This way the first rule need
not set all the bits that may be needed in future rules.
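
A rough sketch of the kind of definitions I mean (purely illustrative,
none of these names exist anywhere today):

  enum rte_flow_classifier_type {
      RTE_FLOW_CLASSIFIER_L3_TYPE,
      RTE_FLOW_CLASSIFIER_L4_TYPE,
      RTE_FLOW_CLASSIFIER_L4_TYPE_8BYTE_FLEX,
      RTE_FLOW_CLASSIFIER_L4_TYPE_16BYTE_FLEX,
  };

The port would report the supported subset of these and the application
would pick one while initializing the port.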
 
> > > ``PASSTHRU``
> > > ^^^^^^^^^^^^
> > >
> > > Leaves packets up for additional processing by subsequent flow rules.
> This
> > > is the default when a rule does not contain a terminating action, but can
> be
> > > specified to force a rule to become non-terminating.
> > >
> > > - No configurable property.
> > >
> > > +---------------+
> > > | PASSTHRU      |
> > > +===============+
> > > | no properties |
> > > +---------------+
> > >
> > > Example to copy a packet to a queue and continue processing by
> subsequent
> > > flow rules:
> > [Sugesh] If a packet gets copied to a queue, it's a terminating action.
> > How is it possible to perform subsequent actions after the packet has
> > already moved to the queue? How does it differ from the DUP action?
> > Am I missing anything here?
> 
> Devices may not support the combination of QUEUE + PASSTHRU (i.e. making
> QUEUE non-terminating). However these same devices may expose the
> ability to
> copy a packet to another (sniffer) queue all while keeping the rule
> terminating (QUEUE + DUP but no PASSTHRU).
> 
> DUP with two rules, assuming priorities and PASSTHRU are supported:
> 
> - pattern X, priority 0; actions: QUEUE 5, PASSTHRU (non-terminating)
> 
> - pattern X, priority 1; actions: QUEUE 6 (terminating)
> 
> DUP with two actions on a single rule and a single priority:
> 
> - pattern X, priority 0; actions: DUP 5, QUEUE 6 (terminating)
> 
> If supported, from an application point of view the end result is similar in
> both cases (note the second case may be implemented by the PMD using
> two HW
> rules internally).
> 
> However the second case does not waste a priority level and clearly states
> the intent to the PMD which is more likely to be supported. If HW supports
> DUP directly it is even faster since there is a single rule. That is why I
> thought having DUP as an action would be useful.
[Sugesh] Thank you for the clarification. It makes sense to me now.
> 
> > > +--------------------------+
> > > | Copy to queue 8          |
> > > +==========+===============+
> > > | PASSTHRU |               |
> > > +----------+-----------+---+
> > > | QUEUE    | ``queue`` | 8 |
> > > +----------+-----------+---+
> > >
> > > ``ID``
> > > ^^^^^^
> > >
> > > Attaches a 32 bit value to packets.
> > >
> > > +----------------------------------------------+
> > > | ID                                           |
> > > +========+=====================================+
> > > | ``id`` | 32 bit value to return with packets |
> > > +--------+-------------------------------------+
> > >
> > [Sugesh] I assume the application has to program the flow
> > with a unique ID and matching packets are stamped with this ID
> > when reporting to the software. The uniqueness of ID is NOT
> > guaranteed by the API framework. Correct me if I am wrong here.
> 
> You are right, if the way I wrote it is not clear enough, I'm open to
> suggestions to improve it.
[Sugesh] I guess it's fine and would like to confirm the same. Perhaps
it would be nice to mention that the IDs are application-defined.

> 
> > [Sugesh] Is it a limitation to use only a 32 bit ID? Is it possible to have a
> > 64 bit ID, so that the application can use the control plane flow pointer
> > itself as an ID? Does it make sense?
> 
> I've specified a 32 bit ID for now because this is what FDIR supports and
> also what existing devices can report today AFAIK (i40e and mlx5).
> 
> We could use 64 bit for future-proofness in a separate action like "ID64"
> when at least one device supports it.
> 
> To PMD maintainers: please comment if you know devices that support
> tagging
> matching packets with more than 32 bits of user-provided data!
[Sugesh] I guess the flow director ID is 64 bit, the XL710 datasheet says so.
And in the 'rte_mbuf' structure the 64 bit FDIR ID is shared with the RSS hash. This can be
a software driver limitation that exposes only 32 bits, possibly because of cache
alignment issues. Since the hardware can support 64 bits, I feel it makes sense
to support 64 bits as well.
> 
> > > .. raw:: pdf
> > >
> > >    PageBreak
> > >
> > > ``QUEUE``
> > > ^^^^^^^^^
> > >
> > > Assigns packets to a given queue index.
> > >
> > > - Terminating by default.
> > >
> > > +--------------------------------+
> > > | QUEUE                          |
> > > +===========+====================+
> > > | ``queue`` | queue index to use |
> > > +-----------+--------------------+
> > >
> > > ``DROP``
> > > ^^^^^^^^
> > >
> > > Drop packets.
> > >
> > > - No configurable property.
> > > - Terminating by default.
> > > - PASSTHRU overrides this action if both are specified.
> > >
> > > +---------------+
> > > | DROP          |
> > > +===============+
> > > | no properties |
> > > +---------------+
> > >
> > > ``COUNT``
> > > ^^^^^^^^^
> > >
> > [Sugesh] Should we really have to set the COUNT action explicitly for every rule?
> > IMHO it would be great for it to be an implicit action. Most applications
> > would be interested in the stats of almost all the filters/flows.
> 
> I can see why, but no, it must be explicitly requested because you may want
> to know in advance when it is not supported. Also considering it is
> something else to be done by HW (a separate action), we can assume
> enabling
> this may slow things down a bit.
> 
> HW limitations may also prevent you from having as many flow counters as
> you
> want, in which case you probably want to carefully pick which rules have
> them.
> 
> I think this action is most useful with DROP, VF and PF actions since
> those are currently the only ones where SW may not see the related
> packets.
> 
[Sugesh] Agreed and thanks for the clarification.

> > > Enables hits counter for this rule.
> > >
> > > This counter can be retrieved and reset through ``rte_flow_query()``, see
> > > ``struct rte_flow_query_count``.
> > >
> > > - Counters can be retrieved with ``rte_flow_query()``.
> > > - No configurable property.
> > >
> > > +---------------+
> > > | COUNT         |
> > > +===============+
> > > | no properties |
> > > +---------------+
> > >
> > > Query structure to retrieve and reset the flow rule hits counter:
> > >
> > > +------------------------------------------------+
> > > | COUNT query                                    |
> > > +===========+=====+==============================+
> > > | ``reset`` | in  | reset counter after query    |
> > > +-----------+-----+------------------------------+
> > > | ``hits``  | out | number of hits for this flow |
> > > +-----------+-----+------------------------------+
> > >
<<<<<<<<Snipped out >>>>
> > > ::
> > >
> > >  struct rte_flow *
> > >  rte_flow_create(uint8_t port_id,
> > >                  const struct rte_flow_pattern *pattern,
> > >                  const struct rte_flow_actions *actions);
> > >
> > > Arguments:
> > >
> > > - ``port_id``: port identifier of Ethernet device.
> > > - ``pattern``: pattern specification to add.
> > > - ``actions``: actions associated with the flow definition.
> > >
> > > Return value:
> > >
> > > A valid flow pointer in case of success, NULL otherwise and ``rte_errno`` is
> > > set to the positive version of one of the error codes defined for
> > > ``rte_flow_validate()``.
> > [Sugesh]: Kind of an implementation-specific query: what if an application
> > tries to add duplicate rules? Does the API create a new flow entry for every
> > API call?
> 
> If an application adds duplicate rules at a given priority level, the second
> one may return an error depending on the PMD. Collisions are sometimes
> trivial to detect (such as the same pattern twice), others not so much (one
> matching an Ethernet header only, the other one matching an IP header
> only).
> 
> Either way if a packet is matched by two rules at a given priority level,
> what happens is described in 3.3 (High level design) and 4.4.1 (Priorities).
> 
> Applications are responsible for not relying on the PMD to detect these, or
> should use a single priority level for each rule to make things clear.
> 
> However since the number of HW priority levels is finite and possibly small,
> they must also make sure not to waste them. My advice is to only use
> priority levels when it cannot be proven that rules do not collide.
> 
> If all you have is perfect matching rules without wildcards and all of them
> match the same number of layers, a single priority level is fine.
> 
[Sugesh] Makes sense. It's fine from my perspective.
> > [Sugesh] Another concern is the cost and time of installing these rules
> > in the hardware. Can we make these APIs time bound (or at least provide
> > an option to set a time limit to execute these APIs), so that the
> > application doesn't have to wait so long when installing and deleting
> > flows with slow hardware/NICs? What do you think? Most of the datapath
> > flow installations are dynamic and triggered only when there is ingress
> > traffic. Delays in flow insertion/deletion have unpredictable
> > consequences.
> 
> This API is (currently) aimed at the control path only, and must indeed be
> assumed to be slow. Creating millions of rules may take quite long as it may
> involve syscalls and other time-consuming synchronization things on the
> PMD
> side.
> 
> So currently there is no plan to have rules added from the data path with
> time constraints. I think it would be implemented through a different set of
> functions anyway.
> 
> I do not think adding time limits is practical, even specifying in the API
> that creating a single flow rule must take less than a maximum number of
> seconds in order to be effective is too much of a constraint (applications
> that create all flows during init may not care after all).
> 
> You should consider in any case that modifying flow rules will always be
> slower than receiving packets, there is no way around that. Applications
> have to live with it and provide a software fallback for incoming packets
> while managing flow rules.
> 
> Moreover, think about what happens when you hit the maximum number of
> flow
> rules and cannot create any more. Applications need to implement some
> kind
> of fallback in their data path.
> 
> Offloading flows in HW is also only useful if they live much longer than the
> time taken to create and delete them. Perhaps applications may choose to
> do
> so after detecting long lived flows such as TCP sessions.
> 
> You may have one separate control thread dedicated to manage flows and
> keep your normal control thread unaffected by delays. Several threads can
> even be dedicated, one per device.
[Sugesh] I agree that the flow insertion cannot be as fast as the packet receiving
rate. From an application point of view the problem arises when hardware flow
insertion takes longer than software flow insertion. At least the application has to know
the cost of inserting/deleting a rule in hardware beforehand; otherwise how can the application
choose the right flow candidates for hardware? My point here is that the application expects
a deterministic behavior from a classifier while inserting and deleting rules.
> 
> > [Sugesh] Another query is on the synchronization part. What if the same
> > rules are handled from different threads? Is the application responsible
> > for handling the concurrent hardware programming?
> 
> Like most (if not all) DPDK APIs, applications are responsible for managing
> locking issues as described in 4.3 (Behavior). Since this is a control path
> API and applications usually have a single control thread, locking should
> not be necessary in most cases.
> 
> Regarding my above comment about using several control threads to
> manage
> different devices, section 4.3 says:
> 
>  "There is no provision for reentrancy/multi-thread safety, although nothing
>  should prevent different devices from being configured at the same
>  time. PMDs may protect their control path functions accordingly."
> 
> I'd like to emphasize it is not "per port" but "per device", since in a few
> cases a configurable resource is shared by several ports. It may be
> difficult for applications to determine which ports are shared by a given
> device but this falls outside the scope of this API.
> 
> Do you think adding the guarantee that it is always safe to configure two
> different ports simultaneously without locking from the application side is
> necessary? In which case the PMD would be responsible for locking shared
> resources.
[Sugesh] This would be a little bit complicated when some ports are not under
DPDK itself (what if one port is managed by the kernel?) or ports are tied to
different applications. Locking in the PMD helps when the ports are accessed by
multiple DPDK applications. However, what if the port itself is not under DPDK?
> 
> > > Destruction
> > > ~~~~~~~~~~~
> > >
> > > Flow rules destruction is not automatic, and a queue should not be
> released
> > > if any are still attached to it. Applications must take care of performing
> > > this step before releasing resources.
> > >
> > > ::
> > >
> > >  int
> > >  rte_flow_destroy(uint8_t port_id,
> > >                   struct rte_flow *flow);
> > >
> > >
> > [Sugesh] I would suggest having a clean-up API; it is really useful as
> > releasing a queue (is it applicable to releasing a port too?) does not
> > guarantee automatic flow destruction.
> 
> Would something like rte_flow_flush(port_id) do the trick? I wanted to
> emphasize in this first draft that applications should really keep the flow
> pointers around in order to manage/destroy them. It is their responsibility,
> not PMD's.
[Sugesh] Thanks, I think the flush call will do.
> 
> > This way the application can initialize the port,
> > clean up all the existing rules and create new rules on a clean slate.
> 
> No resource can be released as long as a flow rule is using it (bad things
> may happen otherwise), all flow rules must be destroyed first, thus none can
> possibly remain after initializing a port. It is assumed that PMDs do
> automatic clean up during init if necessary to ensure this.
[Sugesh] That will do.
> 
> > > Failure to destroy a flow rule may occur when other flow rules depend on
> it,
> > > and destroying it would result in an inconsistent state.
> > >
> > > This function is only guaranteed to succeed if flow rules are destroyed in
> > > reverse order of their creation.
> > >
> > > Arguments:
> > >
> > > - ``port_id``: port identifier of Ethernet device.
> > > - ``flow``: flow rule to destroy.
> > >
> > > Return value:
> > >
> > > - **0** on success, a negative errno value otherwise and ``rte_errno`` is
> > >   set.
> > >
> > > .. raw:: pdf
> > >
> > >    PageBreak
> > >
> > > Query
> > > ~~~~~
> > >
> > > Query an existing flow rule.
> > >
> > > This function allows retrieving flow-specific data such as counters. Data
> > > is gathered by special actions which must be present in the flow rule
> > > definition.
> > >
> > > ::
> > >
> > >  int
> > >  rte_flow_query(uint8_t port_id,
> > >                 struct rte_flow *flow,
> > >                 enum rte_flow_action_type action,
> > >                 void *data);
> > >
> > > Arguments:
> > >
> > > - ``port_id``: port identifier of Ethernet device.
> > > - ``flow``: flow rule to query.
> > > - ``action``: action type to query.
> > > - ``data``: pointer to storage for the associated query data type.
> > >
> > > Return value:
> > >
> > > - **0** on success, a negative errno value otherwise and ``rte_errno`` is
> > >   set.
> > >
> > > .. raw:: pdf
> > >
> > >    PageBreak
> > >
> > > Behavior
> > > --------
> > >
> > > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> > >   returned).
> > >
> > > - There is no provision for reentrancy/multi-thread safety, although
> nothing
> > >   should prevent different devices from being configured at the same
> > >   time. PMDs may protect their control path functions accordingly.
> > >
> > > - Stopping the data path (TX/RX) should not be necessary when managing
> > > flow
> > >   rules. If this cannot be achieved naturally or with workarounds (such as
> > >   temporarily replacing the burst function pointers), an appropriate error
> > >   code must be returned (``EBUSY``).
> > >
> > > - PMDs, not applications, are responsible for maintaining flow rules
> > >   configuration when stopping and restarting a port or performing other
> > >   actions which may affect them. They can only be destroyed explicitly.
> > >
> > > .. raw:: pdf
> > >
> > >    PageBreak
> > >
> > [Sugesh] Query all the rules for a specific port/queue? Useful when adding
> > and deleting ports and queues dynamically according to need. I am not sure
> > what the other different use cases for these APIs are. But I feel it would
> > make it much easier to manage flows from the application. What do you think?
> 
> Not sure, that seems to fall out of the scope of this API. As described,
> applications already store the related rte_flow pointers. Accordingly, they
> know how many rules are associated to a given port. They need both a port
> ID
> and a flow rule pointer to destroy them after all.
> 
> Now perhaps something to convert back an existing rte_flow to a pattern
> and
> a list of actions, however I cannot see an immediate use case for it.
> 
> What you describe seems to be doable through a front-end API, I think
> keeping this one as low-level as possible with only basic actions is better
> right now. I'll keep your suggestion in mind.
[Sugesh] Sure, that will be fine.
> 
> > > Compatibility
> > > -------------
> > >
> > > No known hardware implementation supports all the features described
> in
> > > this
> > > document.
> > >
> > > Unsupported features or combinations are not expected to be fully
> > > emulated
> > > in software by PMDs for performance reasons. Partially supported
> features
> > > may be completed in software as long as hardware performs most of the
> > > work
> > > (such as queue redirection and packet recognition).
> > >
> > > However PMDs are expected to do their best to satisfy application
> requests
> > > by working around hardware limitations as long as doing so does not
> affect
> > > the behavior of existing flow rules.
> > >
> > > The following sections provide a few examples of such cases, they are
> based
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-05 18:16 Adrien Mazarguil
                   ` (2 preceding siblings ...)
  2016-07-08 11:11 ` Liang, Cunming
@ 2016-07-11 10:41 ` Jerin Jacob
  2016-07-21 19:20   ` Adrien Mazarguil
  2016-07-21  8:13 ` Rahul Lakkireddy
  4 siblings, 1 reply; 54+ messages in thread
From: Jerin Jacob @ 2016-07-11 10:41 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu, Jan Medala,
	John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
	Alejandro Lucero, Sony Chacko, Pablo de Lara, Olga Shern

On Tue, Jul 05, 2016 at 08:16:46PM +0200, Adrien Mazarguil wrote:

Hi Adrien,

Overall this proposal looks very good. I could easily map it to the
classification hardware engines I am familiar with.


> Priorities
> ~~~~~~~~~~
> 
> A priority can be assigned to a matching pattern.
> 
> The default priority level is 0 and is also the highest. Support for more
> than a single priority level in hardware is not guaranteed.
> 
> If a packet is matched by several filters at a given priority level, the
> outcome is undefined. It can take any path and can even be duplicated.

In some cases a fatal unrecoverable error can occur too.

> Matching pattern items for packet data must be naturally stacked (ordered
> from lowest to highest protocol layer), as in the following examples:
> 
> +--------------+
> | TCPv4 as L4  |
> +===+==========+
> | 0 | Ethernet |
> +---+----------+
> | 1 | IPv4     |
> +---+----------+
> | 2 | TCP      |
> +---+----------+
> 
> +----------------+
> | TCPv6 in VXLAN |
> +===+============+
> | 0 | Ethernet   |
> +---+------------+
> | 1 | IPv4       |
> +---+------------+
> | 2 | UDP        |
> +---+------------+
> | 3 | VXLAN      |
> +---+------------+
> | 4 | Ethernet   |
> +---+------------+
> | 5 | IPv6       |
> +---+------------+

How about enumerating this as an "Inner-IPV6" flow type to avoid any confusion? Though the spec
can be the same for both IPv6 and Inner-IPV6.

> | 6 | TCP        |
> +---+------------+
> 
> +-----------------------------+
> | TCPv4 as L4 with meta items |
> +===+=========================+
> | 0 | VOID                    |
> +---+-------------------------+
> | 1 | Ethernet                |
> +---+-------------------------+
> | 2 | VOID                    |
> +---+-------------------------+
> | 3 | IPv4                    |
> +---+-------------------------+
> | 4 | TCP                     |
> +---+-------------------------+
> | 5 | VOID                    |
> +---+-------------------------+
> | 6 | VOID                    |
> +---+-------------------------+
> 
> The above example shows how meta items do not affect packet data matching
> items, as long as those remain stacked properly. The resulting matching
> pattern is identical to "TCPv4 as L4".
> 
> +----------------+
> | UDPv6 anywhere |
> +===+============+
> | 0 | IPv6       |
> +---+------------+
> | 1 | UDP        |
> +---+------------+
> 
> If supported by the PMD, omitting one or several protocol layers at the
> bottom of the stack as in the above example (missing an Ethernet
> specification) enables hardware to look anywhere in packets.

It would be good if the common code could give it to the PMD as Ethernet, IPV6,
UDP (to avoid common code duplication across PMDs).

> 
> It is unspecified whether the payload of supported encapsulations
> (e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
> inner, outer or both packets.

A separate flow type enumeration may fix that problem, like the "Inner-IPV6"
type mentioned above.

> 
> +---------------------+
> | Invalid, missing L3 |
> +===+=================+
> | 0 | Ethernet        |
> +---+-----------------+
> | 1 | UDP             |
> +---+-----------------+
> 
> The above pattern is invalid due to a missing L3 specification between L2
> and L4. It is only allowed at the bottom and at the top of the stack.
> 

> ``SIGNATURE``
> ^^^^^^^^^^^^^
> 
> Requests hash-based signature dispatching for this rule.
> 
> Considering this is a global setting on devices that support it, all
> subsequent filter rules may have to be created with it as well.

Can you describe the use case for this and how it's different from the
existing rte_ethdev RSS settings?

> 
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
> 
> +--------------------+
> | SIGNATURE          |
> +==========+=========+
> | ``spec`` | TBD     |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
> 

> 
> ``ETH``
> ^^^^^^^
> 
> Matches an Ethernet header.
> 
> - ``dst``: destination MAC.
> - ``src``: source MAC.
> - ``type``: EtherType.
> - ``tags``: number of 802.1Q/ad tags defined.
> - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> 
>  - ``tpid``: Tag protocol identifier.
>  - ``tci``: Tag control information.

Find below other L2 layer attributes that are useful in HW classification:

- HiGig headers
- DSA Headers
- MPLS

Maybe we need to introduce a separate flow type with a spec to add the support. Right?

Jerin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-11  3:18     ` Liang, Cunming
@ 2016-07-11 10:06       ` Adrien Mazarguil
  0 siblings, 0 replies; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-11 10:06 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

On Mon, Jul 11, 2016 at 03:18:19AM +0000, Liang, Cunming wrote:
[...]
> > > >When several actions are combined in a flow rule, they should all have
> > > >different types (e.g. dropping a packet twice is not possible). However
> > > >considering the VOID type is an exception to this rule, the defined behavior
> > > >is for PMDs to only take into account the last action of a given type found
> > > >in the list. PMDs still perform error checking on the entire list.
> > > >
> > > >*Note that PASSTHRU is the only action able to override a terminating rule.*
> > > [LC] I'm wondering how to address the meta data carried by the mbuf; there's
> > > nothing mentioned here.
> > > For packets that hit one specific flow, usually there's something for the CPU
> > > to identify the flow.
> > > FDIR and RSS, as examples, have an ID or key in the mbuf. In addition, some
> > > meta data may be pointed to by userdata in the mbuf.
> > > Any view on it?
> > 
> > Yes, this is defined as the ID action. It is described in 4.1.6.4 (ID) and
> > there is an example how a FDIR rule would be converted to use it in 5.7
> > (FDIR to most item types → QUEUE, DROP, PASSTHRU).
> [LC] So in RSS cases, the actions would be {RSS, ID}, in which ID represents the RSS key, right?

Well, the ID action is always the same regardless of other actions; it
takes an arbitrary 32 bit value to be returned as meta data with
matched packets. No side effect on RSS is expected.

If you are talking about the RSS action (4.1.6.9), the RSS configuration (key
and algorithm to use) is provided in its own structure along with
the list of target queues (see struct rte_flow_action_rss in [1]).

Note the RSS action is independent, it is unrelated to the port-wide RSS
configuration. Devices may not be able to support both simultaneously, for
instance creating multiple queues with RSS enabled globally may prevent
requesting a flow rule with a RSS action later. Likewise, such a rule may
possibly be defined only once depending on capabilities.

For devices supporting both, think of it as multiple level RSS. Flow rules
perform RSS on selected packets first, then the default global RSS
configuration takes care of packets that haven't hit a terminating flow
rule. This is the same as the QUEUE action except RSS is additionally
performed to spread packet among several queues.

Thus applications can request RSS with ID to get both RSS _and_ their
arbitrary 32 bit value as meta data. Once again, HW support for this
combination is not mandatory.

PMDs can assist HW to work around such limitations sometimes as described in
4.4.4 (Unsupported actions) as long as the software cost is kept minimal.

[1] https://raw.githubusercontent.com/6WIND/rte_flow/master/rte_flow.h

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-08 13:25   ` Adrien Mazarguil
@ 2016-07-11  3:18     ` Liang, Cunming
  2016-07-11 10:06       ` Adrien Mazarguil
  0 siblings, 1 reply; 54+ messages in thread
From: Liang, Cunming @ 2016-07-11  3:18 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Hi Adrien,

Thanks so much for the explanation.

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Friday, July 08, 2016 9:26 PM
> To: Liang, Cunming <cunming.liang@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>; Zhang,
> Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Rasesh
> Mody <rasesh.mody@qlogic.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Jan
> Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen, Jing D
> <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com>; De
> Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Olga Shern
> <olgas@mellanox.com>
> Subject: Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
> 
> Hi Cunming,
> 
> I agree with Bruce, I'll start snipping non relevant parts considering the
> size of this message. Please see below.
> 
> On Fri, Jul 08, 2016 at 07:11:28PM +0800, Liang, Cunming wrote:
> [...]
> > >Meta item types
> > >~~~~~~~~~~~~~~~
> > >
> > >These do not match packet data but affect how the pattern is processed, most
> > >of them do not need a specification structure. This particularity allows
> > >them to be specified anywhere without affecting other item types.
> > [LC] For the meta items (END, VOID, INVERT) and some data matching types
> > like ANY and RAW,
> > is the PMD entirely responsible for understanding the key character and
> > parsing the header graph?
> 
> We haven't started on the PMD side of things yet (the public API described
> here does not discuss it), but I think PMDs will see the pattern as-is,
> untranslated (to answer your question, yes, it will be the case for END,
> VOID, INVERT, ANY and RAW like all others).
> 
> However I intend to add private helper functions as needed in librte_ether
> (or anywhere else in EAL) that PMDs can call to ease parsing and validation
> of flow rules, otherwise most of them will end up implementing redundant
> functions.
[LC] Agree, that's very helpful.

> 
> [...]
> > >When several actions are combined in a flow rule, they should all have
> > >different types (e.g. dropping a packet twice is not possible). However
> > >considering the VOID type is an exception to this rule, the defined behavior
> > >is for PMDs to only take into account the last action of a given type found
> > >in the list. PMDs still perform error checking on the entire list.
> > >
> > >*Note that PASSTHRU is the only action able to override a terminating rule.*
> > [LC] I'm wondering how to address the meta data carried by the mbuf; there's
> > nothing mentioned here.
> > For packets that hit one specific flow, usually there's something for the CPU
> > to identify the flow.
> > FDIR and RSS, as examples, have an ID or key in the mbuf. In addition, some
> > meta data may be pointed to by userdata in the mbuf.
> > Any view on it?
> 
> Yes, this is defined as the ID action. It is described in 4.1.6.4 (ID) and
> there is an example how a FDIR rule would be converted to use it in 5.7
> (FDIR to most item types → QUEUE, DROP, PASSTHRU).
[LC] So in RSS cases, the actions would be {RSS, ID}, in which ID represents the RSS key, right?

> 
> It is basically described as a flow rule with two actions: queue redirection
> and packet tagging.
> 
> Does it answer your question?
[LC] I think so, thanks.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-08 11:11 ` Liang, Cunming
  2016-07-08 12:38   ` Bruce Richardson
@ 2016-07-08 13:25   ` Adrien Mazarguil
  2016-07-11  3:18     ` Liang, Cunming
  1 sibling, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-08 13:25 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu, Jan Medala,
	John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, Pablo de Lara,
	Olga Shern

Hi Cunming,

I agree with Bruce, I'll start snipping non relevant parts considering the
size of this message. Please see below.

On Fri, Jul 08, 2016 at 07:11:28PM +0800, Liang, Cunming wrote:
[...]
> >Meta item types
> >~~~~~~~~~~~~~~~
> >
> >These do not match packet data but affect how the pattern is processed, most
> >of them do not need a specification structure. This particularity allows
> >them to be specified anywhere without affecting other item types.
> [LC] For the meta items (END, VOID, INVERT) and some data matching types
> like ANY and RAW,
> is the PMD entirely responsible for understanding the key character and
> parsing the header graph?

We haven't started on the PMD side of things yet (the public API described
here does not discuss it), but I think PMDs will see the pattern as-is,
untranslated (to answer your question, yes, it will be the case for END,
VOID, INVERT, ANY and RAW like all others).

However I intend to add private helper functions as needed in librte_ether
(or anywhere else in EAL) that PMDs can call to ease parsing and validation
of flow rules, otherwise most of them will end up implementing redundant
functions.
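
For example something along these lines (hypothetical, neither the name
nor the signature exist yet):

  /* Walk a pattern once and return 0 when every item type appears in
   * the supported[] list, a negative errno value otherwise. */
  int
  rte_flow_pattern_check(const struct rte_flow_pattern *pattern,
                         const uint32_t *supported, /* item types */
                         unsigned int num);

PMDs would call such helpers from their validate/create callbacks instead
of each reimplementing the same checks.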

[...]
> >When several actions are combined in a flow rule, they should all have
> >different types (e.g. dropping a packet twice is not possible). However
> >considering the VOID type is an exception to this rule, the defined behavior
> >is for PMDs to only take into account the last action of a given type found
> >in the list. PMDs still perform error checking on the entire list.
> >
> >*Note that PASSTHRU is the only action able to override a terminating rule.*
> [LC] I'm wondering how to address the meta data carried by the mbuf; there's
> nothing mentioned here.
> For packets that hit one specific flow, usually there's something for the CPU
> to identify the flow.
> FDIR and RSS, as examples, have an ID or key in the mbuf. In addition, some
> meta data may be pointed to by userdata in the mbuf.
> Any view on it?

Yes, this is defined as the ID action. It is described in 4.1.6.4 (ID) and
there is an example how a FDIR rule would be converted to use it in 5.7
(FDIR to most item types → QUEUE, DROP, PASSTHRU).

It is basically described as a flow rule with two actions: queue redirection
and packet tagging.

Does it answer your question?

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-07 23:15 ` Chandran, Sugesh
@ 2016-07-08 13:03   ` Adrien Mazarguil
  2016-07-11 10:42     ` Chandran, Sugesh
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-08 13:03 UTC (permalink / raw)
  To: Chandran, Sugesh
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Hi Sugesh,

On Thu, Jul 07, 2016 at 11:15:07PM +0000, Chandran, Sugesh wrote:
> Hi Adrien,
> 
> Thank you for proposing this. It would be really useful for applications such as OVS-DPDK.
> Please find my comments and questions inline below prefixed with [Sugesh]. Most of them are from the perspective of enabling these APIs in applications such as OVS-DPDK.

Thanks, I'm replying below.

> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > Sent: Tuesday, July 5, 2016 7:17 PM
> > To: dev@dpdk.org
> > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Zhang, Helin
> > <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Rasesh
> > Mody <rasesh.mody@qlogic.com>; Ajit Khaparde
> > <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> > <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> > Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> > Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> > Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> > <sony.chacko@qlogic.com>; Jerin Jacob
> > <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>
> > Subject: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
> > 
> > Hi All,
> > 
> > First, forgive me for this large message, I know our mailboxes already
> > suffer quite a bit from the amount of traffic on this ML.
> > 
> > This is not exactly yet another thread about how flow director should be
> > extended, rather about a brand new API to handle filtering and
> > classification for incoming packets in the most PMD-generic and
> > application-friendly fashion we can come up with. Reasons described below.
> > 
> > I think this topic is important enough to include both the users of this API
> > as well as PMD maintainers. So far I have CC'ed librte_ether (especially
> > rte_eth_ctrl.h contributors), testpmd and PMD maintainers (with and
> > without
> > a .filter_ctrl implementation), but if you know application maintainers
> > other than testpmd who use FDIR or might be interested in this discussion,
> > feel free to add them.
> > 
> > The issues we found with the current approach are already summarized in
> > the
> > following document, but here is a quick summary for TL;DR folks:
> > 
> > - PMDs do not expose a common set of filter types and even when they do,
> >   their behavior more or less differs.
> > 
> > - Applications need to determine and adapt to device-specific limitations
> >   and quirks on their own, without help from PMDs.
> > 
> > - Writing an application that creates flow rules targeting all devices
> >   supported by DPDK is thus difficult, if not impossible.
> > 
> > - The current API has too many unspecified areas (particularly regarding
> >   side effects of flow rules) that make PMD implementation tricky.
> > 
> > This RFC API handles everything currently supported by .filter_ctrl, the
> > idea being to reimplement all of these to make them fully usable by
> > applications in a more generic and well defined fashion. It has a very small
> > set of mandatory features and an easy method to let applications probe for
> > supported capabilities.
> > 
> > The only downside is more work for the software control side of PMDs
> > because
> > they have to adapt to the API instead of the reverse. I think helpers can be
> > added to EAL to assist with this.
> > 
> > HTML version:
> > 
> >  https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
> > 
> > PDF version:
> > 
> >  https://rawgit.com/6WIND/rte_flow/master/rte_flow.pdf
> > 
> > Related draft header file (for reference while reading the specification):
> > 
> >  https://raw.githubusercontent.com/6WIND/rte_flow/master/rte_flow.h
> > 
> > Git tree for completeness (latest .rst version can be retrieved from here):
> > 
> >  https://github.com/6WIND/rte_flow
> > 
> > What follows is the ReST source of the above, for inline comments and
> > discussion. I intend to update that specification accordingly.
> > 
> > ========================
> > Generic filter interface
> > ========================
> > 
> > .. footer::
> > 
> >    v0.6
> > 
> > .. contents::
> > .. sectnum::
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Overview
> > ========
> > 
> > DPDK provides several competing interfaces added over time to perform
> > packet
> > matching and related actions such as filtering and classification.
> > 
> > They must be extended to implement the features supported by newer
> > devices
> > in order to expose them to applications, however the current design has
> > several drawbacks:
> > 
> > - Complicated filter combinations which have not been hard-coded cannot be
> >   expressed.
> > - Prone to API/ABI breakage when new features must be added to an
> > existing
> >   filter type, which frequently happens.
> > 
> > From an application point of view:
> > 
> > - Having disparate interfaces, all optional and lacking in features does not
> >   make this API easy to use.
> > - Seemingly arbitrary built-in limitations of filter types based on the
> >   device they were initially designed for.
> > - Undefined relationship between different filter types.
> > - High complexity, considerable undocumented and/or undefined behavior.
> > 
> > Considering the growing number of devices supported by DPDK, adding a
> > new
> > filter type each time a new feature must be implemented is not sustainable
> > in the long term. Applications not written to target a specific device
> > cannot really benefit from such an API.
> > 
> > For these reasons, this document defines an extensible unified API that
> > encompasses and supersedes these legacy filter types.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Current API
> > ===========
> > 
> > Rationale
> > ---------
> > 
> > The reason several competing (and mostly overlapping) filtering APIs are
> > present in DPDK is due to its nature as a thin layer between hardware and
> > software.
> > 
> > Each subsequent interface has been added to better match the capabilities
> > and limitations of the latest supported device, which usually happened to
> > need an incompatible configuration approach. Because of this, many ended
> > up
> > device-centric and not usable by applications that were not written for that
> > particular device.
> > 
> > This document is not the first attempt to address this proliferation issue,
> > in fact a lot of work has already been done both to create a more generic
> > interface while somewhat keeping compatibility with legacy ones through a
> > common call interface (``rte_eth_dev_filter_ctrl()`` with the
> > ``.filter_ctrl`` PMD callback in ``rte_ethdev.h``).
> > 
> > Today, these previously incompatible interfaces are known as filter types
> > (``RTE_ETH_FILTER_*`` from ``enum rte_filter_type`` in ``rte_eth_ctrl.h``).
> > 
> > However while trivial to extend with new types, it only shifted the
> > underlying problem as applications still need to be written for one kind of
> > filter type, which, as described in the following sections, is not
> > necessarily implemented by all PMDs that support filtering.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Filter types
> > ------------
> > 
> > This section summarizes the capabilities of each filter type.
> > 
> > Although the following list is exhaustive, the description of individual
> > types may contain inaccuracies due to the lack of documentation or usage
> > examples.
> > 
> > Note: names are prefixed with ``RTE_ETH_FILTER_``.
> > 
> > ``MACVLAN``
> > ~~~~~~~~~~~
> > 
> > Matching:
> > 
> > - L2 source/destination addresses.
> > - Optional 802.1Q VLAN ID.
> > - Masking individual fields on a rule basis is not supported.
> > 
> > Action:
> > 
> > - Packets are redirected either to a given VF device using its ID or to the
> >   PF.
> > 
> > ``ETHERTYPE``
> > ~~~~~~~~~~~~~
> > 
> > Matching:
> > 
> > - L2 source/destination addresses (optional).
> > - Ethertype (no VLAN ID?).
> > - Masking individual fields on a rule basis is not supported.
> > 
> > Action:
> > 
> > - Receive packets on a given queue.
> > - Drop packets.
> > 
> > ``FLEXIBLE``
> > ~~~~~~~~~~~~
> > 
> > Matching:
> > 
> > - At most 128 consecutive bytes anywhere in packets.
> > - Masking is supported with byte granularity.
> > - Priorities are supported (relative to this filter type, undefined
> >   otherwise).
> > 
> > Action:
> > 
> > - Receive packets on a given queue.
> > 
> > ``SYN``
> > ~~~~~~~
> > 
> > Matching:
> > 
> > - TCP SYN packets only.
> > - One high priority bit can be set to give the highest possible priority to
> >   this type when other filters with different types are configured.
> > 
> > Action:
> > 
> > - Receive packets on a given queue.
> > 
> > ``NTUPLE``
> > ~~~~~~~~~~
> > 
> > Matching:
> > 
> > - Source/destination IPv4 addresses (optional in 2-tuple mode).
> > - Source/destination TCP/UDP port (mandatory in 2 and 5-tuple modes).
> > - L4 protocol (2 and 5-tuple modes).
> > - Masking individual fields is supported.
> > - TCP flags.
> > - Up to 7 levels of priority relative to this filter type, undefined
> >   otherwise.
> > - No IPv6.
> > 
> > Action:
> > 
> > - Receive packets on a given queue.
> > 
> > ``TUNNEL``
> > ~~~~~~~~~~
> > 
> > Matching:
> > 
> > - Outer L2 source/destination addresses.
> > - Inner L2 source/destination addresses.
> > - Inner VLAN ID.
> > - IPv4/IPv6 source (destination?) address.
> > - Tunnel type to match (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE,
> > 802.1BR
> >   E-Tag).
> > - Tenant ID for tunneling protocols that have one.
> > - Any combination of the above can be specified.
> > - Masking individual fields on a rule basis is not supported.
> > 
> > Action:
> > 
> > - Receive packets on a given queue.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``FDIR``
> > ~~~~~~~~
> > 
> > Queries:
> > 
> > - Device capabilities and limitations.
> > - Device statistics about configured filters (resource usage, collisions).
> > - Device configuration (matching input set and masks).
> > 
> > Matching:
> > 
> > - Device mode of operation: none (to disable filtering), signature
> >   (hash-based dispatching from masked fields) or perfect (either MAC VLAN
> >   or tunnel).
> > - L2 Ethertype.
> > - Outer L2 destination address (MAC VLAN mode).
> > - Inner L2 destination address, tunnel type (NVGRE, VXLAN) and tunnel ID
> >   (tunnel mode).
> > - IPv4 source/destination addresses, ToS, TTL and protocol fields.
> > - IPv6 source/destination addresses, TC, protocol and hop limits fields.
> > - UDP source/destination IPv4/IPv6 and ports.
> > - TCP source/destination IPv4/IPv6 and ports.
> > - SCTP source/destination IPv4/IPv6, ports and verification tag field.
> > - Note: only one protocol type at a time (either only L2 Ethertype, basic
> >   IPv6, IPv4+UDP, IPv4+TCP and so on).
> > - VLAN TCI (extended API).
> > - At most 16 bytes to match in payload (extended API). A global device
> >   look-up table specifies for each possible protocol layer (unknown, raw,
> >   L2, L3, L4) the offset to use for each byte (they do not need to be
> >   contiguous) and the related bitmask.
> > - Whether the packet is addressed to the PF or a VF, in which case its ID
> >   can be matched as well (extended API).
> > - Masking most of the above fields is supported, but simultaneously affects
> >   all filters configured on a device.
> > - Input set can be modified in a similar fashion for a given device to
> >   ignore individual fields of filters (i.e. do not match the destination
> >   address in an IPv4 filter, refer to **RTE_ETH_INPUT_SET_**
> >   macros). Configuring this also affects RSS processing on **i40e**.
> > - Filters can also provide 32 bits of arbitrary data to return as part of
> >   matched packets.
> > 
> > Action:
> > 
> > - **RTE_ETH_FDIR_ACCEPT**: receive (accept) packet on a given queue.
> > - **RTE_ETH_FDIR_REJECT**: drop packet immediately.
> > - **RTE_ETH_FDIR_PASSTHRU**: similar to accept for the last filter in list,
> >   otherwise process it with subsequent filters.
> > - For accepted packets and if requested by filter, either 32 bits of
> >   arbitrary data and four bytes of matched payload (only in case of flex
> >   bytes matching), or eight bytes of matched payload (flex also) are added
> >   to meta data.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``HASH``
> > ~~~~~~~~
> > 
> > Not an actual filter type. Provides and retrieves the global device
> > configuration (per port or entire NIC) for hash functions and their
> > properties.
> > 
> > Hash function selection: "default" (keep current), XOR or Toeplitz.
> > 
> > This function can be configured per flow type (**RTE_ETH_FLOW_**
> > definitions), supported types are:
> > 
> > - Unknown.
> > - Raw.
> > - Fragmented or non-fragmented IPv4.
> > - Non-fragmented IPv4 with L4 (TCP, UDP, SCTP or other).
> > - Fragmented or non-fragmented IPv6.
> > - Non-fragmented IPv6 with L4 (TCP, UDP, SCTP or other).
> > - L2 payload.
> > - IPv6 with extensions.
> > - IPv6 with L4 (TCP, UDP) and extensions.
> > 
> > ``L2_TUNNEL``
> > ~~~~~~~~~~~~~
> > 
> > Matching:
> > 
> > - All packets received on a given port.
> > 
> > Action:
> > 
> > - Add tunnel encapsulation (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE,
> >   802.1BR E-Tag) using the provided Ethertype and tunnel ID (only E-Tag
> >   is implemented at the moment).
> > - VF ID to use for tag insertion (currently unused).
> > - Destination pool for tag based forwarding (pools are IDs that can be
> >   assigned to ports, duplication occurs if the same ID is shared by
> >   several ports of the same NIC).
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Driver support
> > --------------
> > 
> > ======== ======= ========= ======== === ====== ====== ==== ==== =========
> > Driver   MACVLAN ETHERTYPE FLEXIBLE SYN NTUPLE TUNNEL FDIR HASH L2_TUNNEL
> > ======== ======= ========= ======== === ====== ====== ==== ==== =========
> > bnx2x
> > cxgbe
> > e1000            yes       yes      yes yes
> > ena
> > enic                                                  yes
> > fm10k
> > i40e     yes     yes                           yes    yes  yes
> > ixgbe            yes                yes yes           yes       yes
> > mlx4
> > mlx5                                                  yes
> > szedata2
> > ======== ======= ========= ======== === ====== ====== ==== ==== =========
> > 
> > Flow director
> > -------------
> > 
> > Flow director (FDIR) is the name of the most capable filter type, which
> > covers most features offered by others. As such, it is the most widespread
> > in PMDs that support filtering (i.e. all of them besides **e1000**).
> > 
> > It is also the only type that allows an arbitrary 32-bit value provided by
> > applications to be attached to a filter and returned with matching packets
> > instead of relying on the destination queue to recognize flows.
> > 
> > Unfortunately, even FDIR requires applications to be aware of low-level
> > capabilities and limitations (most of which come directly from **ixgbe** and
> > **i40e**):
> > 
> > - Bitmasks are set globally per device (port?), not per filter.
> [Sugesh] This means the application cannot define filters that match on arbitrary different offsets?
> If that’s the case, I assume the application has to program the bitmask in advance. Otherwise how
> does the API framework deduce this bitmask information from the rules? It's not very clear to me
> how an application passes down the bitmask information for multiple filters on the same port.

This is my understanding of how flow director currently works, perhaps
someone more familiar with it can answer this question better than I could.

Let me take an example: if a particular device can only handle a single IPv4
mask common to all flow rules (say, only to match destination addresses),
updating that mask to also match the source address affects all defined and
future flow rules simultaneously.

That is how FDIR currently works and I think it is wrong, as it penalizes
devices that do support individual bit-masks per rule, and is a little
awkward from an application point of view.

What I suggest for the new API instead is the ability to specify one
bit-mask per rule, and let the PMD deal with HW limitations by automatically
configuring global bitmasks from the first added rule, then refusing to add
subsequent rules if they specify a conflicting bit-mask. Existing rules
remain unaffected that way, and applications do not have to be extra
cautious.
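
A minimal sketch of that PMD-side logic, with a hypothetical helper and
field names assumed purely for illustration:

    #include <errno.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical device-wide IPv4 mask, programmed by the first rule. */
    struct ipv4_mask {
        uint32_t src;
        uint32_t dst;
    };

    static struct ipv4_mask dev_mask;
    static int dev_mask_set;

    static int
    pmd_check_ipv4_mask(const struct ipv4_mask *rule_mask)
    {
        if (!dev_mask_set) {
            /* First rule: configure the global mask from it. */
            dev_mask = *rule_mask;
            dev_mask_set = 1;
            return 0;
        }
        /* Subsequent rules: refuse conflicting masks, existing rules
         * remain unaffected. */
        if (memcmp(&dev_mask, rule_mask, sizeof(dev_mask)) != 0)
            return -EEXIST;
        return 0;
    }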

> > - Configuration state is not expected to be saved by the driver, and
> >   stopping/restarting a port requires the application to perform it again
> >   (API documentation is also unclear about this).
> > - Monolithic approach with ABI issues as soon as a new kind of flow or
> >   combination needs to be supported.
> > - Cryptic global statistics/counters.
> > - Unclear about how priorities are managed; filters seem to be arranged as a
> >   linked list in hardware (possibly related to configuration order).
> > 
> > Packet alteration
> > -----------------
> > 
> > One interesting feature is that the L2 tunnel filter type implements the
> > ability to alter incoming packets through a filter (in this case to
> > encapsulate them), thus the **mlx5** flow encap/decap features are not a
> > foreign concept.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Proposed API
> > ============
> > 
> > Terminology
> > -----------
> > 
> > - **Filtering API**: overall framework affecting the fate of selected
> >   packets, covers everything described in this document.
> > - **Matching pattern**: properties to look for in received packets, a
> >   combination of any number of items.
> > - **Pattern item**: part of a pattern that either matches packet data
> >   (protocol header, payload or derived information), or specifies properties
> >   of the pattern itself.
> > - **Actions**: what needs to be done when a packet matches a pattern.
> > - **Flow rule**: this is the result of combining a *matching pattern* with
> >   *actions*.
> > - **Filter rule**: a less generic term than *flow rule*, can otherwise be
> >   used interchangeably.
> > - **Hit**: a flow rule is said to be *hit* when processing a matching
> >   packet.
> > 
> > Requirements
> > ------------
> > 
> > As described in the previous section, there is a growing need for a common
> > method to configure filtering and related actions in a hardware independent
> > fashion.
> > 
> > The filtering API should not disallow any filter combination by design and
> > must remain as simple as possible to use. It can simply be defined as a
> > method to perform one or several actions on selected packets.
> > 
> > PMDs are aware of the capabilities of the device they manage and should be
> > responsible for preventing unsupported or conflicting combinations.
> > 
> > This approach is fundamentally different as it places most of the burden
> > on the software side of the PMD instead of having device capabilities
> > directly mapped to API functions, then expecting applications to work
> > around ensuing compatibility issues.
> > 
> > Requirements for a new API:
> > 
> > - Flexible and extensible without causing API/ABI problems for existing
> >   applications.
> > - Should be unambiguous and easy to use.
> > - Support existing filtering features and actions listed in `Filter types`_.
> > - Support packet alteration.
> > - In case of overlapping filters, their priority should be well documented.
> > - Support filter queries (for example to retrieve counters).
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > High level design
> > -----------------
> > 
> > The chosen approach to make filtering as generic as possible is by
> > expressing matching patterns through lists of items instead of the flat
> > structures used in DPDK today, enabling combinations that are not
> > predefined and thus being more versatile.
> > 
> > Flow rules can have several distinct actions (such as counting,
> > encapsulating, decapsulating before redirecting packets to a particular
> > queue, etc.), instead of relying on several rules to achieve this and having
> > applications deal with hardware implementation details regarding their
> > order.
> > 
> > Support for different priority levels on a per-rule basis is provided,
> > for example in order to force a more specific rule to come before a more
> > generic one for packets matched by both; however, hardware support for
> > more than a single priority level cannot be guaranteed. When supported,
> > the number of available priority levels is usually low, which is why they
> > can also be implemented in software by PMDs (e.g. to simulate missing
> > priority levels by reordering rules).
> > 
> > In order to remain as hardware agnostic as possible, by default all rules
> > are considered to have the same priority, which means that the order
> > between overlapping rules (when a packet is matched by several filters) is
> > undefined; packet duplication may even occur as a result.
> > 
> > PMDs may refuse to create overlapping rules at a given priority level when
> > they can be detected (e.g. if a pattern matches an existing filter).
> > 
> > Thus predictable results for a given priority level can only be achieved
> > with non-overlapping rules, using perfect matching on all protocol layers.
> > 
> > Support for multiple actions per rule may be implemented internally on top
> > of non-default hardware priorities, as a result both features may not be
> > simultaneously available to applications.
> > 
> > Considering that allowed pattern/actions combinations cannot be known in
> > advance and would result in an impractically large number of capabilities
> > to expose, a method is provided to validate a given rule from the current
> > device configuration state without actually adding it (akin to a "dry run"
> > mode).
> > 
> > This enables applications to check if the rule types they need are
> > supported at initialization time, before starting their data path. This
> > method can be used anytime, its only requirement being that the resources
> > needed by a rule must exist (e.g. a target RX queue must be configured
> > first).
> > 
> > Each defined rule is associated with an opaque handle managed by the PMD;
> > applications are responsible for keeping it. These handles can be used for
> > queries and rule management, such as retrieving counters or other data and
> > destroying the rules.
> > 
> > Handles must be destroyed before releasing associated resources such as
> > queues.
> > 
> > Integration
> > -----------
> > 
> > To avoid ABI breakage, this new interface will be implemented through the
> > existing filtering control framework (``rte_eth_dev_filter_ctrl()``) using
> > **RTE_ETH_FILTER_GENERIC** as a new filter type.
> > 
> > However a public front-end API described in `Rules management`_ will
> > be added as the preferred method to use it.
> > 
> > Once discussions with the community have converged to a definite API,
> > legacy filter types should be deprecated and a deadline defined to remove
> > their support entirely.
> > 
> > PMDs will have to be gradually converted to **RTE_ETH_FILTER_GENERIC** or
> > drop filtering support entirely. Less maintained PMDs for older hardware
> > may lose support at this point.
> > 
> > The notion of filter type will then be deprecated and subsequently dropped
> > to avoid confusion between both frameworks.
> > 
> > Implementation details
> > ======================
> > 
> > Flow rule
> > ---------
> > 
> > A flow rule is the combination of a matching pattern with a list of actions,
> > and is the basis of this API.
> > 
> > Priorities
> > ~~~~~~~~~~
> > 
> > A priority can be assigned to a matching pattern.
> > 
> > The default priority level is 0 and is also the highest. Support for more
> > than a single priority level in hardware is not guaranteed.
> > 
> > If a packet is matched by several filters at a given priority level, the
> > outcome is undefined. It can take any path and can even be duplicated.
> > 
> > Matching pattern
> > ~~~~~~~~~~~~~~~~
> > 
> > A matching pattern comprises any number of items of various types.
> > 
> > Items are arranged in a list to form a matching pattern for packets. They
> > fall in two categories:
> > 
> > - Protocol matching (ANY, RAW, ETH, IPV4, IPV6, ICMP, UDP, TCP, VXLAN
> >   and so on), usually associated with a specification structure. These
> >   must be stacked in the same order as the protocol layers to match,
> >   starting from L2.
> > 
> > - Affecting how the pattern is processed (END, VOID, INVERT, PF, VF,
> >   SIGNATURE and so on), often without a specification structure. Since they
> >   are meta data that does not match packet contents, these can be specified
> >   anywhere within item lists without affecting the protocol matching items.
> > 
> > Most item specifications can be optionally paired with a mask to narrow the
> > specific fields or bits to be matched.
> > 
> > - Items are defined with ``struct rte_flow_item``.
> > - Patterns are defined with ``struct rte_flow_pattern``.
> > 
> > Example of an item specification matching an Ethernet header:
> > 
> > +-----------------------------------------+
> > | Ethernet                                |
> > +==========+=========+====================+
> > | ``spec`` | ``src`` | ``00:01:02:03:04`` |
> > |          +---------+--------------------+
> > |          | ``dst`` | ``00:2a:66:00:01`` |
> > +----------+---------+--------------------+
> > | ``mask`` | ``src`` | ``00:ff:ff:ff:00`` |
> > |          +---------+--------------------+
> > |          | ``dst`` | ``00:00:00:00:ff`` |
> > +----------+---------+--------------------+
> > 
> > Non-masked bits stand for any value; Ethernet headers with the following
> > properties are thus matched:
> > 
> > - ``src``: ``??:01:02:03:??``
> > - ``dst``: ``??:??:??:??:01``
> > 
> > Except for meta types that do not need one, ``spec`` must be a valid pointer
> > to a structure of the related item type. A ``mask`` of the same type can be
> > provided to tell which bits in ``spec`` are to be matched.
> > 
> > A mask is normally only needed for ``spec`` fields matching packet data,
> > ignored otherwise. See individual item types for more information.
> > 
> > A ``NULL`` mask pointer is allowed and is similar to matching with a full
> > mask (all ones) on the ``spec`` fields supported by hardware; the
> > remaining fields are ignored (all zeroes), thus there is no error checking
> > for unsupported fields.
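> > 
> > For illustration, the Ethernet example above could be written as follows
> > (a sketch only: the layout of ``struct rte_flow_item_eth`` and the item
> > type name are assumptions, and the address values are copied verbatim
> > from the table, which shows five bytes):
> > 
> > ::
> > 
> >  struct rte_flow_item_eth eth_spec = {
> >      .src = { 0x00, 0x01, 0x02, 0x03, 0x04 },
> >      .dst = { 0x00, 0x2a, 0x66, 0x00, 0x01 },
> >  };
> >  struct rte_flow_item_eth eth_mask = {
> >      .src = { 0x00, 0xff, 0xff, 0xff, 0x00 },
> >      .dst = { 0x00, 0x00, 0x00, 0x00, 0xff },
> >  };
> >  struct rte_flow_item item = {
> >      .type = RTE_FLOW_ITEM_TYPE_ETH,
> >      .spec = &eth_spec,
> >      .mask = &eth_mask,
> >  };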
> > 
> > Matching pattern items for packet data must be naturally stacked (ordered
> > from lowest to highest protocol layer), as in the following examples:
> > 
> > +--------------+
> > | TCPv4 as L4  |
> > +===+==========+
> > | 0 | Ethernet |
> > +---+----------+
> > | 1 | IPv4     |
> > +---+----------+
> > | 2 | TCP      |
> > +---+----------+
> > 
> > +----------------+
> > | TCPv6 in VXLAN |
> > +===+============+
> > | 0 | Ethernet   |
> > +---+------------+
> > | 1 | IPv4       |
> > +---+------------+
> > | 2 | UDP        |
> > +---+------------+
> > | 3 | VXLAN      |
> > +---+------------+
> > | 4 | Ethernet   |
> > +---+------------+
> > | 5 | IPv6       |
> > +---+------------+
> > | 6 | TCP        |
> > +---+------------+
> > 
> > +-----------------------------+
> > | TCPv4 as L4 with meta items |
> > +===+=========================+
> > | 0 | VOID                    |
> > +---+-------------------------+
> > | 1 | Ethernet                |
> > +---+-------------------------+
> > | 2 | VOID                    |
> > +---+-------------------------+
> > | 3 | IPv4                    |
> > +---+-------------------------+
> > | 4 | TCP                     |
> > +---+-------------------------+
> > | 5 | VOID                    |
> > +---+-------------------------+
> > | 6 | VOID                    |
> > +---+-------------------------+
> > 
> > The above example shows how meta items do not affect packet data matching
> > items, as long as those remain stacked properly. The resulting matching
> > pattern is identical to "TCPv4 as L4".
> > 
> > +----------------+
> > | UDPv6 anywhere |
> > +===+============+
> > | 0 | IPv6       |
> > +---+------------+
> > | 1 | UDP        |
> > +---+------------+
> > 
> > If supported by the PMD, omitting one or several protocol layers at the
> > bottom of the stack as in the above example (missing an Ethernet
> > specification) enables hardware to look anywhere in packets.
> > 
> > It is unspecified whether the payload of supported encapsulations
> > (e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
> > inner, outer or both packets.
> > 
> > +---------------------+
> > | Invalid, missing L3 |
> > +===+=================+
> > | 0 | Ethernet        |
> > +---+-----------------+
> > | 1 | UDP             |
> > +---+-----------------+
> > 
> > The above pattern is invalid due to a missing L3 specification between L2
> > and L4. Omitting layers is only allowed at the bottom and at the top of
> > the stack.
> > 
> > Meta item types
> > ~~~~~~~~~~~~~~~
> > 
> > These do not match packet data but affect how the pattern is processed;
> > most of them do not need a specification structure. This particularity
> > allows them to be specified anywhere without affecting other item types.
> > 
> > ``END``
> > ^^^^^^^
> > 
> > End marker for item lists. Prevents further processing of items, thereby
> > ending the pattern.
> > 
> > - Its numeric value is **0** for convenience.
> > - PMD support is mandatory.
> > - Both ``spec`` and ``mask`` are ignored.
> > 
> > +--------------------+
> > | END                |
> > +==========+=========+
> > | ``spec`` | ignored |
> > +----------+---------+
> > | ``mask`` | ignored |
> > +----------+---------+
> > 
> > ``VOID``
> > ^^^^^^^^
> > 
> > Used as a placeholder for convenience. It is ignored and simply discarded by
> > PMDs.
> > 
> > - PMD support is mandatory.
> > - Both ``spec`` and ``mask`` are ignored.
> > 
> > +--------------------+
> > | VOID               |
> > +==========+=========+
> > | ``spec`` | ignored |
> > +----------+---------+
> > | ``mask`` | ignored |
> > +----------+---------+
> > 
> > One usage example for this type is generating rules that share a common
> > prefix quickly without reallocating memory, by only updating item types:
> > 
> > +------------------------+
> > | TCP, UDP or ICMP as L4 |
> > +===+====================+
> > | 0 | Ethernet           |
> > +---+--------------------+
> > | 1 | IPv4               |
> > +---+------+------+------+
> > | 2 | UDP  | VOID | VOID |
> > +---+------+------+------+
> > | 3 | VOID | TCP  | VOID |
> > +---+------+------+------+
> > | 4 | VOID | VOID | ICMP |
> > +---+------+------+------+
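> > 
> > A sketch of this technique (item layout and type names assumed, as
> > above):
> > 
> > ::
> > 
> >  /* Shared prefix: ETH / IPV4, followed by three reusable slots. */
> >  struct rte_flow_item items[] = {
> >      { .type = RTE_FLOW_ITEM_TYPE_ETH },
> >      { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
> >      { .type = RTE_FLOW_ITEM_TYPE_VOID },
> >      { .type = RTE_FLOW_ITEM_TYPE_VOID },
> >      { .type = RTE_FLOW_ITEM_TYPE_VOID },
> >      { .type = RTE_FLOW_ITEM_TYPE_END },
> >  };
> > 
> >  /* UDP rule, then TCP rule, without touching the common prefix. */
> >  items[2].type = RTE_FLOW_ITEM_TYPE_UDP;
> >  /* ... create rule ... */
> >  items[2].type = RTE_FLOW_ITEM_TYPE_VOID;
> >  items[3].type = RTE_FLOW_ITEM_TYPE_TCP;
> >  /* ... create rule ... */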
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``INVERT``
> > ^^^^^^^^^^
> > 
> > Inverted matching, i.e. process packets that do not match the pattern.
> > 
> > - Both ``spec`` and ``mask`` are ignored.
> > 
> > +--------------------+
> > | INVERT             |
> > +==========+=========+
> > | ``spec`` | ignored |
> > +----------+---------+
> > | ``mask`` | ignored |
> > +----------+---------+
> > 
> > Usage example in order to match non-TCPv4 packets only:
> > 
> > +--------------------+
> > | Anything but TCPv4 |
> > +===+================+
> > | 0 | INVERT         |
> > +---+----------------+
> > | 1 | Ethernet       |
> > +---+----------------+
> > | 2 | IPv4           |
> > +---+----------------+
> > | 3 | TCP            |
> > +---+----------------+
> > 
> > ``PF``
> > ^^^^^^
> > 
> > Matches packets addressed to the physical function of the device.
> > 
> > - Both ``spec`` and ``mask`` are ignored.
> > 
> > +--------------------+
> > | PF                 |
> > +==========+=========+
> > | ``spec`` | ignored |
> > +----------+---------+
> > | ``mask`` | ignored |
> > +----------+---------+
> > 
> > ``VF``
> > ^^^^^^
> > 
> > Matches packets addressed to the given virtual function ID of the device.
> > 
> > - Only ``spec`` needs to be defined, ``mask`` is ignored.
> > 
> > +----------------------------------------+
> > | VF                                     |
> > +==========+=========+===================+
> > | ``spec`` | ``vf``  | destination VF ID |
> > +----------+---------+-------------------+
> > | ``mask`` | ignored                     |
> > +----------+-----------------------------+
> > 
> > ``SIGNATURE``
> > ^^^^^^^^^^^^^
> > 
> > Requests hash-based signature dispatching for this rule.
> > 
> > Considering this is a global setting on devices that support it, all
> > subsequent filter rules may have to be created with it as well.
> > 
> > - Only ``spec`` needs to be defined, ``mask`` is ignored.
> > 
> > +--------------------+
> > | SIGNATURE          |
> > +==========+=========+
> > | ``spec`` | TBD     |
> > +----------+---------+
> > | ``mask`` | ignored |
> > +----------+---------+
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Data matching item types
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > Most of these are basically protocol header definitions with associated
> > bitmasks. They must be specified (stacked) from lowest to highest protocol
> > layer.
> > 
> > The following list is not exhaustive as new protocols will be added in the
> > future.
> > 
> > ``ANY``
> > ^^^^^^^
> > 
> > Matches any protocol in place of the current layer; a single ANY may also
> > stand for several protocol layers.
> > 
> > This is usually specified as the first pattern item when looking for a
> > protocol anywhere in a packet.
> > 
> > - A maximum value of **0** requests matching any number of protocol
> >   layers above or equal to the minimum value; a maximum value lower than
> >   the minimum one is otherwise invalid.
> > - Only ``spec`` needs to be defined, ``mask`` is ignored.
> > 
> > +-----------------------------------------------------------------------+
> > | ANY                                                                   |
> > +==========+=========+==================================================+
> > | ``spec`` | ``min`` | minimum number of layers covered                 |
> > |          +---------+--------------------------------------------------+
> > |          | ``max`` | maximum number of layers covered, 0 for infinity |
> > +----------+---------+--------------------------------------------------+
> > | ``mask`` | ignored                                                    |
> > +----------+------------------------------------------------------------+
> > 
> > Example for VXLAN TCP payload matching regardless of outer L3 (IPv4 or
> > IPv6) and L4 (UDP) both matched by the first ANY specification, and inner
> > L3 (IPv4 or IPv6) matched by the second ANY specification:
> > 
> > +----------------------------------+
> > | TCP in VXLAN with wildcards      |
> > +===+==============================+
> > | 0 | Ethernet                     |
> > +---+-----+----------+---------+---+
> > | 1 | ANY | ``spec`` | ``min`` | 2 |
> > |   |     |          +---------+---+
> > |   |     |          | ``max`` | 2 |
> > +---+-----+----------+---------+---+
> > | 2 | VXLAN                        |
> > +---+------------------------------+
> > | 3 | Ethernet                     |
> > +---+-----+----------+---------+---+
> > | 4 | ANY | ``spec`` | ``min`` | 1 |
> > |   |     |          +---------+---+
> > |   |     |          | ``max`` | 1 |
> > +---+-----+----------+---------+---+
> > | 5 | TCP                          |
> > +---+------------------------------+
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``RAW``
> > ^^^^^^^
> > 
> > Matches a string of a given length at a given offset (in bytes), or anywhere
> > in the payload of the current protocol layer (including L2 header if used as
> > the first item in the stack).
> > 
> > This does not increment the protocol layer count as it is not a protocol
> > definition. Subsequent RAW items modulate the first absolute one with
> > relative offsets.
> > 
> > - Using **-1** as the ``offset`` of the first RAW item makes its absolute
> >   offset not fixed, i.e. the pattern is searched everywhere.
> > - ``mask`` only affects the pattern.
> > 
> > +--------------------------------------------------------------+
> > | RAW                                                          |
> > +==========+=============+=====================================+
> > | ``spec`` | ``offset``  | absolute or relative pattern offset |
> > |          +-------------+-------------------------------------+
> > |          | ``length``  | pattern length                      |
> > |          +-------------+-------------------------------------+
> > |          | ``pattern`` | byte string of the above length     |
> > +----------+-------------+-------------------------------------+
> > | ``mask`` | ``offset``  | ignored                             |
> > |          +-------------+-------------------------------------+
> > |          | ``length``  | ignored                             |
> > |          +-------------+-------------------------------------+
> > |          | ``pattern`` | bitmask with the same byte length   |
> > +----------+-------------+-------------------------------------+
> > 
> > Example pattern looking for several strings at various offsets of a UDP
> > payload, using combined RAW items:
> > 
> > +------------------------------------------+
> > | UDP payload matching                     |
> > +===+======================================+
> > | 0 | Ethernet                             |
> > +---+--------------------------------------+
> > | 1 | IPv4                                 |
> > +---+--------------------------------------+
> > | 2 | UDP                                  |
> > +---+-----+----------+-------------+-------+
> > | 3 | RAW | ``spec`` | ``offset``  | -1    |
> > |   |     |          +-------------+-------+
> > |   |     |          | ``length``  | 3     |
> > |   |     |          +-------------+-------+
> > |   |     |          | ``pattern`` | "foo" |
> > +---+-----+----------+-------------+-------+
> > | 4 | RAW | ``spec`` | ``offset``  | 20    |
> > |   |     |          +-------------+-------+
> > |   |     |          | ``length``  | 3     |
> > |   |     |          +-------------+-------+
> > |   |     |          | ``pattern`` | "bar" |
> > +---+-----+----------+-------------+-------+
> > | 5 | RAW | ``spec`` | ``offset``  | -30   |
> > |   |     |          +-------------+-------+
> > |   |     |          | ``length``  | 3     |
> > |   |     |          +-------------+-------+
> > |   |     |          | ``pattern`` | "baz" |
> > +---+-----+----------+-------------+-------+
> > 
> > This translates to:
> > 
> > - Locate "foo" in UDP payload, remember its offset.
> > - Check "bar" at "foo"'s offset plus 20 bytes.
> > - Check "baz" at "foo"'s offset minus 30 bytes.
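> > 
> > Expressed with the RAW fields described above (a sketch; the exact
> > structure layout is an assumption):
> > 
> > ::
> > 
> >  struct rte_flow_item_raw foo = {
> >      .offset = -1,  /* not fixed, searched everywhere */
> >      .length = 3,
> >      .pattern = "foo",
> >  };
> >  struct rte_flow_item_raw bar = {
> >      .offset = 20,  /* relative to where "foo" was found */
> >      .length = 3,
> >      .pattern = "bar",
> >  };
> >  struct rte_flow_item_raw baz = {
> >      .offset = -30, /* relative, before "foo" */
> >      .length = 3,
> >      .pattern = "baz",
> >  };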
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``ETH``
> > ^^^^^^^
> > 
> > Matches an Ethernet header.
> > 
> > - ``dst``: destination MAC.
> > - ``src``: source MAC.
> > - ``type``: EtherType.
> > - ``tags``: number of 802.1Q/ad tags defined.
> > - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> > 
> >  - ``tpid``: Tag protocol identifier.
> >  - ``tci``: Tag control information.
> > 
> > ``IPV4``
> > ^^^^^^^^
> > 
> > Matches an IPv4 header.
> > 
> > - ``src``: source IP address.
> > - ``dst``: destination IP address.
> > - ``tos``: ToS/DSCP field.
> > - ``ttl``: TTL field.
> > - ``proto``: protocol number for the next layer.
> > 
> > ``IPV6``
> > ^^^^^^^^
> > 
> > Matches an IPv6 header.
> > 
> > - ``src``: source IP address.
> > - ``dst``: destination IP address.
> > - ``tc``: traffic class field.
> > - ``nh``: Next header field (protocol).
> > - ``hop_limit``: hop limit field (TTL).
> > 
> > ``ICMP``
> > ^^^^^^^^
> > 
> > Matches an ICMP header.
> > 
> > - TBD.
> > 
> > ``UDP``
> > ^^^^^^^
> > 
> > Matches a UDP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - ``length``: UDP length.
> > - ``checksum``: UDP checksum.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``TCP``
> > ^^^^^^^
> > 
> > Matches a TCP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - All other TCP fields and bits.
> > 
> > ``VXLAN``
> > ^^^^^^^^^
> > 
> > Matches a VXLAN header.
> > 
> > - TBD.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Actions
> > ~~~~~~~
> > 
> > Each possible action is represented by a type. Some have associated
> > configuration structures. Several actions combined in a list can be
> > assigned to a flow rule. That list is not ordered.
> > 
> > At least one action must be defined in a filter rule in order to do
> > something with matched packets.
> > 
> > - Actions are defined with ``struct rte_flow_action``.
> > - A list of actions is defined with ``struct rte_flow_actions``.
> > 
> > They fall in three categories:
> > 
> > - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
> >   processing matched packets by subsequent flow rules, unless overridden
> >   with PASSTHRU.
> > 
> > - Non-terminating actions (PASSTHRU, DUP) that leave matched packets up
> >   for additional processing by subsequent flow rules.
> > 
> > - Other non-terminating meta actions that do not affect the fate of
> >   packets (END, VOID, ID, COUNT).
> > 
> > When several actions are combined in a flow rule, they should all have
> > different types (e.g. dropping a packet twice is not possible). However,
> > the VOID type is an exception to this rule, and the defined behavior is
> > for PMDs to only take into account the last action of a given type found
> > in the list. PMDs still perform error checking on the entire list.
> > 
> > *Note that PASSTHRU is the only action able to override a terminating rule.*
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Example of an action that redirects packets to queue index 10:
> > 
> > +----------------+
> > | QUEUE          |
> > +===========+====+
> > | ``queue`` | 10 |
> > +-----------+----+
> > 
> > Action list examples; their order is not significant, and applications
> > must consider all actions to be performed simultaneously:
> > 
> > +----------------+
> > | Count and drop |
> > +=======+========+
> > | COUNT |        |
> > +-------+--------+
> > | DROP  |        |
> > +-------+--------+
> > 
> > +--------------------------+
> > | Tag, count and redirect  |
> > +=======+===========+======+
> > | ID    | ``id``    | 0x2a |
> > +-------+-----------+------+
> > | COUNT |                  |
> > +-------+-----------+------+
> > | QUEUE | ``queue`` | 10   |
> > +-------+-----------+------+
> > 
> > +-----------------------+
> > | Redirect to queue 5   |
> > +=======+===============+
> > | DROP  |               |
> > +-------+-----------+---+
> > | QUEUE | ``queue`` | 5 |
> > +-------+-----------+---+
> > 
> > In the above example, considering both actions are performed
> > simultaneously, the end result is that only QUEUE has any effect.
> > 
> > +-----------------------+
> > | Redirect to queue 3   |
> > +=======+===========+===+
> > | QUEUE | ``queue`` | 5 |
> > +-------+-----------+---+
> > | VOID  |               |
> > +-------+-----------+---+
> > | QUEUE | ``queue`` | 3 |
> > +-------+-----------+---+
> > 
> > As previously described, only the last action of a given type found in the
> > list is taken into account. The above example also shows that VOID is
> > ignored.
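> > 
> > For illustration, the "tag, count and redirect" list above might look as
> > follows in code (a sketch: structure and enum names are assumptions):
> > 
> > ::
> > 
> >  struct rte_flow_action_id id = { .id = 0x2a };
> >  struct rte_flow_action_queue queue = { .queue = 10 };
> >  struct rte_flow_action actions[] = {
> >      { .type = RTE_FLOW_ACTION_TYPE_ID, .conf = &id },
> >      { .type = RTE_FLOW_ACTION_TYPE_COUNT },
> >      { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
> >      { .type = RTE_FLOW_ACTION_TYPE_END },
> >  };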
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Action types
> > ~~~~~~~~~~~~
> > 
> > Common action types are described in this section. Like pattern item types,
> > this list is not exhaustive as new actions will be added in the future.
> > 
> > ``END`` (action)
> > ^^^^^^^^^^^^^^^^
> > 
> > End marker for action lists. Prevents further processing of actions, thereby
> > ending the list.
> > 
> > - Its numeric value is **0** for convenience.
> > - PMD support is mandatory.
> > - No configurable property.
> > 
> > +---------------+
> > | END           |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > ``VOID`` (action)
> > ^^^^^^^^^^^^^^^^^
> > 
> > Used as a placeholder for convenience. It is ignored and simply discarded by
> > PMDs.
> > 
> > - PMD support is mandatory.
> > - No configurable property.
> > 
> > +---------------+
> > | VOID          |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > ``PASSTHRU``
> > ^^^^^^^^^^^^
> > 
> > Leaves packets up for additional processing by subsequent flow rules. This
> > is the default when a rule does not contain a terminating action, but can be
> > specified to force a rule to become non-terminating.
> > 
> > - No configurable property.
> > 
> > +---------------+
> > | PASSTHRU      |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > Example to copy a packet to a queue and continue processing by subsequent
> > flow rules:
> [Sugesh] If a packet gets copied to a queue, it's a terminating action.
> How is it possible to do a subsequent action after the packet has already
> been moved to the queue? How does it differ from the DUP action?
> Am I missing anything here?

Devices may not support the combination of QUEUE + PASSTHRU (i.e. making
QUEUE non-terminating). However these same devices may expose the ability to
copy a packet to another (sniffer) queue all while keeping the rule
terminating (QUEUE + DUP but no PASSTHRU).

DUP with two rules, assuming priorities and PASSTHRU are supported:

- pattern X, priority 0; actions: QUEUE 5, PASSTHRU (non-terminating)

- pattern X, priority 1; actions: QUEUE 6 (terminating)

DUP with two actions on a single rule and a single priority:

- pattern X, priority 0; actions: DUP 5, QUEUE 6 (terminating)

If supported, from an application point of view the end result is similar in
both cases (note the second case may be implemented by the PMD using two HW
rules internally).

However the second case does not waste a priority level and clearly states
the intent to the PMD which is more likely to be supported. If HW supports
DUP directly it is even faster since there is a single rule. That is why I
thought having DUP as an action would be useful.
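
A rough sketch of the single-rule variant, using the same assumed structure
and enum names as elsewhere in this proposal:

    /* One rule, one priority: copy to queue 5, terminate on queue 6. */
    struct rte_flow_action_dup dup = { .queue = 5 };
    struct rte_flow_action_queue queue = { .queue = 6 };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_DUP, .conf = &dup },
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };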

> > +--------------------------+
> > | Copy to queue 8          |
> > +==========+===============+
> > | PASSTHRU |               |
> > +----------+-----------+---+
> > | QUEUE    | ``queue`` | 8 |
> > +----------+-----------+---+
> > 
> > ``ID``
> > ^^^^^^
> > 
> > Attaches a 32 bit value to packets.
> > 
> > +----------------------------------------------+
> > | ID                                           |
> > +========+=====================================+
> > | ``id`` | 32 bit value to return with packets |
> > +--------+-------------------------------------+
> > 
> [Sugesh] I assume the application has to program the flow 
> with a unique ID and matching packets are stamped with this ID
> when reporting to the software. The uniqueness of ID is NOT 
> guaranteed by the API framework. Correct me if I am wrong here.

You are right, if the way I wrote it is not clear enough, I'm open to
suggestions to improve it.

> [Sugesh] Is it a limitation to use only a 32 bit ID? Is it possible to have a
> 64 bit ID? So that the application can use the control plane flow pointer
> itself as an ID. Does it make sense?

I've specified a 32 bit ID for now because this is what FDIR supports and
also what existing devices can report today AFAIK (i40e and mlx5).

We could use 64 bit for future-proofing in a separate action like "ID64"
when at least one device supports it.

To PMD maintainers: please comment if you know devices that support tagging
matching packets with more than 32 bits of user-provided data!

> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``QUEUE``
> > ^^^^^^^^^
> > 
> > Assigns packets to a given queue index.
> > 
> > - Terminating by default.
> > 
> > +--------------------------------+
> > | QUEUE                          |
> > +===========+====================+
> > | ``queue`` | queue index to use |
> > +-----------+--------------------+
> > 
> > ``DROP``
> > ^^^^^^^^
> > 
> > Drop packets.
> > 
> > - No configurable property.
> > - Terminating by default.
> > - PASSTHRU overrides this action if both are specified.
> > 
> > +---------------+
> > | DROP          |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > ``COUNT``
> > ^^^^^^^^^
> > 
> [Sugesh] Should we really have to set the count action explicitly for every rule?
> IMHO it would be great for it to be an implicit action. Most applications would be
> interested in the stats of almost all the filters/flows.

I can see why, but no, it must be explicitly requested because you may want
to know in advance when it is not supported. Also considering it is
something else to be done by HW (a separate action), we can assume enabling
this may slow things down a bit.

HW limitations may also prevent you from having as many flow counters as you
want, in which case you probably want to carefully pick which rules have
them.

I think this is most useful with the DROP, VF and PF actions since
those are currently the only ones where SW may not see the related packets.

> > Enables a hits counter for this rule.
> > 
> > This counter can be retrieved and reset through ``rte_flow_query()``, see
> > ``struct rte_flow_query_count``.
> > 
> > - Counters can be retrieved with ``rte_flow_query()``.
> > - No configurable property.
> > 
> > +---------------+
> > | COUNT         |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > Query structure to retrieve and reset the flow rule hits counter:
> > 
> > +------------------------------------------------+
> > | COUNT query                                    |
> > +===========+=====+==============================+
> > | ``reset`` | in  | reset counter after query    |
> > +-----------+-----+------------------------------+
> > | ``hits``  | out | number of hits for this flow |
> > +-----------+-----+------------------------------+
> > 
> > ``DUP``
> > ^^^^^^^
> > 
> > Duplicates packets to a given queue index.
> > 
> > This is normally combined with QUEUE; however, when used alone, it is
> > actually similar to QUEUE + PASSTHRU.
> > 
> > - Non-terminating by default.
> > 
> > +------------------------------------------------+
> > | DUP                                            |
> > +===========+====================================+
> > | ``queue`` | queue index to duplicate packet to |
> > +-----------+------------------------------------+
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``RSS``
> > ^^^^^^^
> > 
> > Similar to QUEUE, except RSS is additionally performed on packets to spread
> > them among several queues according to the provided parameters.
> > 
> > - Terminating by default.
> > 
> > +---------------------------------------------+
> > | RSS                                         |
> > +==============+==============================+
> > | ``rss_conf`` | RSS parameters               |
> > +--------------+------------------------------+
> > | ``queues``   | number of entries in queue[] |
> > +--------------+------------------------------+
> > | ``queue[]``  | queue indices to use         |
> > +--------------+------------------------------+
> > 
> > ``PF`` (action)
> > ^^^^^^^^^^^^^^^
> > 
> > Redirects packets to the physical function (PF) of the current device.
> > 
> > - No configurable property.
> > - Terminating by default.
> > 
> > +---------------+
> > | PF            |
> > +===============+
> > | no properties |
> > +---------------+
> > 
> > ``VF`` (action)
> > ^^^^^^^^^^^^^^^
> > 
> > Redirects packets to the virtual function (VF) of the current device with
> > the specified ID.
> > 
> > - Terminating by default.
> > 
> > +---------------------------------------+
> > | VF                                    |
> > +========+==============================+
> > | ``id`` | VF ID to redirect packets to |
> > +--------+------------------------------+
> > 
> > Planned types
> > ~~~~~~~~~~~~~
> > 
> > Other action types are planned but not defined yet. These actions will add
> > the ability to alter matching packets in several ways, such as performing
> > encapsulation/decapsulation of tunnel headers on specific flows.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Rules management
> > ----------------
> > 
> > A simple API with only four functions is provided to fully manage flows.
> > 
> > Each created flow rule is associated with an opaque, PMD-specific handle
> > pointer. The application is responsible for keeping it until the rule is
> > destroyed.
> > 
> > Flow rules are defined with ``struct rte_flow``.
> > 
> > Validation
> > ~~~~~~~~~~
> > 
> > Given that expressing a definite set of device capabilities with this API is
> > not practical, a dedicated function is provided to check if a flow rule is
> > supported and can be created.
> > 
> > ::
> > 
> >  int
> >  rte_flow_validate(uint8_t port_id,
> >                    const struct rte_flow_pattern *pattern,
> >                    const struct rte_flow_actions *actions);
> > 
> > While this function has no effect on the target device, the flow rule is
> > validated against its current configuration state and the returned value
> > should be considered valid by the caller for that state only.
> > 
> > The returned value is guaranteed to remain valid only as long as no
> > successful calls to rte_flow_create() or rte_flow_destroy() are made in
> > the meantime and no device parameter affecting flow rules in any way is
> > modified, due to possible collisions or resource limitations (although in
> > such cases ``EINVAL`` should not be returned).
> > 
> > Arguments:
> > 
> > - ``port_id``: port identifier of Ethernet device.
> > - ``pattern``: pattern specification to check.
> > - ``actions``: actions associated with the flow definition.
> > 
> > Return value:
> > 
> > - **0** if the flow rule is valid and can be created. A negative errno
> >   value is returned otherwise (``rte_errno`` is also set); the following
> >   errors are defined:
> > - ``-EINVAL``: unknown or invalid rule specification.
> > - ``-ENOTSUP``: valid but unsupported rule specification (e.g. partial masks
> >   are unsupported).
> > - ``-EEXIST``: collision with an existing rule.
> > - ``-ENOMEM``: not enough resources.
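> > 
> > Usage sketch, checking at initialization time whether a needed rule type
> > is supported (``pattern`` and ``actions`` built as in previous sections;
> > the fallback function is hypothetical):
> > 
> > ::
> > 
> >  if (rte_flow_validate(port_id, &pattern, &actions) != 0)
> >      /* rte_errno is one of EINVAL, ENOTSUP, EEXIST or ENOMEM */
> >      setup_software_fallback();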
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Creation
> > ~~~~~~~~
> > 
> > Creating a flow rule is similar to validating one, except the rule is
> > actually created.
> > 
> > ::
> > 
> >  struct rte_flow *
> >  rte_flow_create(uint8_t port_id,
> >                  const struct rte_flow_pattern *pattern,
> >                  const struct rte_flow_actions *actions);
> > 
> > Arguments:
> > 
> > - ``port_id``: port identifier of Ethernet device.
> > - ``pattern``: pattern specification to add.
> > - ``actions``: actions associated with the flow definition.
> > 
> > Return value:
> > 
> > A valid flow pointer in case of success, NULL otherwise and ``rte_errno`` is
> > set to the positive version of one of the error codes defined for
> > ``rte_flow_validate()``.
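> > 
> > Usage sketch (the rule is assumed to have been validated beforehand):
> > 
> > ::
> > 
> >  struct rte_flow *flow;
> > 
> >  flow = rte_flow_create(port_id, &pattern, &actions);
> >  if (flow == NULL)
> >      return -rte_errno;
> >  /* Keep the handle: it is needed to query and destroy the rule. */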
> [Sugesh] : Kind of an implementation specific query. What if the application
> tries to add duplicate rules? Does the API create a new flow entry for every
> API call?

If an application adds duplicate rules at a given priority level, the second
one may return an error depending on the PMD. Collisions are sometimes
trivial to detect (such as the same pattern twice), others not so much (one
matching an Ethernet header only, the other one matching an IP header only).

Either way if a packet is matched by two rules at a given priority level,
what happens is described in 3.3 (High level design) and 4.4.1 (Priorities).

Applications are responsible for not relying on the PMD to detect these, or
should use a single priority level for each rule to make things clear.

However since the number of HW priority levels is finite and possibly small,
they must also make sure not to waste them. My advice is to only use
priority levels when it cannot be proven that rules do not collide.

If all you have is perfect matching rules without wildcards and all of them
match the same number of layers, a single priority level is fine.

> [Sugesh] Another concern is the cost and time of installing these rules
> in the hardware. Can we make these APIs time bound (or at least have an
> option to set a time limit to execute these APIs), so that the application
> doesn't have to wait so long when installing and deleting flows with
> slow hardware/NICs. What do you think? Most of the datapath flow
> installations are dynamic and triggered only when there is ingress
> traffic. Delays in flow insertion/deletion have unpredictable consequences.

This API is (currently) aimed at the control path only, and must indeed be
assumed to be slow. Creating millions of rules may take quite long as it may
involve syscalls and other time-consuming synchronization things on the PMD
side.

So currently there is no plan to have rules added from the data path with
time constraints. I think it would be implemented through a different set of
functions anyway.

I do not think adding time limits is practical, even specifying in the API
that creating a single flow rule must take less than a maximum number of
seconds in order to be effective is too much of a constraint (applications
that create all flows during init may not care after all).

You should consider in any case that modifying flow rules will always be
slower than receiving packets, there is no way around that. Applications
have to live with it and provide a software fallback for incoming packets
while managing flow rules.

Moreover, think about what happens when you hit the maximum number of flow
rules and cannot create any more. Applications need to implement some kind
of fallback in their data path.

Offloading flows in HW is also only useful if they live much longer than the
time taken to create and delete them. Perhaps applications may choose to do
so after detecting long lived flows such as TCP sessions.

You may have one separate control thread dedicated to manage flows and
keep your normal control thread unaffected by delays. Several threads can
even be dedicated, one per device.

> [Sugesh] Another query is on the synchronization part. What if the same rules are
> handled from different threads? Is the application responsible for handling the
> concurrent hardware programming?

Like most (if not all) DPDK APIs, applications are responsible for managing
locking issues as described in 4.3 (Behavior). Since this is a control path
API and applications usually have a single control thread, locking should
not be necessary in most cases.

Regarding my above comment about using several control threads to manage
different devices, section 4.3 says:
 
 "There is no provision for reentrancy/multi-thread safety, although nothing
 should prevent different devices from being configured at the same
 time. PMDs may protect their control path functions accordingly."

I'd like to emphasize it is not "per port" but "per device", since in a few
cases a configurable resource is shared by several ports. It may be
difficult for applications to determine which ports are shared by a given
device but this falls outside the scope of this API.

Do you think adding the guarantee that it is always safe to configure two
different ports simultaneously without locking from the application side is
necessary? In which case the PMD would be responsible for locking shared
resources.

> > Destruction
> > ~~~~~~~~~~~
> > 
> > Flow rule destruction is not automatic, and a queue should not be
> > released if any are still attached to it. Applications must take care of
> > performing this step before releasing resources.
> > 
> > ::
> > 
> >  int
> >  rte_flow_destroy(uint8_t port_id,
> >                   struct rte_flow *flow);
> > 
> > 
> [Sugesh] I would suggest that having a clean-up API would be really useful, as
> releasing a queue (is it applicable to releasing a port too?) does not guarantee
> automatic flow destruction.

Would something like rte_flow_flush(port_id) do the trick? I wanted to
emphasize in this first draft that applications should really keep the flow
pointers around in order to manage/destroy them. It is their responsibility,
not PMD's.
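
Something along these lines, if it turns out to be needed (hypothetical,
not part of the proposal yet):

    /* Destroy all flow rules associated with a port at once. */
    int
    rte_flow_flush(uint8_t port_id);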

> This way the application can initialize the port,
> clean up all the existing rules and create new rules on a clean slate.

No resource can be released as long as a flow rule is using it (bad things
may happen otherwise), all flow rules must be destroyed first, thus none can
possibly remain after initializing a port. It is assumed that PMDs do
automatic clean up during init if necessary to ensure this.

> > Failure to destroy a flow rule may occur when other flow rules depend on it,
> > and destroying it would result in an inconsistent state.
> > 
> > This function is only guaranteed to succeed if flow rules are destroyed in
> > reverse order of their creation.
> > 
> > Arguments:
> > 
> > - ``port_id``: port identifier of Ethernet device.
> > - ``flow``: flow rule to destroy.
> > 
> > Return value:
> > 
> > - **0** on success, a negative errno value otherwise and ``rte_errno`` is
> >   set.
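> > 
> > Usage sketch, tearing down rules in reverse order of creation
> > (``flows[]`` and ``n_flows`` are assumed to be maintained by the
> > application):
> > 
> > ::
> > 
> >  unsigned int i = n_flows;
> > 
> >  while (i--)
> >      if (rte_flow_destroy(port_id, flows[i]) != 0)
> >          break; /* rule still in use, rte_errno is set */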
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Query
> > ~~~~~
> > 
> > Query an existing flow rule.
> > 
> > This function allows retrieving flow-specific data such as counters. Data
> > is gathered by special actions which must be present in the flow rule
> > definition.
> > 
> > ::
> > 
> >  int
> >  rte_flow_query(uint8_t port_id,
> >                 struct rte_flow *flow,
> >                 enum rte_flow_action_type action,
> >                 void *data);
> > 
> > Arguments:
> > 
> > - ``port_id``: port identifier of Ethernet device.
> > - ``flow``: flow rule to query.
> > - ``action``: action type to query.
> > - ``data``: pointer to storage for the associated query data type.
> > 
> > Return value:
> > 
> > - **0** on success, a negative errno value otherwise and ``rte_errno`` is
> >   set.
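> > 
> > Usage sketch, retrieving and resetting the hits counter of a rule that
> > includes a COUNT action (the enum name and the handler function are
> > assumptions):
> > 
> > ::
> > 
> >  struct rte_flow_query_count count = { .reset = 1 };
> > 
> >  if (rte_flow_query(port_id, flow, RTE_FLOW_ACTION_TYPE_COUNT,
> >                     &count) == 0)
> >      handle_hits(count.hits); /* counter reset after the query */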
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Behavior
> > --------
> > 
> > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> >   returned).
> > 
> > - There is no provision for reentrancy/multi-thread safety, although nothing
> >   should prevent different devices from being configured at the same
> >   time. PMDs may protect their control path functions accordingly.
> > 
> > - Stopping the data path (TX/RX) should not be necessary when managing
> >   flow rules. If this cannot be achieved naturally or with workarounds
> >   (such as temporarily replacing the burst function pointers), an
> >   appropriate error code must be returned (``EBUSY``).
> > 
> > - PMDs, not applications, are responsible for maintaining flow rules
> >   configuration when stopping and restarting a port or performing other
> >   actions which may affect them. They can only be destroyed explicitly.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> [Sugesh] Query all the rules for a specific port/queue? Useful when adding and
> deleting ports and queues dynamically according to need. I am not sure
> what the other different use cases for these APIs are. But I feel it would make
> it much easier to manage flows from the application. What do you think?

Not sure, that seems to fall out of the scope of this API. As described,
applications already store the related rte_flow pointers. Accordingly, they
know how many rules are associated to a given port. They need both a port ID
and a flow rule pointer to destroy them after all.

Now perhaps something to convert back an existing rte_flow to a pattern and
a list of actions, however I cannot see an immediate use case for it.

What you describe seems to be doable through a front-end API, I think
keeping this one as low-level as possible with only basic actions is better
right now. I'll keep your suggestion in mind.

> > Compatibility
> > -------------
> > 
> > No known hardware implementation supports all the features described in
> > this document.
> > 
> > Unsupported features or combinations are not expected to be fully
> > emulated in software by PMDs for performance reasons. Partially supported
> > features may be completed in software as long as hardware performs most of
> > the work (such as queue redirection and packet recognition).
> > 
> > However PMDs are expected to do their best to satisfy application requests
> > by working around hardware limitations as long as doing so does not affect
> > the behavior of existing flow rules.
> > 
> > The following sections provide a few examples of such cases, they are based
> > on limitations built into the previous APIs.
> > 
> > Global bitmasks
> > ~~~~~~~~~~~~~~~
> > 
> > Each flow rule comes with its own, per-layer bitmasks, while hardware may
> > support only a single, device-wide bitmask for a given layer type, so that
> > two IPv4 rules cannot use different bitmasks.
> > 
> > The expected behavior in this case is that PMDs automatically configure
> > global bitmasks according to the needs of the first created flow rule.
> > 
> > Subsequent rules are allowed only if their bitmasks match those, the
> > ``EEXIST`` error code should be returned otherwise.
> > 
> > Unsupported layer types
> > ~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > Many protocols can be simulated by crafting patterns with the `RAW`_ type.
> > 
> > PMDs can rely on this capability to simulate support for protocols with
> > fixed headers not directly recognized by hardware.
> > 
> > ``ANY`` pattern item
> > ~~~~~~~~~~~~~~~~~~~~
> > 
> > This pattern item stands for anything, which can be difficult to translate
> > to something hardware would understand, particularly if followed by more
> > specific types.
> > 
> > Consider the following pattern:
> > 
> > +---+--------------------------------+
> > | 0 | ETHER                          |
> > +---+--------------------------------+
> > | 1 | ANY (``min`` = 1, ``max`` = 1) |
> > +---+--------------------------------+
> > | 2 | TCP                            |
> > +---+--------------------------------+
> > 
> > Knowing that TCP does not make sense with something other than IPv4 and
> > IPv6 as L3, such a pattern may be translated to two flow rules instead:
> > 
> > +---+--------------------+
> > | 0 | ETHER              |
> > +---+--------------------+
> > | 1 | IPV4 (zeroed mask) |
> > +---+--------------------+
> > | 2 | TCP                |
> > +---+--------------------+
> > 
> > +---+--------------------+
> > | 0 | ETHER              |
> > +---+--------------------+
> > | 1 | IPV6 (zeroed mask) |
> > +---+--------------------+
> > | 2 | TCP                |
> > +---+--------------------+
> > 
> > Note that as soon as an ANY rule covers several layers, this approach may
> > yield a large number of hidden flow rules. It is thus suggested to only
> > support the most common scenarios (anything as L2 and/or L3).
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > Unsupported actions
> > ~~~~~~~~~~~~~~~~~~~
> > 
> > - When combined with a `QUEUE`_ action, packet counting (`COUNT`_) and
> >   tagging (`ID`_) may be implemented in software as long as the target queue
> >   is used by a single rule.
> > 
> > - A rule specifying both `DUP`_ + `QUEUE`_ may be translated to two hidden
> >   rules combining `QUEUE`_ and `PASSTHRU`_.
> > 
> > - When a single target queue is provided, `RSS`_ can also be implemented
> >   through `QUEUE`_.
> > 
> > Flow rules priority
> > ~~~~~~~~~~~~~~~~~~~
> > 
> > While it would naturally make sense, flow rules cannot be assumed to be
> > processed by hardware in the same order as their creation for several
> > reasons:
> > 
> > - They may be managed internally as a tree or a hash table instead of a
> >   list.
> > - Removing a flow rule before adding another one can either put the new
> >   rule at the end of the list or reuse a freed entry.
> > - Duplication may occur when packets are matched by several rules.
> > 
> > For overlapping rules (particularly in order to use the `PASSTHRU`_ action),
> > predictable behavior is only guaranteed by using different priority levels.
> > 
> > Priority levels are not necessarily implemented in hardware, or may be
> > severely limited (e.g. a single priority bit).
> > 
> > For these reasons, priority levels may be implemented purely in software by
> > PMDs.
> > 
> > - For devices expecting flow rules to be added in the correct order, PMDs
> >   may destroy and re-create existing rules after adding a new one with
> >   a higher priority.
> > 
> > - A configurable number of dummy or empty rules can be created at
> >   initialization time to save high priority slots for later.
> > 
> > - In order to save priority levels, PMDs may evaluate whether rules are
> >   likely to collide and adjust their priority accordingly.
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > API migration
> > =============
> > 
> > Exhaustive list of deprecated filter types and how to convert them to
> > generic flow rules.
> > 
> > ``MACVLAN`` to ``ETH`` → ``VF``, ``PF``
> > ---------------------------------------
> > 
> > `MACVLAN`_ can be translated to a basic `ETH`_ flow rule with a `VF
> > (action)`_ or `PF (action)`_ terminating action.
> > 
> > +------------------------------------+
> > | MACVLAN                            |
> > +--------------------------+---------+
> > | Pattern                  | Actions |
> > +===+=====+==========+=====+=========+
> > | 0 | ETH | ``spec`` | any | VF,     |
> > |   |     +----------+-----+ PF      |
> > |   |     | ``mask`` | any |         |
> > +---+-----+----------+-----+---------+
> > 
> > ``ETHERTYPE`` to ``ETH`` → ``QUEUE``, ``DROP``
> > ----------------------------------------------
> > 
> > `ETHERTYPE`_ is basically an `ETH`_ flow rule with `QUEUE`_ or `DROP`_ as
> > a terminating action.
> > 
> > +------------------------------------+
> > | ETHERTYPE                          |
> > +--------------------------+---------+
> > | Pattern                  | Actions |
> > +===+=====+==========+=====+=========+
> > | 0 | ETH | ``spec`` | any | QUEUE,  |
> > |   |     +----------+-----+ DROP    |
> > |   |     | ``mask`` | any |         |
> > +---+-----+----------+-----+---------+
> > 
> > ``FLEXIBLE`` to ``RAW`` → ``QUEUE``
> > -----------------------------------
> > 
> > `FLEXIBLE`_ can be translated to one `RAW`_ pattern with `QUEUE`_ as the
> > terminating action and a defined priority level.
> > 
> > +------------------------------------+
> > | FLEXIBLE                           |
> > +--------------------------+---------+
> > | Pattern                  | Actions |
> > +===+=====+==========+=====+=========+
> > | 0 | RAW | ``spec`` | any | QUEUE   |
> > |   |     +----------+-----+         |
> > |   |     | ``mask`` | any |         |
> > +---+-----+----------+-----+---------+
> > 
> > ``SYN`` to ``TCP`` → ``QUEUE``
> > ------------------------------
> > 
> > `SYN`_ is a `TCP`_ rule with only the ``syn`` bit enabled and masked, and
> > `QUEUE`_ as the terminating action.
> > 
> > Priority level can be set to simulate the high priority bit.
> > 
> > +---------------------------------------------+
> > | SYN                                         |
> > +-----------------------------------+---------+
> > | Pattern                           | Actions |
> > +===+======+==========+=============+=========+
> > | 0 | ETH  | ``spec`` | N/A         | QUEUE   |
> > |   |      +----------+-------------+         |
> > |   |      | ``mask`` | empty       |         |
> > +---+------+----------+-------------+         |
> > | 1 | IPV4 | ``spec`` | N/A         |         |
> > |   |      +----------+-------------+         |
> > |   |      | ``mask`` | empty       |         |
> > +---+------+----------+-------------+         |
> > | 2 | TCP  | ``spec`` | ``syn`` = 1 |         |
> > |   |      +----------+-------------+         |
> > |   |      | ``mask`` | ``syn`` = 1 |         |
> > +---+------+----------+-------------+---------+
> > 
> > ``NTUPLE`` to ``IPV4``, ``TCP``, ``UDP`` → ``QUEUE``
> > ----------------------------------------------------
> > 
> > `NTUPLE`_ is similar to specifying an empty L2, `IPV4`_ as L3 with `TCP`_ or
> > `UDP`_ as L4 and `QUEUE`_ as the terminating action.
> > 
> > A priority level can be specified as well.
> > 
> > +---------------------------------------+
> > | NTUPLE                                |
> > +-----------------------------+---------+
> > | Pattern                     | Actions |
> > +===+======+==========+=======+=========+
> > | 0 | ETH  | ``spec`` | N/A   | QUEUE   |
> > |   |      +----------+-------+         |
> > |   |      | ``mask`` | empty |         |
> > +---+------+----------+-------+         |
> > | 1 | IPV4 | ``spec`` | any   |         |
> > |   |      +----------+-------+         |
> > |   |      | ``mask`` | any   |         |
> > +---+------+----------+-------+         |
> > | 2 | TCP, | ``spec`` | any   |         |
> > |   | UDP  +----------+-------+         |
> > |   |      | ``mask`` | any   |         |
> > +---+------+----------+-------+---------+
> > 
> > ``TUNNEL`` to ``ETH``, ``IPV4``, ``IPV6``, ``VXLAN`` (or other) → ``QUEUE``
> > ---------------------------------------------------------------------------
> > 
> > `TUNNEL`_ matches common IPv4 and IPv6 L3/L4-based tunnel types.
> > 
> > In the following table, `ANY`_ is used to cover the optional L4.
> > 
> > +------------------------------------------------+
> > | TUNNEL                                         |
> > +--------------------------------------+---------+
> > | Pattern                              | Actions |
> > +===+=========+==========+=============+=========+
> > | 0 | ETH     | ``spec`` | any         | QUEUE   |
> > |   |         +----------+-------------+         |
> > |   |         | ``mask`` | any         |         |
> > +---+---------+----------+-------------+         |
> > | 1 | IPV4,   | ``spec`` | any         |         |
> > |   | IPV6    +----------+-------------+         |
> > |   |         | ``mask`` | any         |         |
> > +---+---------+----------+-------------+         |
> > | 2 | ANY     | ``spec`` | ``min`` = 0 |         |
> > |   |         |          +-------------+         |
> > |   |         |          | ``max`` = 0 |         |
> > |   |         +----------+-------------+         |
> > |   |         | ``mask`` | N/A         |         |
> > +---+---------+----------+-------------+         |
> > | 3 | VXLAN,  | ``spec`` | any         |         |
> > |   | GENEVE, +----------+-------------+         |
> > |   | TEREDO, | ``mask`` | any         |         |
> > |   | NVGRE,  |          |             |         |
> > |   | GRE,    |          |             |         |
> > |   | ...     |          |             |         |
> > +---+---------+----------+-------------+---------+
> > 
> > .. raw:: pdf
> > 
> >    PageBreak
> > 
> > ``FDIR`` to most item types → ``QUEUE``, ``DROP``, ``PASSTHRU``
> > ---------------------------------------------------------------
> > 
> > `FDIR`_ is more complex than any other type; there are several methods to
> > emulate its functionality. It is summarized for the most part in the table
> > below.
> > 
> > A few features are intentionally not supported:
> > 
> > - The ability to configure the matching input set and masks for the entire
> >   device; PMDs should take care of this automatically according to flow rules.
> > 
> > - Returning four or eight bytes of matched data when using flex bytes
> >   filtering. Although a specific action could implement it, it conflicts
> >   with the much more useful 32-bit tagging on devices that support it.
> > 
> > - Side effects on RSS processing of the entire device. Flow rules that
> >   conflict with the current device configuration should not be
> >   allowed. Similarly, device configuration should not be allowed when it
> >   affects existing flow rules.
> > 
> > - Device modes of operation. "none" is unsupported since filtering cannot be
> >   disabled as long as a flow rule is present.
> > 
> > - "MAC VLAN" or "tunnel" perfect matching modes should be automatically
> > set
> >   according to the created flow rules.
> > 
> > +----------------------------------------------+
> > | FDIR                                         |
> > +---------------------------------+------------+
> > | Pattern                         | Actions    |
> > +===+============+==========+=====+============+
> > | 0 | ETH,       | ``spec`` | any | QUEUE,     |
> > |   | RAW        +----------+-----+ DROP,      |
> > |   |            | ``mask`` | any | PASSTHRU   |
> > +---+------------+----------+-----+------------+
> > | 1 | IPV4,      | ``spec`` | any | ID         |
> > |   | IPV6       +----------+-----+ (optional) |
> > |   |            | ``mask`` | any |            |
> > +---+------------+----------+-----+            |
> > | 2 | TCP,       | ``spec`` | any |            |
> > |   | UDP,       +----------+-----+            |
> > |   | SCTP       | ``mask`` | any |            |
> > +---+------------+----------+-----+            |
> > | 3 | VF,        | ``spec`` | any |            |
> > |   | PF,        +----------+-----+            |
> > |   | SIGNATURE  | ``mask`` | any |            |
> > |   | (optional) |          |     |            |
> > +---+------------+----------+-----+------------+
> > 
> > ``HASH``
> > ~~~~~~~~
> > 
> > Hashing configuration is set per rule through the `SIGNATURE`_ item.
> > 
> > Since it is usually a global device setting, all flow rules created with
> > this item may have to share the same specification.
> > 
> > ``L2_TUNNEL`` to ``VOID`` → ``VXLAN`` (or others)
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > All packets are matched. This type alters incoming packets to encapsulate
> > them in a chosen tunnel type, optionally redirecting them to a VF as well.
> > 
> > The destination pool for tag based forwarding can be emulated with other
> > flow rules using `DUP`_ as the action.
> > 
> > +----------------------------------------+
> > | L2_TUNNEL                              |
> > +---------------------------+------------+
> > | Pattern                   | Actions    |
> > +===+======+==========+=====+============+
> > | 0 | VOID | ``spec`` | N/A | VXLAN,     |
> > |   |      |          |     | GENEVE,    |
> > |   |      |          |     | ...        |
> > |   |      +----------+-----+------------+
> > |   |      | ``mask`` | N/A | VF         |
> > |   |      |          |     | (optional) |
> > +---+------+----------+-----+------------+
> > 
> > --
> > Adrien Mazarguil
> > 6WIND

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-08 11:11 ` Liang, Cunming
@ 2016-07-08 12:38   ` Bruce Richardson
  2016-07-08 13:25   ` Adrien Mazarguil
  1 sibling, 0 replies; 54+ messages in thread
From: Bruce Richardson @ 2016-07-08 12:38 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu, Jan Medala,
	John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, Pablo de Lara,
	Olga Shern

On Fri, Jul 08, 2016 at 07:11:28PM +0800, Liang, Cunming wrote:

Please Cunming, when replying in the middle of a long email - of which this
is a perfect example - delete the large chunks of content you are not replying
to. I had to page down multiple times to find the new comments you wrote, and
even then I wasn't sure that I hadn't missed something along the way.

/Bruce

> Hi Adrien,
> 
> On 7/6/2016 2:16 AM, Adrien Mazarguil wrote:
> >Hi All,
> >
<snip>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-05 18:16 Adrien Mazarguil
  2016-07-07  7:14 ` Lu, Wenzhuo
  2016-07-07 23:15 ` Chandran, Sugesh
@ 2016-07-08 11:11 ` Liang, Cunming
  2016-07-08 12:38   ` Bruce Richardson
  2016-07-08 13:25   ` Adrien Mazarguil
  2016-07-11 10:41 ` Jerin Jacob
  2016-07-21  8:13 ` Rahul Lakkireddy
  4 siblings, 2 replies; 54+ messages in thread
From: Liang, Cunming @ 2016-07-08 11:11 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu, Jan Medala,
	John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, Pablo de Lara,
	Olga Shern

Hi Adrien,

On 7/6/2016 2:16 AM, Adrien Mazarguil wrote:
> Hi All,
>
> First, forgive me for this large message, I know our mailboxes already
> suffer quite a bit from the amount of traffic on this ML.
>
> This is not exactly yet another thread about how flow director should be
> extended, rather about a brand new API to handle filtering and
> classification for incoming packets in the most PMD-generic and
> application-friendly fashion we can come up with. Reasons described below.
>
> I think this topic is important enough to include both the users of this API
> as well as PMD maintainers. So far I have CC'ed librte_ether (especially
> rte_eth_ctrl.h contributors), testpmd and PMD maintainers (with and without
> a .filter_ctrl implementation), but if you know application maintainers
> other than testpmd who use FDIR or might be interested in this discussion,
> feel free to add them.
>
> The issues we found with the current approach are already summarized in the
> following document, but here is a quick summary for TL;DR folks:
>
> - PMDs do not expose a common set of filter types and even when they do,
>    their behavior more or less differs.
>
> - Applications need to determine and adapt to device-specific limitations
>    and quirks on their own, without help from PMDs.
>
> - Writing an application that creates flow rules targeting all devices
>    supported by DPDK is thus difficult, if not impossible.
>
> - The current API has too many unspecified areas (particularly regarding
>    side effects of flow rules) that make PMD implementation tricky.
>
> This RFC API handles everything currently supported by .filter_ctrl, the
> idea being to reimplement all of these to make them fully usable by
> applications in a more generic and well defined fashion. It has a very small
> set of mandatory features and an easy method to let applications probe for
> supported capabilities.
>
> The only downside is more work for the software control side of PMDs because
> they have to adapt to the API instead of the reverse. I think helpers can be
> added to EAL to assist with this.
>
> HTML version:
>
>   https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
>
> PDF version:
>
>   https://rawgit.com/6WIND/rte_flow/master/rte_flow.pdf
>
> Related draft header file (for reference while reading the specification):
>
>   https://raw.githubusercontent.com/6WIND/rte_flow/master/rte_flow.h
>
> Git tree for completeness (latest .rst version can be retrieved from here):
>
>   https://github.com/6WIND/rte_flow
>
> What follows is the ReST source of the above, for inline comments and
> discussion. I intend to update that specification accordingly.
>
> ========================
> Generic filter interface
> ========================
>
> .. footer::
>
>     v0.6
>
> .. contents::
> .. sectnum::
> .. raw:: pdf
>
>     PageBreak
>
> Overview
> ========
>
> DPDK provides several competing interfaces added over time to perform packet
> matching and related actions such as filtering and classification.
>
> They must be extended to implement the features supported by newer devices
> in order to expose them to applications, however the current design has
> several drawbacks:
>
> - Complicated filter combinations which have not been hard-coded cannot be
>    expressed.
> - Prone to API/ABI breakage when new features must be added to an existing
>    filter type, which frequently happens.
>
>  From an application point of view:
>
> - Having disparate interfaces, all optional and lacking in features, does not
>    make this API easy to use.
> - Seemingly arbitrary built-in limitations of filter types based on the
>    device they were initially designed for.
> - Undefined relationship between different filter types.
> - High complexity, considerable undocumented and/or undefined behavior.
>
> Considering the growing number of devices supported by DPDK, adding a new
> filter type each time a new feature must be implemented is not sustainable
> in the long term. Applications not written to target a specific device
> cannot really benefit from such an API.
>
> For these reasons, this document defines an extensible unified API that
> encompasses and supersedes these legacy filter types.
>
> .. raw:: pdf
>
>     PageBreak
>
> Current API
> ===========
>
> Rationale
> ---------
>
> The reason several competing (and mostly overlapping) filtering APIs are
> present in DPDK is due to its nature as a thin layer between hardware and
> software.
>
> Each subsequent interface has been added to better match the capabilities
> and limitations of the latest supported device, which usually happened to
> need an incompatible configuration approach. Because of this, many ended up
> device-centric and not usable by applications that were not written for that
> particular device.
>
> This document is not the first attempt to address this proliferation issue;
> in fact, a lot of work has already been done to create a more generic
> interface while somewhat keeping compatibility with legacy ones through a
> common call interface (``rte_eth_dev_filter_ctrl()`` with the
> ``.filter_ctrl`` PMD callback in ``rte_ethdev.h``).
>
> Today, these previously incompatible interfaces are known as filter types
> (``RTE_ETH_FILTER_*`` from ``enum rte_filter_type`` in ``rte_eth_ctrl.h``).
>
> However, while trivial to extend with new types, it only shifted the
> underlying problem as applications still need to be written for one kind of
> filter type, which, as described in the following sections, is not
> necessarily implemented by all PMDs that support filtering.
>
> .. raw:: pdf
>
>     PageBreak
>
> Filter types
> ------------
>
> This section summarizes the capabilities of each filter type.
>
> Although the following list is exhaustive, the description of individual
> types may contain inaccuracies due to the lack of documentation or usage
> examples.
>
> Note: names are prefixed with ``RTE_ETH_FILTER_``.
>
> ``MACVLAN``
> ~~~~~~~~~~~
>
> Matching:
>
> - L2 source/destination addresses.
> - Optional 802.1Q VLAN ID.
> - Masking individual fields on a rule basis is not supported.
>
> Action:
>
> - Packets are redirected either to a given VF device using its ID or to the
>    PF.
>
> ``ETHERTYPE``
> ~~~~~~~~~~~~~
>
> Matching:
>
> - L2 source/destination addresses (optional).
> - Ethertype (no VLAN ID?).
> - Masking individual fields on a rule basis is not supported.
>
> Action:
>
> - Receive packets on a given queue.
> - Drop packets.
>
> ``FLEXIBLE``
> ~~~~~~~~~~~~
>
> Matching:
>
> - At most 128 consecutive bytes anywhere in packets.
> - Masking is supported with byte granularity.
> - Priorities are supported (relative to this filter type, undefined
>    otherwise).
>
> Action:
>
> - Receive packets on a given queue.
>
> ``SYN``
> ~~~~~~~
>
> Matching:
>
> - TCP SYN packets only.
> - One high priority bit can be set to give the highest possible priority to
>    this type when other filters with different types are configured.
>
> Action:
>
> - Receive packets on a given queue.
>
> ``NTUPLE``
> ~~~~~~~~~~
>
> Matching:
>
> - Source/destination IPv4 addresses (optional in 2-tuple mode).
> - Source/destination TCP/UDP port (mandatory in 2 and 5-tuple modes).
> - L4 protocol (2 and 5-tuple modes).
> - Masking individual fields is supported.
> - TCP flags.
> - Up to 7 levels of priority relative to this filter type, undefined
>    otherwise.
> - No IPv6.
>
> Action:
>
> - Receive packets on a given queue.
>
> ``TUNNEL``
> ~~~~~~~~~~
>
> Matching:
>
> - Outer L2 source/destination addresses.
> - Inner L2 source/destination addresses.
> - Inner VLAN ID.
> - IPv4/IPv6 source (destination?) address.
> - Tunnel type to match (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE, 802.1BR
>    E-Tag).
> - Tenant ID for tunneling protocols that have one.
> - Any combination of the above can be specified.
> - Masking individual fields on a rule basis is not supported.
>
> Action:
>
> - Receive packets on a given queue.
>
> .. raw:: pdf
>
>     PageBreak
>
> ``FDIR``
> ~~~~~~~~
>
> Queries:
>
> - Device capabilities and limitations.
> - Device statistics about configured filters (resource usage, collisions).
> - Device configuration (matching input set and masks)
>
> Matching:
>
> - Device mode of operation: none (to disable filtering), signature
>    (hash-based dispatching from masked fields) or perfect (either MAC VLAN or
>    tunnel).
> - L2 Ethertype.
> - Outer L2 destination address (MAC VLAN mode).
> - Inner L2 destination address, tunnel type (NVGRE, VXLAN) and tunnel ID
>    (tunnel mode).
> - IPv4 source/destination addresses, ToS, TTL and protocol fields.
> - IPv6 source/destination addresses, TC, protocol and hop limits fields.
> - UDP source/destination IPv4/IPv6 and ports.
> - TCP source/destination IPv4/IPv6 and ports.
> - SCTP source/destination IPv4/IPv6, ports and verification tag field.
> - Note that only one protocol type is matched at once (either only L2
>    Ethertype, basic IPv6, IPv4+UDP, IPv4+TCP and so on).
> - VLAN TCI (extended API).
> - At most 16 bytes to match in payload (extended API). A global device
>    look-up table specifies for each possible protocol layer (unknown, raw,
>    L2, L3, L4) the offset to use for each byte (they do not need to be
>    contiguous) and the related bitmask.
> - Whether packet is addressed to PF or VF, in that case its ID can be
>    matched as well (extended API).
> - Masking most of the above fields is supported, but simultaneously affects
>    all filters configured on a device.
> - Input set can be modified in a similar fashion for a given device to
>    ignore individual fields of filters (i.e. do not match the destination
>    address in an IPv4 filter, refer to **RTE_ETH_INPUT_SET_**
>    macros). Configuring this also affects RSS processing on **i40e**.
> - Filters can also provide 32 bits of arbitrary data to return as part of
>    matched packets.
>
> Action:
>
> - **RTE_ETH_FDIR_ACCEPT**: receive (accept) packet on a given queue.
> - **RTE_ETH_FDIR_REJECT**: drop packet immediately.
> - **RTE_ETH_FDIR_PASSTHRU**: similar to accept for the last filter in the list,
>    otherwise process it with subsequent filters.
> - For accepted packets and if requested by filter, either 32 bits of
>    arbitrary data and four bytes of matched payload (only in case of flex
>    bytes matching), or eight bytes of matched payload (flex also) are added
>    to meta data.
>
> .. raw:: pdf
>
>     PageBreak
>
> ``HASH``
> ~~~~~~~~
>
> Not an actual filter type. Provides and retrieves the global device
> configuration (per port or entire NIC) for hash functions and their
> properties.
>
> Hash function selection: "default" (keep current), XOR or Toeplitz.
>
> This function can be configured per flow type (**RTE_ETH_FLOW_**
> definitions), supported types are:
>
> - Unknown.
> - Raw.
> - Fragmented or non-fragmented IPv4.
> - Non-fragmented IPv4 with L4 (TCP, UDP, SCTP or other).
> - Fragmented or non-fragmented IPv6.
> - Non-fragmented IPv6 with L4 (TCP, UDP, SCTP or other).
> - L2 payload.
> - IPv6 with extensions.
> - IPv6 with L4 (TCP, UDP) and extensions.
>
> ``L2_TUNNEL``
> ~~~~~~~~~~~~~
>
> Matching:
>
> - All packets received on a given port.
>
> Action:
>
> - Add tunnel encapsulation (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE,
>    802.1BR E-Tag) using the provided Ethertype and tunnel ID (only E-Tag
>    is implemented at the moment).
> - VF ID to use for tag insertion (currently unused).
> - Destination pool for tag based forwarding (pools are IDs that can be
>    assigned to ports, duplication occurs if the same ID is shared by several
>    ports of the same NIC).
>
> .. raw:: pdf
>
>     PageBreak
>
> Driver support
> --------------
>
> ======== ======= ========= ======== === ====== ====== ==== ==== =========
> Driver   MACVLAN ETHERTYPE FLEXIBLE SYN NTUPLE TUNNEL FDIR HASH L2_TUNNEL
> ======== ======= ========= ======== === ====== ====== ==== ==== =========
> bnx2x
> cxgbe
> e1000            yes       yes      yes yes
> ena
> enic                                                  yes
> fm10k
> i40e     yes     yes                           yes    yes  yes
> ixgbe            yes                yes yes           yes       yes
> mlx4
> mlx5                                                  yes
> szedata2
> ======== ======= ========= ======== === ====== ====== ==== ==== =========
>
> Flow director
> -------------
>
> Flow director (FDIR) is the name of the most capable filter type, which
> covers most features offered by others. As such, it is the most widespread
> in PMDs that support filtering (i.e. all of them besides **e1000**).
>
> It is also the only type that allows an arbitrary 32-bit value provided by
> applications to be attached to a filter and returned with matching packets
> instead of relying on the destination queue to recognize flows.
>
> Unfortunately, even FDIR requires applications to be aware of low-level
> capabilities and limitations (most of which come directly from **ixgbe** and
> **i40e**):
>
> - Bitmasks are set globally per device (port?), not per filter.
> - Configuration state is not expected to be saved by the driver, and
>    stopping/restarting a port requires the application to perform it again
>    (API documentation is also unclear about this).
> - Monolithic approach with ABI issues as soon as a new kind of flow or
>    combination needs to be supported.
> - Cryptic global statistics/counters.
> - Unclear about how priorities are managed; filters seem to be arranged as a
>    linked list in hardware (possibly related to configuration order).
>
> Packet alteration
> -----------------
>
> One interesting feature is that the L2 tunnel filter type implements the
> ability to alter incoming packets through a filter (in this case to
> encapsulate them), thus the **mlx5** flow encap/decap features are not a
> foreign concept.
>
> .. raw:: pdf
>
>     PageBreak
>
> Proposed API
> ============
>
> Terminology
> -----------
>
> - **Filtering API**: overall framework affecting the fate of selected
>    packets, covers everything described in this document.
> - **Matching pattern**: properties to look for in received packets, a
>    combination of any number of items.
> - **Pattern item**: part of a pattern that either matches packet data
>    (protocol header, payload or derived information), or specifies properties
>    of the pattern itself.
> - **Actions**: what needs to be done when a packet matches a pattern.
> - **Flow rule**: this is the result of combining a *matching pattern* with
>    *actions*.
> - **Filter rule**: a less generic term than *flow rule*, can otherwise be
>    used interchangeably.
> - **Hit**: a flow rule is said to be *hit* when processing a matching
>    packet.
>
> Requirements
> ------------
>
> As described in the previous section, there is a growing need for a common
> method to configure filtering and related actions in a hardware independent
> fashion.
>
> The filtering API should not disallow any filter combination by design and
> must remain as simple as possible to use. It can simply be defined as a
> method to perform one or several actions on selected packets.
>
> PMDs are aware of the capabilities of the device they manage and should be
> responsible for preventing unsupported or conflicting combinations.
>
> This approach is fundamentally different as it places most of the burden on
> the software side of the PMD instead of having device capabilities directly
> mapped to API functions, then expecting applications to work around ensuing
> compatibility issues.
>
> Requirements for a new API:
>
> - Flexible and extensible without causing API/ABI problems for existing
>    applications.
> - Should be unambiguous and easy to use.
> - Support existing filtering features and actions listed in `Filter types`_.
> - Support packet alteration.
> - In case of overlapping filters, their priority should be well documented.
> - Support filter queries (for example to retrieve counters).
>
> .. raw:: pdf
>
>     PageBreak
>
> High level design
> -----------------
>
> The chosen approach to make filtering as generic as possible is to express
> matching patterns through lists of items instead of the flat structures used
> in DPDK today, enabling combinations that are not predefined and thus
> greater versatility.
>
> Flow rules can have several distinct actions (such as counting,
> encapsulating, decapsulating before redirecting packets to a particular
> queue, etc.), instead of relying on several rules to achieve this and having
> applications deal with hardware implementation details regarding their
> order.
>
> Support for different priority levels on a rule basis is provided, for
> example in order to force a more specific rule to come before a more generic
> one for packets matched by both; however, hardware support for more than a
> single priority level cannot be guaranteed. When supported, the number of
> available priority levels is usually low, which is why they can also be
> implemented in software by PMDs (e.g. to simulate missing priority levels by
> reordering rules).
>
> In order to remain as hardware agnostic as possible, by default all rules
> are considered to have the same priority, which means that the order between
> overlapping rules (when a packet is matched by several filters) is
> undefined; packet duplication may even occur as a result.
>
> PMDs may refuse to create overlapping rules at a given priority level when
> they can be detected (e.g. if a pattern matches an existing filter).
>
> Thus predictable results for a given priority level can only be achieved
> with non-overlapping rules, using perfect matching on all protocol layers.
>
> Support for multiple actions per rule may be implemented internally on top
> of non-default hardware priorities, as a result both features may not be
> simultaneously available to applications.
>
> Considering that allowed pattern/actions combinations cannot be known in
> advance and would result in an impractically large number of capabilities to
> expose, a method is provided to validate a given rule from the current
> device configuration state without actually adding it (akin to a "dry run"
> mode).
>
> This enables applications to check if the rule types they need are supported
> at initialization time, before starting their data path. This method can be
> used anytime, its only requirement being that the resources needed by a rule
> must exist (e.g. a target RX queue must be configured first).
>
> Each defined rule is associated with an opaque handle managed by the PMD;
> applications are responsible for keeping it. These can be used for queries
> and rules management, such as retrieving counters or other data and
> destroying them.
>
> Handles must be destroyed before releasing associated resources such as
> queues.
>
> Integration
> -----------
>
> To avoid ABI breakage, this new interface will be implemented through the
> existing filtering control framework (``rte_eth_dev_filter_ctrl()``) using
> **RTE_ETH_FILTER_GENERIC** as a new filter type.
>
> However a public front-end API described in `Rules management`_ will
> be added as the preferred method to use it.
>
> Once discussions with the community have converged to a definite API, legacy
> filter types should be deprecated and a deadline defined to remove their
> support entirely.
>
> PMDs will have to be gradually converted to **RTE_ETH_FILTER_GENERIC** or
> drop filtering support entirely. Less maintained PMDs for older hardware may
> lose support at this point.
>
> The notion of filter type will then be deprecated and subsequently dropped
> to avoid confusion between both frameworks.
>
> Implementation details
> ======================
>
> Flow rule
> ---------
>
> A flow rule is the combination of a matching pattern with a list of actions,
> and is the basis of this API.
>
> Priorities
> ~~~~~~~~~~
>
> A priority can be assigned to a matching pattern.
>
> The default priority level is 0 and is also the highest. Support for more
> than a single priority level in hardware is not guaranteed.
>
> If a packet is matched by several filters at a given priority level, the
> outcome is undefined. It can take any path and can even be duplicated.
>
> Matching pattern
> ~~~~~~~~~~~~~~~~
>
> A matching pattern comprises any number of items of various types.
>
> Items are arranged in a list to form a matching pattern for packets. They
> fall in two categories:
>
> - Protocol matching (ANY, RAW, ETH, IPV4, IPV6, ICMP, UDP, TCP, VXLAN and so
>    on), usually associated with a specification structure. These must be
>    stacked in the same order as the protocol layers to match, starting from
>    L2.
>
> - Affecting how the pattern is processed (END, VOID, INVERT, PF, VF,
>    SIGNATURE and so on), often without a specification structure. Since they
>    are meta data that does not match packet contents, these can be specified
>    anywhere within item lists without affecting the protocol matching items.
>
> Most item specifications can be optionally paired with a mask to narrow the
> specific fields or bits to be matched.
>
> - Items are defined with ``struct rte_flow_item``.
> - Patterns are defined with ``struct rte_flow_pattern``.
>
> Example of an item specification matching an Ethernet header:
>
> +-----------------------------------------+
> | Ethernet                                |
> +==========+=========+====================+
> | ``spec`` | ``src`` | ``00:01:02:03:04`` |
> |          +---------+--------------------+
> |          | ``dst`` | ``00:2a:66:00:01`` |
> +----------+---------+--------------------+
> | ``mask`` | ``src`` | ``00:ff:ff:ff:00`` |
> |          +---------+--------------------+
> |          | ``dst`` | ``00:00:00:00:ff`` |
> +----------+---------+--------------------+
>
> Non-masked bits stand for any value, Ethernet headers with the following
> properties are thus matched:
>
> - ``src``: ``??:01:02:03:??``
> - ``dst``: ``??:??:??:??:01``
>
> Except for meta types that do not need one, ``spec`` must be a valid pointer
> to a structure of the related item type. A ``mask`` of the same type can be
> provided to tell which bits in ``spec`` are to be matched.
>
> A mask is normally only needed for ``spec`` fields matching packet data and
> is ignored otherwise. See individual item types for more information.
>
> A ``NULL`` mask pointer is allowed and is similar to matching with a full
> mask (all ones) on the ``spec`` fields supported by hardware; the remaining
> fields are ignored (all zeroes), thus there is no error checking for
> unsupported fields.
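> 
> As a C sketch, the Ethernet example above could look as follows (structure
> layouts are not final; the type names and initializers below are assumptions
> based on the tables in this document):
> 
> ::
> 
>   struct rte_flow_item_eth eth_spec = {
>       .src = "\x00\x01\x02\x03\x04", /* 00:01:02:03:04, as above */
>       .dst = "\x00\x2a\x66\x00\x01",
>   };
>   struct rte_flow_item_eth eth_mask = {
>       .src = "\x00\xff\xff\xff\x00", /* non-masked bits match any value */
>       .dst = "\x00\x00\x00\x00\xff",
>   };
>   struct rte_flow_item item = {
>       .type = RTE_FLOW_ITEM_TYPE_ETH, /* assumed enum name */
>       .spec = &eth_spec,
>       .mask = &eth_mask,
>   };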
>
> Matching pattern items for packet data must be naturally stacked (ordered
> from lowest to highest protocol layer), as in the following examples:
>
> +--------------+
> | TCPv4 as L4  |
> +===+==========+
> | 0 | Ethernet |
> +---+----------+
> | 1 | IPv4     |
> +---+----------+
> | 2 | TCP      |
> +---+----------+
>
> +----------------+
> | TCPv6 in VXLAN |
> +===+============+
> | 0 | Ethernet   |
> +---+------------+
> | 1 | IPv4       |
> +---+------------+
> | 2 | UDP        |
> +---+------------+
> | 3 | VXLAN      |
> +---+------------+
> | 4 | Ethernet   |
> +---+------------+
> | 5 | IPv6       |
> +---+------------+
> | 6 | TCP        |
> +---+------------+
>
> +-----------------------------+
> | TCPv4 as L4 with meta items |
> +===+=========================+
> | 0 | VOID                    |
> +---+-------------------------+
> | 1 | Ethernet                |
> +---+-------------------------+
> | 2 | VOID                    |
> +---+-------------------------+
> | 3 | IPv4                    |
> +---+-------------------------+
> | 4 | TCP                     |
> +---+-------------------------+
> | 5 | VOID                    |
> +---+-------------------------+
> | 6 | VOID                    |
> +---+-------------------------+
>
> The above example shows how meta items do not affect packet data matching
> items, as long as those remain stacked properly. The resulting matching
> pattern is identical to "TCPv4 as L4".
>
> +----------------+
> | UDPv6 anywhere |
> +===+============+
> | 0 | IPv6       |
> +---+------------+
> | 1 | UDP        |
> +---+------------+
>
> If supported by the PMD, omitting one or several protocol layers at the
> bottom of the stack as in the above example (missing an Ethernet
> specification) enables hardware to look anywhere in packets.
>
> It is unspecified whether the payload of supported encapsulations
> (e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
> inner, outer or both packets.
>
> +---------------------+
> | Invalid, missing L3 |
> +===+=================+
> | 0 | Ethernet        |
> +---+-----------------+
> | 1 | UDP             |
> +---+-----------------+
>
> The above pattern is invalid due to a missing L3 specification between L2
> and L4. Omitting protocol layers is only allowed at the bottom and at the
> top of the stack.
>
> Meta item types
> ~~~~~~~~~~~~~~~
>
> These do not match packet data but affect how the pattern is processed, most
> of them do not need a specification structure. This particularity allows
> them to be specified anywhere without affecting other item types.
[LC] For the meta items (END, VOID, INVERT) and some data matching types
like ANY and RAW, is it entirely the PMD's responsibility to understand
their key characteristics and to parse the header graph?
>
> ``END``
> ^^^^^^^
>
> End marker for item lists. Prevents further processing of items, thereby
> ending the pattern.
>
> - Its numeric value is **0** for convenience.
> - PMD support is mandatory.
> - Both ``spec`` and ``mask`` are ignored.
>
> +--------------------+
> | END                |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
>
> ``VOID``
> ^^^^^^^^
>
> Used as a placeholder for convenience. It is ignored and simply discarded by
> PMDs.
>
> - PMD support is mandatory.
> - Both ``spec`` and ``mask`` are ignored.
>
> +--------------------+
> | VOID               |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
>
> One usage example for this type is quickly generating rules that share a
> common prefix without reallocating memory, only by updating item types:
>
> +------------------------+
> | TCP, UDP or ICMP as L4 |
> +===+====================+
> | 0 | Ethernet           |
> +---+--------------------+
> | 1 | IPv4               |
> +---+------+------+------+
> | 2 | UDP  | VOID | VOID |
> +---+------+------+------+
> | 3 | VOID | TCP  | VOID |
> +---+------+------+------+
> | 4 | VOID | VOID | ICMP |
> +---+------+------+------+
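> 
> A sketch of this technique in C (enum and field names are assumptions):
> 
> ::
> 
>   /* Shared pattern: ETH / IPV4 / three interchangeable slots / END. */
>   struct rte_flow_item items[6];
> 
>   /* ... items[0] = ETH, items[1] = IPV4, items[5] = END ... */
> 
>   /* Select UDP as L4 without reallocating anything. */
>   items[2].type = RTE_FLOW_ITEM_TYPE_UDP;
>   items[3].type = RTE_FLOW_ITEM_TYPE_VOID;
>   items[4].type = RTE_FLOW_ITEM_TYPE_VOID;
> 
>   /* Later, reuse the same pattern for TCP. */
>   items[2].type = RTE_FLOW_ITEM_TYPE_VOID;
>   items[3].type = RTE_FLOW_ITEM_TYPE_TCP;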
>
> .. raw:: pdf
>
>     PageBreak
>
> ``INVERT``
> ^^^^^^^^^^
>
> Inverted matching, i.e. process packets that do not match the pattern.
>
> - Both ``spec`` and ``mask`` are ignored.
>
> +--------------------+
> | INVERT             |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
>
> Usage example in order to match non-TCPv4 packets only:
>
> +--------------------+
> | Anything but TCPv4 |
> +===+================+
> | 0 | INVERT         |
> +---+----------------+
> | 1 | Ethernet       |
> +---+----------------+
> | 2 | IPv4           |
> +---+----------------+
> | 3 | TCP            |
> +---+----------------+
>
> ``PF``
> ^^^^^^
>
> Matches packets addressed to the physical function of the device.
>
> - Both ``spec`` and ``mask`` are ignored.
>
> +--------------------+
> | PF                 |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
>
> ``VF``
> ^^^^^^
>
> Matches packets addressed to the given virtual function ID of the device.
>
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
>
> +----------------------------------------+
> | VF                                     |
> +==========+=========+===================+
> | ``spec`` | ``vf``  | destination VF ID |
> +----------+---------+-------------------+
> | ``mask`` | ignored                     |
> +----------+-----------------------------+
>
> ``SIGNATURE``
> ^^^^^^^^^^^^^
>
> Requests hash-based signature dispatching for this rule.
>
> Considering this is a global setting on devices that support it, all
> subsequent filter rules may have to be created with it as well.
>
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
>
> +--------------------+
> | SIGNATURE          |
> +==========+=========+
> | ``spec`` | TBD     |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
>
> .. raw:: pdf
>
>     PageBreak
>
> Data matching item types
> ~~~~~~~~~~~~~~~~~~~~~~~~
>
> Most of these are basically protocol header definitions with associated
> bitmasks. They must be specified (stacked) from lowest to highest protocol
> layer.
>
> The following list is not exhaustive as new protocols will be added in the
> future.
>
> ``ANY``
> ^^^^^^^
>
> Matches any protocol in place of the current layer; a single ANY may also
> stand for several protocol layers.
>
> This is usually specified as the first pattern item when looking for a
> protocol anywhere in a packet.
>
> - A maximum value of **0** requests matching any number of protocol layers
>    above or equal to the minimum value; a maximum value lower than the
>    minimum one is otherwise invalid.
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
>
> +-----------------------------------------------------------------------+
> | ANY                                                                   |
> +==========+=========+==================================================+
> | ``spec`` | ``min`` | minimum number of layers covered                 |
> |          +---------+--------------------------------------------------+
> |          | ``max`` | maximum number of layers covered, 0 for infinity |
> +----------+---------+--------------------------------------------------+
> | ``mask`` | ignored                                                    |
> +----------+------------------------------------------------------------+
>
> Example for VXLAN TCP payload matching regardless of outer L3 (IPv4 or IPv6)
> and L4 (UDP) both matched by the first ANY specification, and inner L3 (IPv4
> or IPv6) matched by the second ANY specification:
>
> +----------------------------------+
> | TCP in VXLAN with wildcards      |
> +===+==============================+
> | 0 | Ethernet                     |
> +---+-----+----------+---------+---+
> | 1 | ANY | ``spec`` | ``min`` | 2 |
> |   |     |          +---------+---+
> |   |     |          | ``max`` | 2 |
> +---+-----+----------+---------+---+
> | 2 | VXLAN                        |
> +---+------------------------------+
> | 3 | Ethernet                     |
> +---+-----+----------+---------+---+
> | 4 | ANY | ``spec`` | ``min`` | 1 |
> |   |     |          +---------+---+
> |   |     |          | ``max`` | 1 |
> +---+-----+----------+---------+---+
> | 5 | TCP                          |
> +---+------------------------------+
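> 
> The first ANY item above could be initialized as in this sketch (structure
> and enum names are assumptions):
> 
> ::
> 
>   struct rte_flow_item_any outer_any = { .min = 2, .max = 2 };
>   struct rte_flow_item item1 = {
>       .type = RTE_FLOW_ITEM_TYPE_ANY, /* covers outer L3 and L4 */
>       .spec = &outer_any,
>   };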
>
> .. raw:: pdf
>
>     PageBreak
>
> ``RAW``
> ^^^^^^^
>
> Matches a string of a given length at a given offset (in bytes), or anywhere
> in the payload of the current protocol layer (including L2 header if used as
> the first item in the stack).
>
> This does not increment the protocol layer count as it is not a protocol
> definition. Subsequent RAW items modulate the first absolute one with
> relative offsets.
>
> - Using **-1** as the ``offset`` of the first RAW item makes its absolute
>    offset not fixed, i.e. the pattern is searched everywhere.
> - ``mask`` only affects the pattern.
[LC] The RAW matching type allows an offset & length, which supports anchor
setting and string matching. It is not defined for a user-defined packet
layout. Sometimes, comparing raw payload data after a header requires an
{offset, length} pair. One typical case is 5-tuple matching: the 'PORT' of
the transport layer is at an offset from the IP header. This cannot be
addressed by IP/ANY, as it requires extracting a key from a field inside ANY.

>
> +--------------------------------------------------------------+
> | RAW                                                          |
> +==========+=============+=====================================+
> | ``spec`` | ``offset``  | absolute or relative pattern offset |
> |          +-------------+-------------------------------------+
> |          | ``length``  | pattern length                      |
> |          +-------------+-------------------------------------+
> |          | ``pattern`` | byte string of the above length     |
> +----------+-------------+-------------------------------------+
> | ``mask`` | ``offset``  | ignored                             |
> |          +-------------+-------------------------------------+
> |          | ``length``  | ignored                             |
> |          +-------------+-------------------------------------+
> |          | ``pattern`` | bitmask with the same byte length   |
> +----------+-------------+-------------------------------------+
>
> Example pattern looking for several strings at various offsets of a UDP
> payload, using combined RAW items:
>
> +------------------------------------------+
> | UDP payload matching                     |
> +===+======================================+
> | 0 | Ethernet                             |
> +---+--------------------------------------+
> | 1 | IPv4                                 |
> +---+--------------------------------------+
> | 2 | UDP                                  |
> +---+-----+----------+-------------+-------+
> | 3 | RAW | ``spec`` | ``offset``  | -1    |
> |   |     |          +-------------+-------+
> |   |     |          | ``length``  | 3     |
> |   |     |          +-------------+-------+
> |   |     |          | ``pattern`` | "foo" |
> +---+-----+----------+-------------+-------+
> | 4 | RAW | ``spec`` | ``offset``  | 20    |
> |   |     |          +-------------+-------+
> |   |     |          | ``length``  | 3     |
> |   |     |          +-------------+-------+
> |   |     |          | ``pattern`` | "bar" |
> +---+-----+----------+-------------+-------+
> | 5 | RAW | ``spec`` | ``offset``  | -30   |
> |   |     |          +-------------+-------+
> |   |     |          | ``length``  | 3     |
> |   |     |          +-------------+-------+
> |   |     |          | ``pattern`` | "baz" |
> +---+-----+----------+-------------+-------+
>
> This translates to:
>
> - Locate "foo" in UDP payload, remember its offset.
> - Check "bar" at "foo"'s offset plus 20 bytes.
> - Check "baz" at "foo"'s offset minus 30 bytes.
>
> .. raw:: pdf
>
>     PageBreak
>
> ``ETH``
> ^^^^^^^
>
> Matches an Ethernet header.
>
> - ``dst``: destination MAC.
> - ``src``: source MAC.
> - ``type``: EtherType.
> - ``tags``: number of 802.1Q/ad tags defined.
> - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
>
>   - ``tpid``: Tag protocol identifier.
>   - ``tci``: Tag control information.
>
> ``IPV4``
> ^^^^^^^^
>
> Matches an IPv4 header.
>
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tos``: ToS/DSCP field.
> - ``ttl``: TTL field.
> - ``proto``: protocol number for the next layer.
>
> ``IPV6``
> ^^^^^^^^
>
> Matches an IPv6 header.
>
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tc``: traffic class field.
> - ``nh``: Next header field (protocol).
> - ``hop_limit``: hop limit field (TTL).
>
> ``ICMP``
> ^^^^^^^^
>
> Matches an ICMP header.
>
> - TBD.
>
> ``UDP``
> ^^^^^^^
>
> Matches a UDP header.
>
> - ``sport``: source port.
> - ``dport``: destination port.
> - ``length``: UDP length.
> - ``checksum``: UDP checksum.
>
> .. raw:: pdf
>
>     PageBreak
>
> ``TCP``
> ^^^^^^^
>
> Matches a TCP header.
>
> - ``sport``: source port.
> - ``dport``: destination port.
> - All other TCP fields and bits.
>
> ``VXLAN``
> ^^^^^^^^^
>
> Matches a VXLAN header.
>
> - TBD.
>
> .. raw:: pdf
>
>     PageBreak
>
> Actions
> ~~~~~~~
>
> Each possible action is represented by a type. Some have associated
> configuration structures. Several actions combined in a list can be assigned
> to a flow rule. That list is not ordered.
>
> At least one action must be defined in a filter rule in order to do
> something with matched packets.
>
> - Actions are defined with ``struct rte_flow_action``.
> - A list of actions is defined with ``struct rte_flow_actions``.
>
> They fall in three categories:
>
> - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
>    processing matched packets by subsequent flow rules, unless overridden
>    with PASSTHRU.
>
> - Non terminating actions (PASSTHRU, DUP) that leave matched packets up for
>    additional processing by subsequent flow rules.
>
> - Other non terminating meta actions that do not affect the fate of packets
>    (END, VOID, ID, COUNT).
>
> When several actions are combined in a flow rule, they should all have
> different types (e.g. dropping a packet twice is not possible). However,
> considering the VOID type is an exception to this rule, the defined behavior
> is for PMDs to only take into account the last action of a given type found
> in the list. PMDs still perform error checking on the entire list.
>
> *Note that PASSTHRU is the only action able to override a terminating rule.*
[LC] I'm wondering how to address the metadata carried by the mbuf; there's
no mention of it here.
For packets that hit one specific flow, there's usually something for the
CPU to identify the flow.
FDIR and RSS, for example, provide an ID or key in the mbuf. In addition,
some metadata may be pointed to by the mbuf's userdata field.
Any view on it?

>
> .. raw:: pdf
>
>     PageBreak
>
> Example of an action that redirects packets to queue index 10:
>
> +----------------+
> | QUEUE          |
> +===========+====+
> | ``queue`` | 10 |
> +-----------+----+
>
> Examples of action lists; their order is not significant, and applications
> must consider all actions to be performed simultaneously:
>
> +----------------+
> | Count and drop |
> +=======+========+
> | COUNT |        |
> +-------+--------+
> | DROP  |        |
> +-------+--------+
>
> +--------------------------+
> | Tag, count and redirect  |
> +=======+===========+======+
> | ID    | ``id``    | 0x2a |
> +-------+-----------+------+
> | COUNT |                  |
> +-------+-----------+------+
> | QUEUE | ``queue`` | 10   |
> +-------+-----------+------+
>
> +-----------------------+
> | Redirect to queue 5   |
> +=======+===============+
> | DROP  |               |
> +-------+-----------+---+
> | QUEUE | ``queue`` | 5 |
> +-------+-----------+---+
>
> In the above example, considering both actions are performed simultaneously,
> the end result is that only QUEUE has any effect.
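> 
> The "tag, count and redirect" list above could be built as in this sketch
> (enum, structure and field names are assumptions):
> 
> ::
> 
>   struct rte_flow_action_id id = { .id = 0x2a };
>   struct rte_flow_action_queue queue = { .queue = 10 };
>   struct rte_flow_action list[] = {
>       { .type = RTE_FLOW_ACTION_TYPE_ID, .conf = &id },
>       { .type = RTE_FLOW_ACTION_TYPE_COUNT },
>       { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
>       { .type = RTE_FLOW_ACTION_TYPE_END },
>   };
>   /* The list itself would then be wrapped in struct rte_flow_actions. */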
>
> +-----------------------+
> | Redirect to queue 3   |
> +=======+===========+===+
> | QUEUE | ``queue`` | 5 |
> +-------+-----------+---+
> | VOID  |               |
> +-------+-----------+---+
> | QUEUE | ``queue`` | 3 |
> +-------+-----------+---+
>
> As previously described, only the last action of a given type found in the
> list is taken into account. The above example also shows that VOID is
> ignored.
>
> .. raw:: pdf
>
>     PageBreak
>
> Action types
> ~~~~~~~~~~~~
>
> Common action types are described in this section. Like pattern item types,
> this list is not exhaustive as new actions will be added in the future.
>
> ``END`` (action)
> ^^^^^^^^^^^^^^^^
>
> End marker for action lists. Prevents further processing of actions, thereby
> ending the list.
>
> - Its numeric value is **0** for convenience.
> - PMD support is mandatory.
> - No configurable property.
>
> +---------------+
> | END           |
> +===============+
> | no properties |
> +---------------+
>
> ``VOID`` (action)
> ^^^^^^^^^^^^^^^^^
>
> Used as a placeholder for convenience. It is ignored and simply discarded by
> PMDs.
>
> - PMD support is mandatory.
> - No configurable property.
>
> +---------------+
> | VOID          |
> +===============+
> | no properties |
> +---------------+
>
> ``PASSTHRU``
> ^^^^^^^^^^^^
>
> Leaves packets up for additional processing by subsequent flow rules. This
> is the default when a rule does not contain a terminating action, but can be
> specified to force a rule to become non-terminating.
>
> - No configurable property.
>
> +---------------+
> | PASSTHRU      |
> +===============+
> | no properties |
> +---------------+
>
> Example to copy a packet to a queue and continue processing by subsequent
> flow rules:
>
> +--------------------------+
> | Copy to queue 8          |
> +==========+===============+
> | PASSTHRU |               |
> +----------+-----------+---+
> | QUEUE    | ``queue`` | 8 |
> +----------+-----------+---+
>
> ``ID``
> ^^^^^^
>
> Attaches a 32-bit value to packets.
>
> +----------------------------------------------+
> | ID                                           |
> +========+=====================================+
> | ``id`` | 32-bit value to return with packets |
> +--------+-------------------------------------+
>
> .. raw:: pdf
>
>     PageBreak
>
> ``QUEUE``
> ^^^^^^^^^
>
> Assigns packets to a given queue index.
>
> - Terminating by default.
>
> +--------------------------------+
> | QUEUE                          |
> +===========+====================+
> | ``queue`` | queue index to use |
> +-----------+--------------------+
>
> ``DROP``
> ^^^^^^^^
>
> Drop packets.
>
> - No configurable property.
> - Terminating by default.
> - PASSTHRU overrides this action if both are specified.
>
> +---------------+
> | DROP          |
> +===============+
> | no properties |
> +---------------+
>
> ``COUNT``
> ^^^^^^^^^
>
> Enables a hit counter for this rule.
>
> This counter can be retrieved and reset through ``rte_flow_query()``, see
> ``struct rte_flow_query_count``.
>
> - Counters can be retrieved with ``rte_flow_query()``.
> - No configurable property.
>
> +---------------+
> | COUNT         |
> +===============+
> | no properties |
> +---------------+
>
> Query structure to retrieve and reset the flow rule hits counter:
>
> +------------------------------------------------+
> | COUNT query                                    |
> +===========+=====+==============================+
> | ``reset`` | in  | reset counter after query    |
> +-----------+-----+------------------------------+
> | ``hits``  | out | number of hits for this flow |
> +-----------+-----+------------------------------+
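> 
> A sketch of the intended usage (the exact ``rte_flow_query()`` signature is
> not defined yet; the one below is an assumption):
> 
> ::
> 
>   struct rte_flow_query_count count = { .reset = 1 };
> 
>   if (rte_flow_query(port_id, flow, RTE_FLOW_ACTION_TYPE_COUNT,
>                      &count) == 0)
>       printf("hits: %" PRIu64 "\n", count.hits);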
>
> ``DUP``
> ^^^^^^^
>
> Duplicates packets to a given queue index.
>
> This is normally combined with QUEUE; however, when used alone, it is
> actually similar to QUEUE + PASSTHRU.
>
> - Non-terminating by default.
>
> +------------------------------------------------+
> | DUP                                            |
> +===========+====================================+
> | ``queue`` | queue index to duplicate packet to |
> +-----------+------------------------------------+
>
> .. raw:: pdf
>
>     PageBreak
>
> ``RSS``
> ^^^^^^^
>
> Similar to QUEUE, except RSS is additionally performed on packets to spread
> them among several queues according to the provided parameters.
>
> - Terminating by default.
>
> +---------------------------------------------+
> | RSS                                         |
> +==============+==============================+
> | ``rss_conf`` | RSS parameters               |
> +--------------+------------------------------+
> | ``queues``   | number of entries in queue[] |
> +--------------+------------------------------+
> | ``queue[]``  | queue indices to use         |
> +--------------+------------------------------+
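>
> A sketch of building this action's configuration (assuming, per the draft
> header, that ``queue[]`` is a flexible array member and that ``rss_conf``
> points to a ``struct rte_eth_rss_conf``; ``dev_rss_conf`` is a hypothetical
> application-side variable):
>
> ::
>
>   struct rte_flow_action_rss *rss;
>
>   rss = calloc(1, sizeof(*rss) + 2 * sizeof(rss->queue[0]));
>   rss->rss_conf = &dev_rss_conf; /* parameters used to spread packets */
>   rss->queues = 2;               /* two entries in queue[] */
>   rss->queue[0] = 6;
>   rss->queue[1] = 7;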
>
> ``PF`` (action)
> ^^^^^^^^^^^^^^^
>
> Redirects packets to the physical function (PF) of the current device.
>
> - No configurable property.
> - Terminating by default.
>
> +---------------+
> | PF            |
> +===============+
> | no properties |
> +---------------+
>
> ``VF`` (action)
> ^^^^^^^^^^^^^^^
>
> Redirects packets to the virtual function (VF) of the current device with
> the specified ID.
>
> - Terminating by default.
>
> +---------------------------------------+
> | VF                                    |
> +========+==============================+
> | ``id`` | VF ID to redirect packets to |
> +--------+------------------------------+
>
> Planned types
> ~~~~~~~~~~~~~
>
> Other action types are planned but not defined yet. These actions will add
> the ability to alter matching packets in several ways, such as performing
> encapsulation/decapsulation of tunnel headers on specific flows.
>
> .. raw:: pdf
>
>     PageBreak
>
> Rules management
> ----------------
>
> A simple API with only four functions is provided to fully manage flows.
>
> Each created flow rule is associated with an opaque, PMD-specific handle
> pointer. The application is responsible for keeping it until the rule is
> destroyed.
>
> Flow rules are defined with ``struct rte_flow``.
>
> Validation
> ~~~~~~~~~~
>
> Given that expressing a definite set of device capabilities with this API is
> not practical, a dedicated function is provided to check if a flow rule is
> supported and can be created.
>
> ::
>
>   int
>   rte_flow_validate(uint8_t port_id,
>                     const struct rte_flow_pattern *pattern,
>                     const struct rte_flow_actions *actions);
>
> While this function has no effect on the target device, the flow rule is
> validated against its current configuration state and the returned value
> should be considered valid by the caller for that state only.
>
> The returned value is guaranteed to remain valid only as long as no
> successful calls to rte_flow_create() or rte_flow_destroy() are made in the
> meantime and no device parameter affecting flow rules is modified in any
> way, due to possible collisions or resource limitations (although in such
> cases ``EINVAL`` should not be returned).
>
> Arguments:
>
> - ``port_id``: port identifier of Ethernet device.
> - ``pattern``: pattern specification to check.
> - ``actions``: actions associated with the flow definition.
>
> Return value:
>
> - **0** if the flow rule is valid and can be created. A negative errno
>    value otherwise (``rte_errno`` is also set); the following errors are
>    defined:
> - ``-EINVAL``: unknown or invalid rule specification.
> - ``-ENOTSUP``: valid but unsupported rule specification (e.g. partial masks
>    are unsupported).
> - ``-EEXIST``: collision with an existing rule.
> - ``-ENOMEM``: not enough resources.
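>
> Example probing for rule support at initialization time (a sketch;
> ``pattern`` and ``actions`` are assumed to have been defined beforehand):
>
> ::
>
>   if (rte_flow_validate(port_id, &pattern, &actions) == 0) {
>       /* rule is supported in the current configuration state */
>   } else if (rte_errno == ENOTSUP) {
>       /* rule is valid but cannot be handled by this device */
>   } else {
>       /* invalid rule, collision with an existing rule or lack of resources */
>   }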
>
> .. raw:: pdf
>
>     PageBreak
>
> Creation
> ~~~~~~~~
>
> Creating a flow rule is similar to validating one, except the rule is
> actually created.
>
> ::
>
>   struct rte_flow *
>   rte_flow_create(uint8_t port_id,
>                   const struct rte_flow_pattern *pattern,
>                   const struct rte_flow_actions *actions);
>
> Arguments:
>
> - ``port_id``: port identifier of Ethernet device.
> - ``pattern``: pattern specification to add.
> - ``actions``: actions associated with the flow definition.
>
> Return value:
>
> A valid flow pointer in case of success, NULL otherwise and ``rte_errno`` is
> set to the positive version of one of the error codes defined for
> ``rte_flow_validate()``.
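>
> Example (a sketch, under the same assumptions as the
> ``rte_flow_validate()`` example above):
>
> ::
>
>   struct rte_flow *flow;
>
>   flow = rte_flow_create(port_id, &pattern, &actions);
>   if (flow == NULL)
>       printf("cannot create flow rule: %s\n", strerror(rte_errno));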
>
> Destruction
> ~~~~~~~~~~~
>
> Flow rule destruction is not automatic: a queue should not be released
> while any flow rules are still attached to it. Applications must take care
> of performing this step before releasing resources.
>
> ::
>
>   int
>   rte_flow_destroy(uint8_t port_id,
>                    struct rte_flow *flow);
>
>
> Failure to destroy a flow rule may occur when other flow rules depend on it,
> and destroying it would result in an inconsistent state.
>
> This function is only guaranteed to succeed if flow rules are destroyed in
> reverse order of their creation.
>
> Arguments:
>
> - ``port_id``: port identifier of Ethernet device.
> - ``flow``: flow rule to destroy.
>
> Return value:
>
> - **0** on success, a negative errno value otherwise and ``rte_errno`` is
>    set.
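>
> Example releasing all rules in reverse order of creation before freeing
> queues (a sketch; ``flows[]`` and ``nb_flows`` stand for hypothetical
> application-side bookkeeping):
>
> ::
>
>   while (nb_flows)
>       if (rte_flow_destroy(port_id, flows[--nb_flows]))
>           printf("cannot destroy flow rule: %s\n", strerror(rte_errno));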
>
> .. raw:: pdf
>
>     PageBreak
>
> Query
> ~~~~~
>
> Query an existing flow rule.
>
> This function allows retrieving flow-specific data such as counters. Data
> is gathered by special actions which must be present in the flow rule
> definition.
>
> ::
>
>   int
>   rte_flow_query(uint8_t port_id,
>                  struct rte_flow *flow,
>                  enum rte_flow_action_type action,
>                  void *data);
>
> Arguments:
>
> - ``port_id``: port identifier of Ethernet device.
> - ``flow``: flow rule to query.
> - ``action``: action type to query.
> - ``data``: pointer to storage for the associated query data type.
>
> Return value:
>
> - **0** on success, a negative errno value otherwise and ``rte_errno`` is
>    set.
>
> .. raw:: pdf
>
>     PageBreak
>
> Behavior
> --------
>
> - API operations are synchronous and blocking (``EAGAIN`` cannot be
>    returned).
>
> - There is no provision for reentrancy/multi-thread safety, although nothing
>    should prevent different devices from being configured at the same
>    time. PMDs may protect their control path functions accordingly.
>
> - Stopping the data path (TX/RX) should not be necessary when managing flow
>    rules. If this cannot be achieved naturally or with workarounds (such as
>    temporarily replacing the burst function pointers), an appropriate error
>    code must be returned (``EBUSY``).
>
> - PMDs, not applications, are responsible for maintaining flow rule
>    configuration when stopping and restarting a port or performing other
>    actions which may affect them. Flow rules can only be destroyed
>    explicitly.
>
> .. raw:: pdf
>
>     PageBreak
>
> Compatibility
> -------------
>
> No known hardware implementation supports all the features described in this
> document.
>
> Unsupported features or combinations are not expected to be fully emulated
> in software by PMDs for performance reasons. Partially supported features
> may be completed in software as long as hardware performs most of the work
> (such as queue redirection and packet recognition).
>
> However, PMDs are expected to do their best to satisfy application requests
> by working around hardware limitations as long as doing so does not affect
> the behavior of existing flow rules.
>
> The following sections provide a few examples of such cases; they are based
> on limitations built into the previous APIs.
>
> Global bitmasks
> ~~~~~~~~~~~~~~~
>
> Each flow rule comes with its own, per-layer bitmasks, while hardware may
> support only a single, device-wide bitmask for a given layer type, so that
> two IPv4 rules cannot use different bitmasks.
>
> The expected behavior in this case is that PMDs automatically configure
> global bitmasks according to the needs of the first created flow rule.
>
> Subsequent rules are allowed only if their bitmasks match those; otherwise
> the ``EEXIST`` error code should be returned.
>
> Unsupported layer types
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> Many protocols can be simulated by crafting patterns with the `RAW`_ type.
>
> PMDs can rely on this capability to simulate support for protocols with
> fixed headers not directly recognized by hardware.
>
> ``ANY`` pattern item
> ~~~~~~~~~~~~~~~~~~~~
>
> This pattern item stands for anything, which can be difficult to translate
> to something hardware would understand, particularly if followed by more
> specific types.
>
> Consider the following pattern:
>
> +---+--------------------------------+
> | 0 | ETHER                          |
> +---+--------------------------------+
> | 1 | ANY (``min`` = 1, ``max`` = 1) |
> +---+--------------------------------+
> | 2 | TCP                            |
> +---+--------------------------------+
>
> Knowing that TCP does not make sense with something other than IPv4 or
> IPv6 as L3, such a pattern may be translated to two flow rules instead:
>
> +---+--------------------+
> | 0 | ETHER              |
> +---+--------------------+
> | 1 | IPV4 (zeroed mask) |
> +---+--------------------+
> | 2 | TCP                |
> +---+--------------------+
>
> +---+--------------------+
> | 0 | ETHER              |
> +---+--------------------+
> | 1 | IPV6 (zeroed mask) |
> +---+--------------------+
> | 2 | TCP                |
> +---+--------------------+
>
> Note that as soon as an ANY rule covers several layers, this approach may
> yield a large number of hidden flow rules. It is thus suggested to only
> support the most common scenarios (anything as L2 and/or L3).
>
> .. raw:: pdf
>
>     PageBreak
>
> Unsupported actions
> ~~~~~~~~~~~~~~~~~~~
>
> - When combined with a `QUEUE`_ action, packet counting (`COUNT`_) and
>    tagging (`ID`_) may be implemented in software as long as the target queue
>    is used by a single rule.
>
> - A rule specifying both `DUP`_ + `QUEUE`_ may be translated to two hidden
>    rules combining `QUEUE`_ and `PASSTHRU`_.
>
> - When a single target queue is provided, `RSS`_ can also be implemented
>    through `QUEUE`_.
>
> Flow rules priority
> ~~~~~~~~~~~~~~~~~~~
>
> While it would naturally make sense, flow rules cannot be assumed to be
> processed by hardware in the same order as their creation for several
> reasons:
>
> - They may be managed internally as a tree or a hash table instead of a
>    list.
> - Removing a flow rule before adding another one can either put the new rule
>    at the end of the list or reuse a freed entry.
> - Duplication may occur when packets are matched by several rules.
>
> For overlapping rules (particularly in order to use the `PASSTHRU`_ action)
> predictable behavior is only guaranteed by using different priority levels.
>
> Priority levels are not necessarily implemented in hardware, or may be
> severely limited (e.g. a single priority bit).
>
> For these reasons, priority levels may be implemented purely in software by
> PMDs.
>
> - For devices expecting flow rules to be added in the correct order, PMDs
>    may destroy and re-create existing rules after adding a new one with
>    a higher priority.
>
> - A configurable number of dummy or empty rules can be created at
>    initialization time to save high priority slots for later.
>
> - In order to save priority levels, PMDs may evaluate whether rules are
>    likely to collide and adjust their priority accordingly.
>
> .. raw:: pdf
>
>     PageBreak
>
> API migration
> =============
>
> Exhaustive list of deprecated filter types and how to convert them to
> generic flow rules.
>
> ``MACVLAN`` to ``ETH`` → ``VF``, ``PF``
> ---------------------------------------
>
> `MACVLAN`_ can be translated to a basic `ETH`_ flow rule with a `VF
> (action)`_ or `PF (action)`_ terminating action.
>
> +------------------------------------+
> | MACVLAN                            |
> +--------------------------+---------+
> | Pattern                  | Actions |
> +===+=====+==========+=====+=========+
> | 0 | ETH | ``spec`` | any | VF,     |
> |   |     +----------+-----+ PF      |
> |   |     | ``mask`` | any |         |
> +---+-----+----------+-----+---------+
>
> ``ETHERTYPE`` to ``ETH`` → ``QUEUE``, ``DROP``
> ----------------------------------------------
>
> `ETHERTYPE`_ is basically an `ETH`_ flow rule with `QUEUE`_ or `DROP`_ as
> a terminating action.
>
> +------------------------------------+
> | ETHERTYPE                          |
> +--------------------------+---------+
> | Pattern                  | Actions |
> +===+=====+==========+=====+=========+
> | 0 | ETH | ``spec`` | any | QUEUE,  |
> |   |     +----------+-----+ DROP    |
> |   |     | ``mask`` | any |         |
> +---+-----+----------+-----+---------+
>
> ``FLEXIBLE`` to ``RAW`` → ``QUEUE``
> -----------------------------------
>
> `FLEXIBLE`_ can be translated to one `RAW`_ pattern with `QUEUE`_ as the
> terminating action and a defined priority level.
>
> +------------------------------------+
> | FLEXIBLE                           |
> +--------------------------+---------+
> | Pattern                  | Actions |
> +===+=====+==========+=====+=========+
> | 0 | RAW | ``spec`` | any | QUEUE   |
> |   |     +----------+-----+         |
> |   |     | ``mask`` | any |         |
> +---+-----+----------+-----+---------+
>
> ``SYN`` to ``TCP`` → ``QUEUE``
> ------------------------------
>
> `SYN`_ is a `TCP`_ rule with only the ``syn`` bit enabled and masked, and
> `QUEUE`_ as the terminating action.
>
> Priority level can be set to simulate the high priority bit.
>
> +---------------------------------------------+
> | SYN                                         |
> +-----------------------------------+---------+
> | Pattern                           | Actions |
> +===+======+==========+=============+=========+
> | 0 | ETH  | ``spec`` | N/A         | QUEUE   |
> |   |      +----------+-------------+         |
> |   |      | ``mask`` | empty       |         |
> +---+------+----------+-------------+         |
> | 1 | IPV4 | ``spec`` | N/A         |         |
> |   |      +----------+-------------+         |
> |   |      | ``mask`` | empty       |         |
> +---+------+----------+-------------+         |
> | 2 | TCP  | ``spec`` | ``syn`` = 1 |         |
> |   |      +----------+-------------+         |
> |   |      | ``mask`` | ``syn`` = 1 |         |
> +---+------+----------+-------------+---------+
>
> ``NTUPLE`` to ``IPV4``, ``TCP``, ``UDP`` → ``QUEUE``
> ----------------------------------------------------
>
> `NTUPLE`_ is similar to specifying an empty L2, `IPV4`_ as L3 with `TCP`_ or
> `UDP`_ as L4 and `QUEUE`_ as the terminating action.
>
> A priority level can be specified as well.
>
> +---------------------------------------+
> | NTUPLE                                |
> +-----------------------------+---------+
> | Pattern                     | Actions |
> +===+======+==========+=======+=========+
> | 0 | ETH  | ``spec`` | N/A   | QUEUE   |
> |   |      +----------+-------+         |
> |   |      | ``mask`` | empty |         |
> +---+------+----------+-------+         |
> | 1 | IPV4 | ``spec`` | any   |         |
> |   |      +----------+-------+         |
> |   |      | ``mask`` | any   |         |
> +---+------+----------+-------+         |
> | 2 | TCP, | ``spec`` | any   |         |
> |   | UDP  +----------+-------+         |
> |   |      | ``mask`` | any   |         |
> +---+------+----------+-------+---------+
>
> ``TUNNEL`` to ``ETH``, ``IPV4``, ``IPV6``, ``VXLAN`` (or other) → ``QUEUE``
> ---------------------------------------------------------------------------
>
> `TUNNEL`_ matches common IPv4 and IPv6 L3/L4-based tunnel types.
>
> In the following table, `ANY`_ is used to cover the optional L4.
>
> +------------------------------------------------+
> | TUNNEL                                         |
> +--------------------------------------+---------+
> | Pattern                              | Actions |
> +===+=========+==========+=============+=========+
> | 0 | ETH     | ``spec`` | any         | QUEUE   |
> |   |         +----------+-------------+         |
> |   |         | ``mask`` | any         |         |
> +---+---------+----------+-------------+         |
> | 1 | IPV4,   | ``spec`` | any         |         |
> |   | IPV6    +----------+-------------+         |
> |   |         | ``mask`` | any         |         |
> +---+---------+----------+-------------+         |
> | 2 | ANY     | ``spec`` | ``min`` = 0 |         |
> |   |         |          +-------------+         |
> |   |         |          | ``max`` = 0 |         |
> |   |         +----------+-------------+         |
> |   |         | ``mask`` | N/A         |         |
> +---+---------+----------+-------------+         |
> | 3 | VXLAN,  | ``spec`` | any         |         |
> |   | GENEVE, +----------+-------------+         |
> |   | TEREDO, | ``mask`` | any         |         |
> |   | NVGRE,  |          |             |         |
> |   | GRE,    |          |             |         |
> |   | ...     |          |             |         |
> +---+---------+----------+-------------+---------+
>
> .. raw:: pdf
>
>     PageBreak
>
> ``FDIR`` to most item types → ``QUEUE``, ``DROP``, ``PASSTHRU``
> ---------------------------------------------------------------
>
> `FDIR`_ is more complex than any other type; there are several methods to
> emulate its functionality. It is summarized for the most part in the table
> below.
>
> A few features are intentionally not supported:
>
> - The ability to configure the matching input set and masks for the entire
>    device; PMDs should take care of this automatically according to flow
>    rules.
>
> - Returning four or eight bytes of matched data when using flex bytes
>    filtering. Although a specific action could implement it, it conflicts
>    with the much more useful 32-bit tagging on devices that support it.
>
> - Side effects on RSS processing of the entire device. Flow rules that
>    conflict with the current device configuration should not be
>    allowed. Similarly, device configuration should not be allowed when it
>    affects existing flow rules.
>
> - Device modes of operation. "none" is unsupported since filtering cannot be
>    disabled as long as a flow rule is present.
>
> - "MAC VLAN" or "tunnel" perfect matching modes should be automatically set
>    according to the created flow rules.
>
> +----------------------------------------------+
> | FDIR                                         |
> +---------------------------------+------------+
> | Pattern                         | Actions    |
> +===+============+==========+=====+============+
> | 0 | ETH,       | ``spec`` | any | QUEUE,     |
> |   | RAW        +----------+-----+ DROP,      |
> |   |            | ``mask`` | any | PASSTHRU   |
> +---+------------+----------+-----+------------+
> | 1 | IPV4,      | ``spec`` | any | ID         |
> |   | IPV6       +----------+-----+ (optional) |
> |   |            | ``mask`` | any |            |
> +---+------------+----------+-----+            |
> | 2 | TCP,       | ``spec`` | any |            |
> |   | UDP,       +----------+-----+            |
> |   | SCTP       | ``mask`` | any |            |
> +---+------------+----------+-----+            |
> | 3 | VF,        | ``spec`` | any |            |
> |   | PF,        +----------+-----+            |
> |   | SIGNATURE  | ``mask`` | any |            |
> |   | (optional) |          |     |            |
> +---+------------+----------+-----+------------+
>
> ``HASH``
> ~~~~~~~~
>
> Hashing configuration is set per rule through the `SIGNATURE`_ item.
>
> Since it is usually a global device setting, all flow rules created with
> this item may have to share the same specification.
>
> ``L2_TUNNEL`` to ``VOID`` → ``VXLAN`` (or others)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> All packets are matched. This type alters incoming packets to encapsulate
> them in a chosen tunnel type, optionally redirecting them to a VF as well.
>
> The destination pool for tag-based forwarding can be emulated with other
> flow rules using `DUP`_ as the action.
>
> +----------------------------------------+
> | L2_TUNNEL                              |
> +---------------------------+------------+
> | Pattern                   | Actions    |
> +===+======+==========+=====+============+
> | 0 | VOID | ``spec`` | N/A | VXLAN,     |
> |   |      |          |     | GENEVE,    |
> |   |      |          |     | ...        |
> |   |      +----------+-----+------------+
> |   |      | ``mask`` | N/A | VF         |
> |   |      |          |     | (optional) |
> +---+------+----------+-----+------------+
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-05 18:16 Adrien Mazarguil
  2016-07-07  7:14 ` Lu, Wenzhuo
@ 2016-07-07 23:15 ` Chandran, Sugesh
  2016-07-08 13:03   ` Adrien Mazarguil
  2016-07-08 11:11 ` Liang, Cunming
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Chandran, Sugesh @ 2016-07-07 23:15 UTC (permalink / raw)
  To: Adrien Mazarguil, dev
  Cc: Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Lu, Wenzhuo, Jan Medala,
	John Daley, Chen, Jing D, Ananyev, Konstantin, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, De Lara Guarch,
	Pablo, Olga Shern

Hi Adrien,

Thank you for proposing this. It would be really useful for applications such as OVS-DPDK.
Please find my comments and questions inline below prefixed with [Sugesh]. Most of them are from the perspective of enabling these APIs in applications such as OVS-DPDK.

Regards
_Sugesh


> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Tuesday, July 5, 2016 7:17 PM
> To: dev@dpdk.org
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Zhang, Helin
> <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Rasesh
> Mody <rasesh.mody@qlogic.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>;
> Jan Medala <jan@semihalf.com>; John Daley <johndale@cisco.com>; Chen,
> Jing D <jing.d.chen@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Matej Vido <matejvido@gmail.com>;
> Alejandro Lucero <alejandro.lucero@netronome.com>; Sony Chacko
> <sony.chacko@qlogic.com>; Jerin Jacob
> <jerin.jacob@caviumnetworks.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olga Shern <olgas@mellanox.com>
> Subject: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
> 
> Hi All,
> 
> First, forgive me for this large message, I know our mailboxes already
> suffer quite a bit from the amount of traffic on this ML.
> 
> This is not exactly yet another thread about how flow director should be
> extended, rather about a brand new API to handle filtering and
> classification for incoming packets in the most PMD-generic and
> application-friendly fashion we can come up with. Reasons described below.
> 
> I think this topic is important enough to include both the users of this API
> as well as PMD maintainers. So far I have CC'ed librte_ether (especially
> rte_eth_ctrl.h contributors), testpmd and PMD maintainers (with and without
> a .filter_ctrl implementation), but if you know application maintainers
> other than testpmd who use FDIR or might be interested in this discussion,
> feel free to add them.
> 
> The issues we found with the current approach are already summarized in the
> following document, but here is a quick summary for TL;DR folks:
> 
> - PMDs do not expose a common set of filter types and even when they do,
>   their behavior more or less differs.
> 
> - Applications need to determine and adapt to device-specific limitations
>   and quirks on their own, without help from PMDs.
> 
> - Writing an application that creates flow rules targeting all devices
>   supported by DPDK is thus difficult, if not impossible.
> 
> - The current API has too many unspecified areas (particularly regarding
>   side effects of flow rules) that make PMD implementation tricky.
> 
> This RFC API handles everything currently supported by .filter_ctrl, the
> idea being to reimplement all of these to make them fully usable by
> applications in a more generic and well defined fashion. It has a very small
> set of mandatory features and an easy method to let applications probe for
> supported capabilities.
> 
> The only downside is more work for the software control side of PMDs
> because they have to adapt to the API instead of the reverse. I think
> helpers can be added to EAL to assist with this.
> 
> HTML version:
> 
>  https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
> 
> PDF version:
> 
>  https://rawgit.com/6WIND/rte_flow/master/rte_flow.pdf
> 
> Related draft header file (for reference while reading the specification):
> 
>  https://raw.githubusercontent.com/6WIND/rte_flow/master/rte_flow.h
> 
> Git tree for completeness (latest .rst version can be retrieved from here):
> 
>  https://github.com/6WIND/rte_flow
> 
> What follows is the ReST source of the above, for inline comments and
> discussion. I intend to update that specification accordingly.
> 
> ========================
> Generic filter interface
> ========================
> 
> .. footer::
> 
>    v0.6
> 
> .. contents::
> .. sectnum::
> .. raw:: pdf
> 
>    PageBreak
> 
> Overview
> ========
> 
> DPDK provides several competing interfaces added over time to perform packet
> matching and related actions such as filtering and classification.
> 
> They must be extended to implement the features supported by newer devices
> in order to expose them to applications; however, the current design has
> several drawbacks:
> 
> - Complicated filter combinations which have not been hard-coded cannot be
>   expressed.
> - Prone to API/ABI breakage when new features must be added to an existing
>   filter type, which frequently happens.
> 
> From an application point of view:
> 
> - Having disparate interfaces, all optional and lacking in features, does
>   not make this API easy to use.
> - Seemingly arbitrary built-in limitations of filter types based on the
>   device they were initially designed for.
> - Undefined relationship between different filter types.
> - High complexity, considerable undocumented and/or undefined behavior.
> 
> Considering the growing number of devices supported by DPDK, adding a new
> filter type each time a new feature must be implemented is not sustainable
> in the long term. Applications not written to target a specific device
> cannot really benefit from such an API.
> 
> For these reasons, this document defines an extensible unified API that
> encompasses and supersedes these legacy filter types.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Current API
> ===========
> 
> Rationale
> ---------
> 
> The reason several competing (and mostly overlapping) filtering APIs are
> present in DPDK is its nature as a thin layer between hardware and
> software.
> 
> Each subsequent interface has been added to better match the capabilities
> and limitations of the latest supported device, which usually happened to
> need an incompatible configuration approach. Because of this, many ended up
> device-centric and not usable by applications that were not written for
> that particular device.
> 
> This document is not the first attempt to address this proliferation issue,
> in fact a lot of work has already been done both to create a more generic
> interface while somewhat keeping compatibility with legacy ones through a
> common call interface (``rte_eth_dev_filter_ctrl()`` with the
> ``.filter_ctrl`` PMD callback in ``rte_ethdev.h``).
> 
> Today, these previously incompatible interfaces are known as filter types
> (``RTE_ETH_FILTER_*`` from ``enum rte_filter_type`` in ``rte_eth_ctrl.h``).
> 
> However, while trivial to extend with new types, it only shifted the
> underlying problem as applications still need to be written for one kind of
> filter type, which, as described in the following sections, is not
> necessarily implemented by all PMDs that support filtering.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Filter types
> ------------
> 
> This section summarizes the capabilities of each filter type.
> 
> Although the following list is exhaustive, the description of individual
> types may contain inaccuracies due to the lack of documentation or usage
> examples.
> 
> Note: names are prefixed with ``RTE_ETH_FILTER_``.
> 
> ``MACVLAN``
> ~~~~~~~~~~~
> 
> Matching:
> 
> - L2 source/destination addresses.
> - Optional 802.1Q VLAN ID.
> - Masking individual fields on a rule basis is not supported.
> 
> Action:
> 
> - Packets are redirected either to a given VF device using its ID or to the
>   PF.
> 
> ``ETHERTYPE``
> ~~~~~~~~~~~~~
> 
> Matching:
> 
> - L2 source/destination addresses (optional).
> - Ethertype (no VLAN ID?).
> - Masking individual fields on a rule basis is not supported.
> 
> Action:
> 
> - Receive packets on a given queue.
> - Drop packets.
> 
> ``FLEXIBLE``
> ~~~~~~~~~~~~
> 
> Matching:
> 
> - At most 128 consecutive bytes anywhere in packets.
> - Masking is supported with byte granularity.
> - Priorities are supported (relative to this filter type, undefined
>   otherwise).
> 
> Action:
> 
> - Receive packets on a given queue.
> 
> ``SYN``
> ~~~~~~~
> 
> Matching:
> 
> - TCP SYN packets only.
> - One high priority bit can be set to give the highest possible priority to
>   this type when other filters with different types are configured.
> 
> Action:
> 
> - Receive packets on a given queue.
> 
> ``NTUPLE``
> ~~~~~~~~~~
> 
> Matching:
> 
> - Source/destination IPv4 addresses (optional in 2-tuple mode).
> - Source/destination TCP/UDP port (mandatory in 2 and 5-tuple modes).
> - L4 protocol (2 and 5-tuple modes).
> - Masking individual fields is supported.
> - TCP flags.
> - Up to 7 levels of priority relative to this filter type, undefined
>   otherwise.
> - No IPv6.
> 
> Action:
> 
> - Receive packets on a given queue.
> 
> ``TUNNEL``
> ~~~~~~~~~~
> 
> Matching:
> 
> - Outer L2 source/destination addresses.
> - Inner L2 source/destination addresses.
> - Inner VLAN ID.
> - IPv4/IPv6 source (destination?) address.
> - Tunnel type to match (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE,
>   802.1BR E-Tag).
> - Tenant ID for tunneling protocols that have one.
> - Any combination of the above can be specified.
> - Masking individual fields on a rule basis is not supported.
> 
> Action:
> 
> - Receive packets on a given queue.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``FDIR``
> ~~~~~~~~
> 
> Queries:
> 
> - Device capabilities and limitations.
> - Device statistics about configured filters (resource usage, collisions).
> - Device configuration (matching input set and masks)
> 
> Matching:
> 
> - Device mode of operation: none (to disable filtering), signature
>   (hash-based dispatching from masked fields) or perfect (either MAC VLAN
>   or tunnel).
> - L2 Ethertype.
> - Outer L2 destination address (MAC VLAN mode).
> - Inner L2 destination address, tunnel type (NVGRE, VXLAN) and tunnel ID
>   (tunnel mode).
> - IPv4 source/destination addresses, ToS, TTL and protocol fields.
> - IPv6 source/destination addresses, TC, protocol and hop limits fields.
> - UDP source/destination IPv4/IPv6 and ports.
> - TCP source/destination IPv4/IPv6 and ports.
> - SCTP source/destination IPv4/IPv6, ports and verification tag field.
> - Note, only one protocol type at once (either only L2 Ethertype, basic
>   IPv6, IPv4+UDP, IPv4+TCP and so on).
> - VLAN TCI (extended API).
> - At most 16 bytes to match in payload (extended API). A global device
>   look-up table specifies for each possible protocol layer (unknown, raw,
>   L2, L3, L4) the offset to use for each byte (they do not need to be
>   contiguous) and the related bitmask.
> - Whether packet is addressed to PF or VF, in that case its ID can be
>   matched as well (extended API).
> - Masking most of the above fields is supported, but simultaneously affects
>   all filters configured on a device.
> - Input set can be modified in a similar fashion for a given device to
>   ignore individual fields of filters (i.e. do not match the destination
>   address in an IPv4 filter, refer to **RTE_ETH_INPUT_SET_**
>   macros). Configuring this also affects RSS processing on **i40e**.
> - Filters can also provide 32 bits of arbitrary data to return as part of
>   matched packets.
> 
> Action:
> 
> - **RTE_ETH_FDIR_ACCEPT**: receive (accept) packet on a given queue.
> - **RTE_ETH_FDIR_REJECT**: drop packet immediately.
> - **RTE_ETH_FDIR_PASSTHRU**: similar to accept for the last filter in list,
>   otherwise process it with subsequent filters.
> - For accepted packets and if requested by filter, either 32 bits of
>   arbitrary data and four bytes of matched payload (only in case of flex
>   bytes matching), or eight bytes of matched payload (flex also) are added
>   to meta data.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``HASH``
> ~~~~~~~~
> 
> Not an actual filter type. Provides and retrieves the global device
> configuration (per port or entire NIC) for hash functions and their
> properties.
> 
> Hash function selection: "default" (keep current), XOR or Toeplitz.
> 
> This function can be configured per flow type (**RTE_ETH_FLOW_**
> definitions), supported types are:
> 
> - Unknown.
> - Raw.
> - Fragmented or non-fragmented IPv4.
> - Non-fragmented IPv4 with L4 (TCP, UDP, SCTP or other).
> - Fragmented or non-fragmented IPv6.
> - Non-fragmented IPv6 with L4 (TCP, UDP, SCTP or other).
> - L2 payload.
> - IPv6 with extensions.
> - IPv6 with L4 (TCP, UDP) and extensions.
> 
> ``L2_TUNNEL``
> ~~~~~~~~~~~~~
> 
> Matching:
> 
> - All packets received on a given port.
> 
> Action:
> 
> - Add tunnel encapsulation (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE,
>   802.1BR E-Tag) using the provided Ethertype and tunnel ID (only E-Tag
>   is implemented at the moment).
> - VF ID to use for tag insertion (currently unused).
> - Destination pool for tag-based forwarding (pools are IDs that can be
>   assigned to ports, duplication occurs if the same ID is shared by several
>   ports of the same NIC).
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Driver support
> --------------
> 
> ======== ======= ========= ======== === ====== ====== ==== ==== =========
> Driver   MACVLAN ETHERTYPE FLEXIBLE SYN NTUPLE TUNNEL FDIR HASH L2_TUNNEL
> ======== ======= ========= ======== === ====== ====== ==== ==== =========
> bnx2x
> cxgbe
> e1000            yes       yes      yes yes
> ena
> enic                                                  yes
> fm10k
> i40e     yes     yes                           yes    yes  yes
> ixgbe            yes                yes yes           yes       yes
> mlx4
> mlx5                                                  yes
> szedata2
> ======== ======= ========= ======== === ====== ====== ==== ==== =========
> 
> Flow director
> -------------
> 
> Flow director (FDIR) is the name of the most capable filter type, which
> covers most features offered by others. As such, it is the most widespread
> in PMDs that support filtering (i.e. all of them besides **e1000**).
> 
> It is also the only type that allows an arbitrary 32-bit value provided by
> applications to be attached to a filter and returned with matching packets
> instead of relying on the destination queue to recognize flows.
> 
> Unfortunately, even FDIR requires applications to be aware of low-level
> capabilities and limitations (most of which come directly from **ixgbe** and
> **i40e**):
> 
> - Bitmasks are set globally per device (port?), not per filter.
[Sugesh] This means an application cannot define filters that match on arbitrarily different offsets?
If that’s the case, I assume the application has to program the bitmask in advance. Otherwise, how does
the API framework deduce this bitmask information from the rules? It's not very clear to me
how an application passes down the bitmask information for multiple filters on the same port?
> - Configuration state is not expected to be saved by the driver, and
>   stopping/restarting a port requires the application to perform it again
>   (API documentation is also unclear about this).
> - Monolithic approach with ABI issues as soon as a new kind of flow or
>   combination needs to be supported.
> - Cryptic global statistics/counters.
> - Unclear about how priorities are managed; filters seem to be arranged as a
>   linked list in hardware (possibly related to configuration order).
> 
> Packet alteration
> -----------------
> 
> One interesting feature is that the L2 tunnel filter type implements the
> ability to alter incoming packets through a filter (in this case to
> encapsulate them); thus the **mlx5** flow encap/decap features are not a
> foreign concept.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Proposed API
> ============
> 
> Terminology
> -----------
> 
> - **Filtering API**: overall framework affecting the fate of selected
>   packets, covers everything described in this document.
> - **Matching pattern**: properties to look for in received packets, a
>   combination of any number of items.
> - **Pattern item**: part of a pattern that either matches packet data
>   (protocol header, payload or derived information), or specifies properties
>   of the pattern itself.
> - **Actions**: what needs to be done when a packet matches a pattern.
> - **Flow rule**: this is the result of combining a *matching pattern* with
>   *actions*.
> - **Filter rule**: a less generic term than *flow rule*, can otherwise be
>   used interchangeably.
> - **Hit**: a flow rule is said to be *hit* when processing a matching
>   packet.
> 
> Requirements
> ------------
> 
> As described in the previous section, there is a growing need for a common
> method to configure filtering and related actions in a hardware independent
> fashion.
> 
> The filtering API should not disallow any filter combination by design and
> must remain as simple as possible to use. It can simply be defined as a
> method to perform one or several actions on selected packets.
> 
> PMDs are aware of the capabilities of the device they manage and should be
> responsible for preventing unsupported or conflicting combinations.
> 
> This approach is fundamentally different as it places most of the burden on
> the software side of the PMD instead of having device capabilities directly
> mapped to API functions, then expecting applications to work around ensuing
> compatibility issues.
> 
> Requirements for a new API:
> 
> - Flexible and extensible without causing API/ABI problems for existing
>   applications.
> - Should be unambiguous and easy to use.
> - Support existing filtering features and actions listed in `Filter types`_.
> - Support packet alteration.
> - In case of overlapping filters, their priority should be well documented.
> - Support filter queries (for example to retrieve counters).
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> High level design
> -----------------
> 
> The approach chosen to make filtering as generic as possible is to express
> matching patterns through lists of items instead of the flat structures
> used in DPDK today, enabling combinations that are not predefined and thus
> more versatile.
> 
> Flow rules can have several distinct actions (such as counting,
> encapsulating, decapsulating before redirecting packets to a particular
> queue, etc.), instead of relying on several rules to achieve this and having
> applications deal with hardware implementation details regarding their
> order.
> 
> Support for different priority levels on a rule basis is provided, for
> example in order to force a more specific rule to come before a more
> generic one for packets matched by both; however, hardware support for
> more than a single priority level cannot be guaranteed. When supported,
> the number of available priority levels is usually low, which is why they
> can also be implemented in software by PMDs (e.g. to simulate missing
> priority levels by reordering rules).
> 
> In order to remain as hardware agnostic as possible, by default all rules
> are considered to have the same priority, which means that the order
> between overlapping rules (when a packet is matched by several filters) is
> undefined; packet duplication may even occur as a result.
> 
> PMDs may refuse to create overlapping rules at a given priority level when
> they can be detected (e.g. if a pattern matches an existing filter).
> 
> Thus predictable results for a given priority level can only be achieved
> with non-overlapping rules, using perfect matching on all protocol layers.
> 
> Support for multiple actions per rule may be implemented internally on top
> of non-default hardware priorities; as a result, both features may not be
> simultaneously available to applications.
> 
> Considering that allowed pattern/actions combinations cannot be known in
> advance and would result in an impractically large number of capabilities to
> expose, a method is provided to validate a given rule from the current
> device configuration state without actually adding it (akin to a "dry run"
> mode).
> 
> This enables applications to check if the rule types they need are supported
> at initialization time, before starting their data path. This method can be
> used anytime, its only requirement being that the resources needed by a
> rule
> must exist (e.g. a target RX queue must be configured first).
> 
> Each defined rule is associated with an opaque handle managed by the PMD;
> applications are responsible for keeping it. These handles can be used for
> queries and rule management, such as retrieving counters or other data and
> destroying the rules.
> 
> Handles must be destroyed before releasing associated resources such as
> queues.
> 
> Integration
> -----------
> 
> To avoid ABI breakage, this new interface will be implemented through the
> existing filtering control framework (``rte_eth_dev_filter_ctrl()``) using
> **RTE_ETH_FILTER_GENERIC** as a new filter type.
> 
> However, a public front-end API described in `Rules management`_ will
> be added as the preferred method to use it.
> 
> Once discussions with the community have converged to a definite API,
> legacy
> filter types should be deprecated and a deadline defined to remove their
> support entirely.
> 
> PMDs will have to be gradually converted to **RTE_ETH_FILTER_GENERIC**
> or
> drop filtering support entirely. Less maintained PMDs for older hardware
> may
> lose support at this point.
> 
> The notion of filter type will then be deprecated and subsequently dropped
> to avoid confusion between both frameworks.
> 
> Implementation details
> ======================
> 
> Flow rule
> ---------
> 
> A flow rule is the combination of a matching pattern with a list of actions,
> and is the basis of this API.
> 
> Priorities
> ~~~~~~~~~~
> 
> A priority can be assigned to a matching pattern.
> 
> The default priority level is 0 and is also the highest. Support for more
> than a single priority level in hardware is not guaranteed.
> 
> If a packet is matched by several filters at a given priority level, the
> outcome is undefined. It can take any path and can even be duplicated.
> 
> Matching pattern
> ~~~~~~~~~~~~~~~~
> 
> A matching pattern comprises any number of items of various types.
> 
> Items are arranged in a list to form a matching pattern for packets. They
> fall in two categories:
> 
> - Protocol matching (ANY, RAW, ETH, IPV4, IPV6, ICMP, UDP, TCP, VXLAN and
>   so on), usually associated with a specification structure. These must be
>   stacked in the same order as the protocol layers to match, starting from
>   L2.
> 
> - Affecting how the pattern is processed (END, VOID, INVERT, PF, VF,
>   SIGNATURE and so on), often without a specification structure. Since they
>   are meta data that does not match packet contents, these can be specified
>   anywhere within item lists without affecting the protocol matching items.
> 
> Most item specifications can be optionally paired with a mask to narrow the
> specific fields or bits to be matched.
> 
> - Items are defined with ``struct rte_flow_item``.
> - Patterns are defined with ``struct rte_flow_pattern``.
> 
> Example of an item specification matching an Ethernet header:
> 
> +-----------------------------------------+
> | Ethernet                                |
> +==========+=========+====================+
> | ``spec`` | ``src`` | ``00:01:02:03:04`` |
> |          +---------+--------------------+
> |          | ``dst`` | ``00:2a:66:00:01`` |
> +----------+---------+--------------------+
> | ``mask`` | ``src`` | ``00:ff:ff:ff:00`` |
> |          +---------+--------------------+
> |          | ``dst`` | ``00:00:00:00:ff`` |
> +----------+---------+--------------------+
> 
> Non-masked bits stand for any value; Ethernet headers with the following
> properties are thus matched:
> 
> - ``src``: ``??:01:02:03:??``
> - ``dst``: ``??:??:??:??:01``
> 
> Except for meta types that do not need one, ``spec`` must be a valid pointer
> to a structure of the related item type. A ``mask`` of the same type can be
> provided to tell which bits in ``spec`` are to be matched.
> 
> A mask is normally only needed for ``spec`` fields matching packet data,
> ignored otherwise. See individual item types for more information.
> 
> A ``NULL`` mask pointer is allowed and is similar to matching with a full
> mask (all ones) on the ``spec`` fields supported by hardware; the remaining
> fields are ignored (all zeroes), and there is thus no error checking for
> unsupported fields.
> 
> Matching pattern items for packet data must be naturally stacked (ordered
> from lowest to highest protocol layer), as in the following examples:
> 
> +--------------+
> | TCPv4 as L4  |
> +===+==========+
> | 0 | Ethernet |
> +---+----------+
> | 1 | IPv4     |
> +---+----------+
> | 2 | TCP      |
> +---+----------+
> 
> +----------------+
> | TCPv6 in VXLAN |
> +===+============+
> | 0 | Ethernet   |
> +---+------------+
> | 1 | IPv4       |
> +---+------------+
> | 2 | UDP        |
> +---+------------+
> | 3 | VXLAN      |
> +---+------------+
> | 4 | Ethernet   |
> +---+------------+
> | 5 | IPv6       |
> +---+------------+
> | 6 | TCP        |
> +---+------------+
> 
> +-----------------------------+
> | TCPv4 as L4 with meta items |
> +===+=========================+
> | 0 | VOID                    |
> +---+-------------------------+
> | 1 | Ethernet                |
> +---+-------------------------+
> | 2 | VOID                    |
> +---+-------------------------+
> | 3 | IPv4                    |
> +---+-------------------------+
> | 4 | TCP                     |
> +---+-------------------------+
> | 5 | VOID                    |
> +---+-------------------------+
> | 6 | VOID                    |
> +---+-------------------------+
> 
> The above example shows how meta items do not affect packet data matching
> items, as long as those remain stacked properly. The resulting matching
> pattern is identical to "TCPv4 as L4".
> 
> +----------------+
> | UDPv6 anywhere |
> +===+============+
> | 0 | IPv6       |
> +---+------------+
> | 1 | UDP        |
> +---+------------+
> 
> If supported by the PMD, omitting one or several protocol layers at the
> bottom of the stack as in the above example (missing an Ethernet
> specification) enables hardware to look anywhere in packets.
> 
> It is unspecified whether the payload of supported encapsulations
> (e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
> inner, outer or both packets.
> 
> +---------------------+
> | Invalid, missing L3 |
> +===+=================+
> | 0 | Ethernet        |
> +---+-----------------+
> | 1 | UDP             |
> +---+-----------------+
> 
> The above pattern is invalid due to a missing L3 specification between L2
> and L4. Omitting layers is only allowed at the bottom and at the top of the
> stack.
> 
> Meta item types
> ~~~~~~~~~~~~~~~
> 
> These do not match packet data but affect how the pattern is processed;
> most of them do not need a specification structure. This particularity
> allows them to be specified anywhere without affecting other item types.
> 
> ``END``
> ^^^^^^^
> 
> End marker for item lists. Prevents further processing of items, thereby
> ending the pattern.
> 
> - Its numeric value is **0** for convenience.
> - PMD support is mandatory.
> - Both ``spec`` and ``mask`` are ignored.
> 
> +--------------------+
> | END                |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
> 
> ``VOID``
> ^^^^^^^^
> 
> Used as a placeholder for convenience. It is ignored and simply discarded by
> PMDs.
> 
> - PMD support is mandatory.
> - Both ``spec`` and ``mask`` are ignored.
> 
> +--------------------+
> | VOID               |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
> 
> One usage example for this type is generating rules that share a common
> prefix quickly without reallocating memory, only by updating item types:
> 
> +------------------------+
> | TCP, UDP or ICMP as L4 |
> +===+====================+
> | 0 | Ethernet           |
> +---+--------------------+
> | 1 | IPv4               |
> +---+------+------+------+
> | 2 | UDP  | VOID | VOID |
> +---+------+------+------+
> | 3 | VOID | TCP  | VOID |
> +---+------+------+------+
> | 4 | VOID | VOID | ICMP |
> +---+------+------+------+
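>
> A sketch of this technique (assuming ``struct rte_flow_item`` carries a
> ``type`` field as described in `Matching pattern`_, with enumerator names
> following a ``RTE_FLOW_ITEM_TYPE_*`` convention; ``use_rule()`` is a
> hypothetical helper):
>
> ::
>
>   /* items[2..4] start as { UDP, VOID, VOID }, matching UDP as L4. */
>   use_rule(items);
>   /* Switch to { VOID, TCP, VOID } to match TCP with the same prefix. */
>   items[2].type = RTE_FLOW_ITEM_TYPE_VOID;
>   items[3].type = RTE_FLOW_ITEM_TYPE_TCP;
>   use_rule(items);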
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``INVERT``
> ^^^^^^^^^^
> 
> Inverted matching, i.e. process packets that do not match the pattern.
> 
> - Both ``spec`` and ``mask`` are ignored.
> 
> +--------------------+
> | INVERT             |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
> 
> Usage example in order to match non-TCPv4 packets only:
> 
> +--------------------+
> | Anything but TCPv4 |
> +===+================+
> | 0 | INVERT         |
> +---+----------------+
> | 1 | Ethernet       |
> +---+----------------+
> | 2 | IPv4           |
> +---+----------------+
> | 3 | TCP            |
> +---+----------------+
> 
> ``PF``
> ^^^^^^
> 
> Matches packets addressed to the physical function of the device.
> 
> - Both ``spec`` and ``mask`` are ignored.
> 
> +--------------------+
> | PF                 |
> +==========+=========+
> | ``spec`` | ignored |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
> 
> ``VF``
> ^^^^^^
> 
> Matches packets addressed to the given virtual function ID of the device.
> 
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
> 
> +----------------------------------------+
> | VF                                     |
> +==========+=========+===================+
> | ``spec`` | ``vf``  | destination VF ID |
> +----------+---------+-------------------+
> | ``mask`` | ignored                     |
> +----------+-----------------------------+
> 
> ``SIGNATURE``
> ^^^^^^^^^^^^^
> 
> Requests hash-based signature dispatching for this rule.
> 
> Considering this is a global setting on devices that support it, all
> subsequent filter rules may have to be created with it as well.
> 
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
> 
> +--------------------+
> | SIGNATURE          |
> +==========+=========+
> | ``spec`` | TBD     |
> +----------+---------+
> | ``mask`` | ignored |
> +----------+---------+
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Data matching item types
> ~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Most of these are basically protocol header definitions with associated
> bitmasks. They must be specified (stacked) from lowest to highest protocol
> layer.
> 
> The following list is not exhaustive as new protocols will be added in the
> future.
> 
> ``ANY``
> ^^^^^^^
> 
> Matches any protocol in place of the current layer; a single ANY may also
> stand for several protocol layers.
> 
> This is usually specified as the first pattern item when looking for a
> protocol anywhere in a packet.
> 
> - A maximum value of **0** requests matching any number of protocol layers
>   above or equal to the minimum value; a maximum value lower than the
>   minimum one is otherwise invalid.
> - Only ``spec`` needs to be defined, ``mask`` is ignored.
> 
> +-----------------------------------------------------------------------+
> | ANY                                                                   |
> +==========+=========+==================================================+
> | ``spec`` | ``min`` | minimum number of layers covered                 |
> |          +---------+--------------------------------------------------+
> |          | ``max`` | maximum number of layers covered, 0 for infinity |
> +----------+---------+--------------------------------------------------+
> | ``mask`` | ignored                                                    |
> +----------+------------------------------------------------------------+
> 
> Example for VXLAN TCP payload matching regardless of outer L3 (IPv4 or
> IPv6) and L4 (UDP) both matched by the first ANY specification, and inner
> L3 (IPv4 or IPv6) matched by the second ANY specification:
> 
> +----------------------------------+
> | TCP in VXLAN with wildcards      |
> +===+==============================+
> | 0 | Ethernet                     |
> +---+-----+----------+---------+---+
> | 1 | ANY | ``spec`` | ``min`` | 2 |
> |   |     |          +---------+---+
> |   |     |          | ``max`` | 2 |
> +---+-----+----------+---------+---+
> | 2 | VXLAN                        |
> +---+------------------------------+
> | 3 | Ethernet                     |
> +---+-----+----------+---------+---+
> | 4 | ANY | ``spec`` | ``min`` | 1 |
> |   |     |          +---------+---+
> |   |     |          | ``max`` | 1 |
> +---+-----+----------+---------+---+
> | 5 | TCP                          |
> +---+------------------------------+
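>
> A sketch of item 1 in the above pattern (``min``/``max`` fields follow the
> ANY specification table; the ``struct rte_flow_item_any`` name is an
> assumption based on the draft header):
>
> ::
>
>   static const struct rte_flow_item_any any_l3_l4 = {
>       .min = 2, /* cover exactly two layers (outer L3 and L4) */
>       .max = 2,
>   };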
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``RAW``
> ^^^^^^^
> 
> Matches a string of a given length at a given offset (in bytes), or anywhere
> in the payload of the current protocol layer (including L2 header if used as
> the first item in the stack).
> 
> This does not increment the protocol layer count as it is not a protocol
> definition. Subsequent RAW items modulate the first absolute one with
> relative offsets.
> 
> - Using **-1** as the ``offset`` of the first RAW item makes its absolute
>   offset not fixed, i.e. the pattern is searched everywhere.
> - ``mask`` only affects the pattern.
> 
> +--------------------------------------------------------------+
> | RAW                                                          |
> +==========+=============+=====================================+
> | ``spec`` | ``offset``  | absolute or relative pattern offset |
> |          +-------------+-------------------------------------+
> |          | ``length``  | pattern length                      |
> |          +-------------+-------------------------------------+
> |          | ``pattern`` | byte string of the above length     |
> +----------+-------------+-------------------------------------+
> | ``mask`` | ``offset``  | ignored                             |
> |          +-------------+-------------------------------------+
> |          | ``length``  | ignored                             |
> |          +-------------+-------------------------------------+
> |          | ``pattern`` | bitmask with the same byte length   |
> +----------+-------------+-------------------------------------+
> 
> Example pattern looking for several strings at various offsets of a UDP
> payload, using combined RAW items:
> 
> +------------------------------------------+
> | UDP payload matching                     |
> +===+======================================+
> | 0 | Ethernet                             |
> +---+--------------------------------------+
> | 1 | IPv4                                 |
> +---+--------------------------------------+
> | 2 | UDP                                  |
> +---+-----+----------+-------------+-------+
> | 3 | RAW | ``spec`` | ``offset``  | -1    |
> |   |     |          +-------------+-------+
> |   |     |          | ``length``  | 3     |
> |   |     |          +-------------+-------+
> |   |     |          | ``pattern`` | "foo" |
> +---+-----+----------+-------------+-------+
> | 4 | RAW | ``spec`` | ``offset``  | 20    |
> |   |     |          +-------------+-------+
> |   |     |          | ``length``  | 3     |
> |   |     |          +-------------+-------+
> |   |     |          | ``pattern`` | "bar" |
> +---+-----+----------+-------------+-------+
> | 5 | RAW | ``spec`` | ``offset``  | -30   |
> |   |     |          +-------------+-------+
> |   |     |          | ``length``  | 3     |
> |   |     |          +-------------+-------+
> |   |     |          | ``pattern`` | "baz" |
> +---+-----+----------+-------------+-------+
> 
> This translates to:
> 
> - Locate "foo" in UDP payload, remember its offset.
> - Check "bar" at "foo"'s offset plus 20 bytes.
> - Check "baz" at "foo"'s offset minus 30 bytes.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``ETH``
> ^^^^^^^
> 
> Matches an Ethernet header.
> 
> - ``dst``: destination MAC.
> - ``src``: source MAC.
> - ``type``: EtherType.
> - ``tags``: number of 802.1Q/ad tags defined.
> - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> 
>  - ``tpid``: Tag protocol identifier.
>  - ``tci``: Tag control information.
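> 
> For example, matching packets sent to a single destination MAC address
> regardless of any other field could use the following spec/mask pair. The
> structure shown is again an assumption based on the field list above, not
> the actual draft header definition:
> 
> ::
> 
>  #include <stdint.h>
> 
>  /* Hypothetical layout based on the ETH fields above (tags omitted). */
>  struct rte_flow_item_eth {
>      uint8_t dst[6]; /* destination MAC */
>      uint8_t src[6]; /* source MAC */
>      uint16_t type;  /* EtherType */
>  };
> 
>  static const struct rte_flow_item_eth eth_spec = {
>      .dst = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 },
>  };
>  /* Nonzero mask bits select the spec bits that must match. */
>  static const struct rte_flow_item_eth eth_mask = {
>      .dst = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
>  };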
> 
> ``IPV4``
> ^^^^^^^^
> 
> Matches an IPv4 header.
> 
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tos``: ToS/DSCP field.
> - ``ttl``: TTL field.
> - ``proto``: protocol number for the next layer.
> 
> ``IPV6``
> ^^^^^^^^
> 
> Matches an IPv6 header.
> 
> - ``src``: source IP address.
> - ``dst``: destination IP address.
> - ``tc``: traffic class field.
> - ``nh``: Next header field (protocol).
> - ``hop_limit``: hop limit field (TTL).
> 
> ``ICMP``
> ^^^^^^^^
> 
> Matches an ICMP header.
> 
> - TBD.
> 
> ``UDP``
> ^^^^^^^
> 
> Matches a UDP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - ``length``: UDP length.
> - ``checksum``: UDP checksum.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``TCP``
> ^^^^^^^
> 
> Matches a TCP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - All other TCP fields and bits.
> 
> ``VXLAN``
> ^^^^^^^^^
> 
> Matches a VXLAN header.
> 
> - TBD.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Actions
> ~~~~~~~
> 
> Each possible action is represented by a type. Some have associated
> configuration structures. Several actions can be combined in a list and
> assigned to a flow rule. That list is not ordered.
> 
> At least one action must be defined in a filter rule in order to do
> something with matched packets.
> 
> - Actions are defined with ``struct rte_flow_action``.
> - A list of actions is defined with ``struct rte_flow_actions``.
> 
> They fall in three categories:
> 
> - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
>   matched packets from being processed by subsequent flow rules, unless
>   overridden with PASSTHRU.
> 
> - Non-terminating actions (PASSTHRU, DUP) that leave matched packets up for
>   additional processing by subsequent flow rules.
> 
> - Other non-terminating meta actions that do not affect the fate of packets
>   (END, VOID, ID, COUNT).
> 
> When several actions are combined in a flow rule, they should all have
> different types (e.g. dropping a packet twice is not possible); the VOID
> type is an exception to this rule. Should duplicate types be present
> anyway, the defined behavior is for PMDs to take into account only the last
> action of a given type found in the list. PMDs still perform error checking
> on the entire list.
> 
> *Note that PASSTHRU is the only action able to override a terminating rule.*
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Example of an action that redirects packets to queue index 10:
> 
> +----------------+
> | QUEUE          |
> +===========+====+
> | ``queue`` | 10 |
> +-----------+----+
> 
> Examples of action lists follow. Their order is not significant;
> applications must consider all actions to be performed simultaneously:
> 
> +----------------+
> | Count and drop |
> +=======+========+
> | COUNT |        |
> +-------+--------+
> | DROP  |        |
> +-------+--------+
> 
> +--------------------------+
> | Tag, count and redirect  |
> +=======+===========+======+
> | ID    | ``id``    | 0x2a |
> +-------+-----------+------+
> | COUNT |                  |
> +-------+-----------+------+
> | QUEUE | ``queue`` | 10   |
> +-------+-----------+------+
> 
> +-----------------------+
> | Redirect to queue 5   |
> +=======+===============+
> | DROP  |               |
> +-------+-----------+---+
> | QUEUE | ``queue`` | 5 |
> +-------+-----------+---+
> 
> In the above example, considering both actions are performed simultaneously,
> the end result is that only QUEUE has any effect.
> 
> +-----------------------+
> | Redirect to queue 3   |
> +=======+===========+===+
> | QUEUE | ``queue`` | 5 |
> +-------+-----------+---+
> | VOID  |               |
> +-------+-----------+---+
> | QUEUE | ``queue`` | 3 |
> +-------+-----------+---+
> 
> As previously described, only the last action of a given type found in the
> list is taken into account. The above example also shows that VOID is
> ignored.
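> 
> As a sketch, the "tag, count and redirect" list above might be built as
> follows. The entry layout (a type plus an optional configuration pointer),
> the configuration structures and the ``RTE_FLOW_ACTION_TYPE_*`` constant
> names are assumptions made for illustration; the draft header is
> authoritative:
> 
> ::
> 
>  #include <stddef.h>
>  #include <stdint.h>
> 
>  /* Assumed action type constants (END is 0 as described below). */
>  enum rte_flow_action_type {
>      RTE_FLOW_ACTION_TYPE_END,
>      RTE_FLOW_ACTION_TYPE_VOID,
>      RTE_FLOW_ACTION_TYPE_ID,
>      RTE_FLOW_ACTION_TYPE_QUEUE,
>      RTE_FLOW_ACTION_TYPE_COUNT,
>  };
> 
>  /* Assumed per-action configuration structures. */
>  struct rte_flow_action_id { uint32_t id; };
>  struct rte_flow_action_queue { uint16_t queue; };
> 
>  /* Assumed action entry layout. */
>  struct rte_flow_action {
>      enum rte_flow_action_type type;
>      const void *conf;
>  };
> 
>  static const struct rte_flow_action_id id_conf = { .id = 0x2a };
>  static const struct rte_flow_action_queue queue_conf = { .queue = 10 };
> 
>  static const struct rte_flow_action action_list[] = {
>      { RTE_FLOW_ACTION_TYPE_ID, &id_conf },       /* tag with 0x2a */
>      { RTE_FLOW_ACTION_TYPE_COUNT, NULL },        /* enable hit counter */
>      { RTE_FLOW_ACTION_TYPE_QUEUE, &queue_conf }, /* redirect to queue 10 */
>      { RTE_FLOW_ACTION_TYPE_END, NULL },          /* end marker */
>  };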
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Action types
> ~~~~~~~~~~~~
> 
> Common action types are described in this section. Like pattern item types,
> this list is not exhaustive as new actions will be added in the future.
> 
> ``END`` (action)
> ^^^^^^^^^^^^^^^^
> 
> End marker for action lists. Prevents further processing of actions, thereby
> ending the list.
> 
> - Its numeric value is **0** for convenience.
> - PMD support is mandatory.
> - No configurable property.
> 
> +---------------+
> | END           |
> +===============+
> | no properties |
> +---------------+
> 
> ``VOID`` (action)
> ^^^^^^^^^^^^^^^^^
> 
> Used as a placeholder for convenience. It is ignored and simply discarded by
> PMDs.
> 
> - PMD support is mandatory.
> - No configurable property.
> 
> +---------------+
> | VOID          |
> +===============+
> | no properties |
> +---------------+
> 
> ``PASSTHRU``
> ^^^^^^^^^^^^
> 
> Leaves packets up for additional processing by subsequent flow rules. This
> is the default when a rule does not contain a terminating action, but can be
> specified to force a rule to become non-terminating.
> 
> - No configurable property.
> 
> +---------------+
> | PASSTHRU      |
> +===============+
> | no properties |
> +---------------+
> 
> Example to copy a packet to a queue and continue processing by subsequent
> flow rules:
[Sugesh] If a packet gets copied to a queue, isn't that a terminating
action? How is it possible to perform subsequent actions after the packet
has already been moved to the queue? How does this differ from the DUP
action? Am I missing anything here?
> 
> +--------------------------+
> | Copy to queue 8          |
> +==========+===============+
> | PASSTHRU |               |
> +----------+-----------+---+
> | QUEUE    | ``queue`` | 8 |
> +----------+-----------+---+
> 
> ``ID``
> ^^^^^^
> 
> Attaches a 32 bit value to packets.
> 
> +----------------------------------------------+
> | ID                                           |
> +========+=====================================+
> | ``id`` | 32 bit value to return with packets |
> +--------+-------------------------------------+
> 
[Sugesh] I assume the application has to program the flow
with a unique ID, and matching packets are stamped with this ID
when reported to the software. The uniqueness of the ID is NOT
guaranteed by the API framework. Correct me if I am wrong here.

[Sugesh] Is it a limitation to use only a 32-bit ID? Is it possible to have
a 64-bit ID, so that the application can use the control plane flow pointer
itself as an ID? Does that make sense?


> .. raw:: pdf
> 
>    PageBreak
> 
> ``QUEUE``
> ^^^^^^^^^
> 
> Assigns packets to a given queue index.
> 
> - Terminating by default.
> 
> +--------------------------------+
> | QUEUE                          |
> +===========+====================+
> | ``queue`` | queue index to use |
> +-----------+--------------------+
> 
> ``DROP``
> ^^^^^^^^
> 
> Drop packets.
> 
> - No configurable property.
> - Terminating by default.
> - PASSTHRU overrides this action if both are specified.
> 
> +---------------+
> | DROP          |
> +===============+
> | no properties |
> +---------------+
> 
> ``COUNT``
> ^^^^^^^^^
> 
[Sugesh] Do we really have to set the COUNT action explicitly for every
rule? IMHO it would be great to have it as an implicit action, since most
applications are interested in the stats of almost all their filters/flows.
> Enables hits counter for this rule.
> 
> This counter can be retrieved and reset through ``rte_flow_query()``, see
> ``struct rte_flow_query_count``.
> 
> - Counters can be retrieved with ``rte_flow_query()``.
> - No configurable property.
> 
> +---------------+
> | COUNT         |
> +===============+
> | no properties |
> +---------------+
> 
> Query structure to retrieve and reset the flow rule hits counter:
> 
> +------------------------------------------------+
> | COUNT query                                    |
> +===========+=====+==============================+
> | ``reset`` | in  | reset counter after query    |
> +-----------+-----+------------------------------+
> | ``hits``  | out | number of hits for this flow |
> +-----------+-----+------------------------------+
> 
> ``DUP``
> ^^^^^^^
> 
> Duplicates packets to a given queue index.
> 
> This is normally combined with QUEUE; however, when used alone, it is
> actually similar to QUEUE + PASSTHRU.
> 
> - Non-terminating by default.
> 
> +------------------------------------------------+
> | DUP                                            |
> +===========+====================================+
> | ``queue`` | queue index to duplicate packet to |
> +-----------+------------------------------------+
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``RSS``
> ^^^^^^^
> 
> Similar to QUEUE, except RSS is additionally performed on packets to spread
> them among several queues according to the provided parameters.
> 
> - Terminating by default.
> 
> +---------------------------------------------+
> | RSS                                         |
> +==============+==============================+
> | ``rss_conf`` | RSS parameters               |
> +--------------+------------------------------+
> | ``queues``   | number of entries in queue[] |
> +--------------+------------------------------+
> | ``queue[]``  | queue indices to use         |
> +--------------+------------------------------+
> 
> ``PF`` (action)
> ^^^^^^^^^^^^^^^
> 
> Redirects packets to the physical function (PF) of the current device.
> 
> - No configurable property.
> - Terminating by default.
> 
> +---------------+
> | PF            |
> +===============+
> | no properties |
> +---------------+
> 
> ``VF`` (action)
> ^^^^^^^^^^^^^^^
> 
> Redirects packets to the virtual function (VF) of the current device with
> the specified ID.
> 
> - Terminating by default.
> 
> +---------------------------------------+
> | VF                                    |
> +========+==============================+
> | ``id`` | VF ID to redirect packets to |
> +--------+------------------------------+
> 
> Planned types
> ~~~~~~~~~~~~~
> 
> Other action types are planned but not defined yet. These actions will add
> the ability to alter matching packets in several ways, such as performing
> encapsulation/decapsulation of tunnel headers on specific flows.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Rules management
> ----------------
> 
> A simple API with only four functions is provided to fully manage flows.
> 
> Each created flow rule is associated with an opaque, PMD-specific handle
> pointer. The application is responsible for keeping it until the rule is
> destroyed.
> 
> Flow rules are defined with ``struct rte_flow``.
> 
> Validation
> ~~~~~~~~~~
> 
> Given that expressing a definite set of device capabilities with this API is
> not practical, a dedicated function is provided to check if a flow rule is
> supported and can be created.
> 
> ::
> 
>  int
>  rte_flow_validate(uint8_t port_id,
>                    const struct rte_flow_pattern *pattern,
>                    const struct rte_flow_actions *actions);
> 
> While this function has no effect on the target device, the flow rule is
> validated against its current configuration state and the returned value
> should be considered valid by the caller for that state only.
> 
> The returned value is guaranteed to remain valid only as long as no
> successful calls to rte_flow_create() or rte_flow_destroy() are made in the
> meantime and no device parameter affecting flow rules in any way is
> modified, due to possible collisions or resource limitations (although in
> such cases ``EINVAL`` should not be returned).
> 
> Arguments:
> 
> - ``port_id``: port identifier of Ethernet device.
> - ``pattern``: pattern specification to check.
> - ``actions``: actions associated with the flow definition.
> 
> Return value:
> 
> - **0** if the flow rule is valid and can be created. A negative errno value
>   otherwise (``rte_errno`` is also set); the following errors are defined.
> - ``-EINVAL``: unknown or invalid rule specification.
> - ``-ENOTSUP``: valid but unsupported rule specification (e.g. partial masks
>   are unsupported).
> - ``-EEXIST``: collision with an existing rule.
> - ``-ENOMEM``: not enough resources.
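> 
> A typical use is probing for device capabilities at initialization time, as
> in the following sketch (``pattern`` and ``actions`` are assumed to have
> been built beforehand):
> 
> ::
> 
>  #include <errno.h>
>  #include <stdint.h>
>  #include <stdio.h>
>  #include <string.h>
> 
>  /* Return 0 when the rule can be created in the current device state. */
>  static int
>  probe_rule(uint8_t port_id,
>             const struct rte_flow_pattern *pattern,
>             const struct rte_flow_actions *actions)
>  {
>      int ret = rte_flow_validate(port_id, pattern, actions);
> 
>      if (ret == -ENOTSUP)
>          printf("rule is valid but unsupported on port %u\n",
>                 (unsigned int)port_id);
>      else if (ret)
>          printf("rule cannot be created: %s\n", strerror(-ret));
>      return ret;
>  }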
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Creation
> ~~~~~~~~
> 
> Creating a flow rule is similar to validating one, except the rule is
> actually created.
> 
> ::
> 
>  struct rte_flow *
>  rte_flow_create(uint8_t port_id,
>                  const struct rte_flow_pattern *pattern,
>                  const struct rte_flow_actions *actions);
> 
> Arguments:
> 
> - ``port_id``: port identifier of Ethernet device.
> - ``pattern``: pattern specification to add.
> - ``actions``: actions associated with the flow definition.
> 
> Return value:
> 
> A valid flow pointer in case of success, NULL otherwise and ``rte_errno`` is
> set to the positive version of one of the error codes defined for
> ``rte_flow_validate()``.
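> 
> For instance, a small creation helper reporting errors through
> ``rte_errno`` might look like this sketch:
> 
> ::
> 
>  #include <stdint.h>
>  #include <stdio.h>
>  #include <string.h>
> 
>  #include <rte_errno.h>
> 
>  static struct rte_flow *
>  create_or_report(uint8_t port_id,
>                   const struct rte_flow_pattern *pattern,
>                   const struct rte_flow_actions *actions)
>  {
>      struct rte_flow *flow = rte_flow_create(port_id, pattern, actions);
> 
>      if (flow == NULL)
>          /* rte_errno holds the positive version of the error code. */
>          printf("flow creation failed on port %u: %s\n",
>                 (unsigned int)port_id, strerror(rte_errno));
>      /* The handle must be kept until rte_flow_destroy() is called. */
>      return flow;
>  }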
[Sugesh] Kind of an implementation-specific query: what if an application
tries to add duplicate rules? Does the API create a new flow entry for every
API call?
[Sugesh] Another concern is the cost and time of installing these rules
in the hardware. Can we make these APIs time bound (or at least provide an
option to set a time limit for executing these APIs), so that the
application doesn't have to wait so long when installing and deleting flows
on slow hardware/NICs? What do you think? Most datapath flow installations
are dynamic and triggered only when there is ingress traffic. Delays in flow
insertion/deletion have unpredictable consequences.

[Sugesh] Another query is on the synchronization part. What if the same
rules are handled from different threads? Is the application responsible for
handling concurrent hardware programming?

> 
> Destruction
> ~~~~~~~~~~~
> 
> Flow rule destruction is not automatic, and a queue should not be released
> if any are still attached to it. Applications must take care of performing
> this step before releasing resources.
> 
> ::
> 
>  int
>  rte_flow_destroy(uint8_t port_id,
>                   struct rte_flow *flow);
> 
> 
[Sugesh] I would suggest that a clean-up API would be really useful, as
releasing a queue (does this apply to releasing a port too?) does not
guarantee automatic flow destruction. This way an application can initialize
the port, clean up all the existing rules and create new rules on a clean slate.

> Failure to destroy a flow rule may occur when other flow rules depend on it,
> and destroying it would result in an inconsistent state.
> 
> This function is only guaranteed to succeed if flow rules are destroyed in
> reverse order of their creation.
> 
> Arguments:
> 
> - ``port_id``: port identifier of Ethernet device.
> - ``flow``: flow rule to destroy.
> 
> Return value:
> 
> - **0** on success, a negative errno value otherwise and ``rte_errno`` is
>   set.
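> 
> Since destruction is only guaranteed to succeed in reverse creation order,
> applications may keep their handles in a stack-like array, as in this
> sketch:
> 
> ::
> 
>  #include <stdint.h>
>  #include <stdio.h>
>  #include <string.h>
> 
>  /* Destroy rules in reverse order of their creation. */
>  static int
>  destroy_all(uint8_t port_id, struct rte_flow *flows[], unsigned int n)
>  {
>      while (n--) {
>          int ret = rte_flow_destroy(port_id, flows[n]);
> 
>          if (ret) {
>              printf("cannot destroy rule #%u: %s\n", n, strerror(-ret));
>              return ret;
>          }
>      }
>      return 0;
>  }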
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Query
> ~~~~~
> 
> Query an existing flow rule.
> 
> This function allows retrieving flow-specific data such as counters. Data
> is gathered by special actions which must be present in the flow rule
> definition.
> 
> ::
> 
>  int
>  rte_flow_query(uint8_t port_id,
>                 struct rte_flow *flow,
>                 enum rte_flow_action_type action,
>                 void *data);
> 
> Arguments:
> 
> - ``port_id``: port identifier of Ethernet device.
> - ``flow``: flow rule to query.
> - ``action``: action type to query.
> - ``data``: pointer to storage for the associated query data type.
> 
> Return value:
> 
> - **0** on success, a negative errno value otherwise and ``rte_errno`` is
>   set.
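> 
> As a sketch, retrieving and resetting the counter enabled by a `COUNT`_
> action could look as follows. The layout of ``struct rte_flow_query_count``
> (and the ``RTE_FLOW_ACTION_TYPE_COUNT`` constant name) is assumed from the
> query table in the `COUNT`_ section:
> 
> ::
> 
>  #include <stdint.h>
>  #include <stdio.h>
> 
>  /* Assumed layout, following the COUNT query table. */
>  struct rte_flow_query_count {
>      uint32_t reset:1; /* in: reset counter after query */
>      uint64_t hits;    /* out: number of hits for this flow */
>  };
> 
>  static int
>  report_hits(uint8_t port_id, struct rte_flow *flow)
>  {
>      struct rte_flow_query_count count = { .reset = 1 };
>      int ret = rte_flow_query(port_id, flow,
>                               RTE_FLOW_ACTION_TYPE_COUNT, &count);
> 
>      if (!ret)
>          printf("flow matched %llu packets\n",
>                 (unsigned long long)count.hits);
>      return ret;
>  }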
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Behavior
> --------
> 
> - API operations are synchronous and blocking (``EAGAIN`` cannot be
>   returned).
> 
> - There is no provision for reentrancy/multi-thread safety, although nothing
>   should prevent different devices from being configured at the same
>   time. PMDs may protect their control path functions accordingly.
> 
> - Stopping the data path (TX/RX) should not be necessary when managing flow
>   rules. If this cannot be achieved naturally or with workarounds (such as
>   temporarily replacing the burst function pointers), an appropriate error
>   code must be returned (``EBUSY``).
> 
> - PMDs, not applications, are responsible for maintaining flow rules
>   configuration when stopping and restarting a port or performing other
>   actions which may affect them. They can only be destroyed explicitly.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
[Sugesh] What about querying all the rules for a specific port/queue? Useful
when adding and deleting ports and queues dynamically according to need. I am
not sure what the other use cases for such an API are, but I feel it would
make it much easier to manage flows from the application. What do you think?
> Compatibility
> -------------
> 
> No known hardware implementation supports all the features described in this
> document.
> 
> Unsupported features or combinations are not expected to be fully emulated
> in software by PMDs for performance reasons. Partially supported features
> may be completed in software as long as hardware performs most of the work
> (such as queue redirection and packet recognition).
> 
> However PMDs are expected to do their best to satisfy application requests
> by working around hardware limitations as long as doing so does not affect
> the behavior of existing flow rules.
> 
> The following sections provide a few examples of such cases; they are based
> on limitations built into the previous APIs.
> 
> Global bitmasks
> ~~~~~~~~~~~~~~~
> 
> Each flow rule comes with its own, per-layer bitmasks, while hardware may
> support only a single, device-wide bitmask for a given layer type, so that
> two IPv4 rules cannot use different bitmasks.
> 
> The expected behavior in this case is that PMDs automatically configure
> global bitmasks according to the needs of the first created flow rule.
> 
> Subsequent rules are allowed only if their bitmasks match those; the
> ``EEXIST`` error code should be returned otherwise.
> 
> Unsupported layer types
> ~~~~~~~~~~~~~~~~~~~~~~~
> 
> Many protocols can be simulated by crafting patterns with the `RAW`_ type.
> 
> PMDs can rely on this capability to simulate support for protocols with
> fixed headers not directly recognized by hardware.
> 
> ``ANY`` pattern item
> ~~~~~~~~~~~~~~~~~~~~
> 
> This pattern item stands for anything, which can be difficult to translate
> to something hardware would understand, particularly if followed by more
> specific types.
> 
> Consider the following pattern:
> 
> +---+--------------------------------+
> | 0 | ETHER                          |
> +---+--------------------------------+
> | 1 | ANY (``min`` = 1, ``max`` = 1) |
> +---+--------------------------------+
> | 2 | TCP                            |
> +---+--------------------------------+
> 
> Knowing that TCP does not make sense with something other than IPv4 and IPv6
> as L3, such a pattern may be translated to two flow rules instead:
> 
> +---+--------------------+
> | 0 | ETHER              |
> +---+--------------------+
> | 1 | IPV4 (zeroed mask) |
> +---+--------------------+
> | 2 | TCP                |
> +---+--------------------+
> 
> +---+--------------------+
> | 0 | ETHER              |
> +---+--------------------+
> | 1 | IPV6 (zeroed mask) |
> +---+--------------------+
> | 2 | TCP                |
> +---+--------------------+
> 
> Note that as soon as an ANY rule covers several layers, this approach may
> yield a large number of hidden flow rules. It is thus suggested to only
> support the most common scenarios (anything as L2 and/or L3).
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> Unsupported actions
> ~~~~~~~~~~~~~~~~~~~
> 
> - When combined with a `QUEUE`_ action, packet counting (`COUNT`_) and
>   tagging (`ID`_) may be implemented in software as long as the target queue
>   is used by a single rule.
> 
> - A rule specifying both `DUP`_ + `QUEUE`_ may be translated to two hidden
>   rules combining `QUEUE`_ and `PASSTHRU`_.
> 
> - When a single target queue is provided, `RSS`_ can also be implemented
>   through `QUEUE`_.
> 
> Flow rules priority
> ~~~~~~~~~~~~~~~~~~~
> 
> While it would naturally make sense, flow rules cannot be assumed to be
> processed by hardware in the same order as their creation for several
> reasons:
> 
> - They may be managed internally as a tree or a hash table instead of a
>   list.
> - Removing a flow rule before adding another one can either put the new rule
>   at the end of the list or reuse a freed entry.
> - Duplication may occur when packets are matched by several rules.
> 
> For overlapping rules (particularly in order to use the `PASSTHRU`_ action)
> predictable behavior is only guaranteed by using different priority levels.
> 
> Priority levels are not necessarily implemented in hardware, or may be
> severely limited (e.g. a single priority bit).
> 
> For these reasons, priority levels may be implemented purely in software by
> PMDs.
> 
> - For devices expecting flow rules to be added in the correct order, PMDs
>   may destroy and re-create existing rules after adding a new one with
>   a higher priority.
> 
> - A configurable number of dummy or empty rules can be created at
>   initialization time to save high priority slots for later.
> 
> - In order to save priority levels, PMDs may evaluate whether rules are
>   likely to collide and adjust their priority accordingly.
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> API migration
> =============
> 
> Exhaustive list of deprecated filter types and how to convert them to
> generic flow rules.
> 
> ``MACVLAN`` to ``ETH`` → ``VF``, ``PF``
> ---------------------------------------
> 
> `MACVLAN`_ can be translated to a basic `ETH`_ flow rule with a `VF
> (action)`_ or `PF (action)`_ terminating action.
> 
> +------------------------------------+
> | MACVLAN                            |
> +--------------------------+---------+
> | Pattern                  | Actions |
> +===+=====+==========+=====+=========+
> | 0 | ETH | ``spec`` | any | VF,     |
> |   |     +----------+-----+ PF      |
> |   |     | ``mask`` | any |         |
> +---+-----+----------+-----+---------+
> 
> ``ETHERTYPE`` to ``ETH`` → ``QUEUE``, ``DROP``
> ----------------------------------------------
> 
> `ETHERTYPE`_ is basically an `ETH`_ flow rule with `QUEUE`_ or `DROP`_ as
> a terminating action.
> 
> +------------------------------------+
> | ETHERTYPE                          |
> +--------------------------+---------+
> | Pattern                  | Actions |
> +===+=====+==========+=====+=========+
> | 0 | ETH | ``spec`` | any | QUEUE,  |
> |   |     +----------+-----+ DROP    |
> |   |     | ``mask`` | any |         |
> +---+-----+----------+-----+---------+
> 
> ``FLEXIBLE`` to ``RAW`` → ``QUEUE``
> -----------------------------------
> 
> `FLEXIBLE`_ can be translated to one `RAW`_ pattern with `QUEUE`_ as the
> terminating action and a defined priority level.
> 
> +------------------------------------+
> | FLEXIBLE                           |
> +--------------------------+---------+
> | Pattern                  | Actions |
> +===+=====+==========+=====+=========+
> | 0 | RAW | ``spec`` | any | QUEUE   |
> |   |     +----------+-----+         |
> |   |     | ``mask`` | any |         |
> +---+-----+----------+-----+---------+
> 
> ``SYN`` to ``TCP`` → ``QUEUE``
> ------------------------------
> 
> `SYN`_ is a `TCP`_ rule with only the ``syn`` bit enabled and masked, and
> `QUEUE`_ as the terminating action.
> 
> Priority level can be set to simulate the high priority bit.
> 
> +---------------------------------------------+
> | SYN                                         |
> +-----------------------------------+---------+
> | Pattern                           | Actions |
> +===+======+==========+=============+=========+
> | 0 | ETH  | ``spec`` | N/A         | QUEUE   |
> |   |      +----------+-------------+         |
> |   |      | ``mask`` | empty       |         |
> +---+------+----------+-------------+         |
> | 1 | IPV4 | ``spec`` | N/A         |         |
> |   |      +----------+-------------+         |
> |   |      | ``mask`` | empty       |         |
> +---+------+----------+-------------+         |
> | 2 | TCP  | ``spec`` | ``syn`` = 1 |         |
> |   |      +----------+-------------+         |
> |   |      | ``mask`` | ``syn`` = 1 |         |
> +---+------+----------+-------------+---------+
> 
> ``NTUPLE`` to ``IPV4``, ``TCP``, ``UDP`` → ``QUEUE``
> ----------------------------------------------------
> 
> `NTUPLE`_ is similar to specifying an empty L2, `IPV4`_ as L3 with `TCP`_ or
> `UDP`_ as L4 and `QUEUE`_ as the terminating action.
> 
> A priority level can be specified as well.
> 
> +---------------------------------------+
> | NTUPLE                                |
> +-----------------------------+---------+
> | Pattern                     | Actions |
> +===+======+==========+=======+=========+
> | 0 | ETH  | ``spec`` | N/A   | QUEUE   |
> |   |      +----------+-------+         |
> |   |      | ``mask`` | empty |         |
> +---+------+----------+-------+         |
> | 1 | IPV4 | ``spec`` | any   |         |
> |   |      +----------+-------+         |
> |   |      | ``mask`` | any   |         |
> +---+------+----------+-------+         |
> | 2 | TCP, | ``spec`` | any   |         |
> |   | UDP  +----------+-------+         |
> |   |      | ``mask`` | any   |         |
> +---+------+----------+-------+---------+
> 
> ``TUNNEL`` to ``ETH``, ``IPV4``, ``IPV6``, ``VXLAN`` (or other) → ``QUEUE``
> ---------------------------------------------------------------------------
> 
> `TUNNEL`_ matches common IPv4 and IPv6 L3/L4-based tunnel types.
> 
> In the following table, `ANY`_ is used to cover the optional L4.
> 
> +------------------------------------------------+
> | TUNNEL                                         |
> +--------------------------------------+---------+
> | Pattern                              | Actions |
> +===+=========+==========+=============+=========+
> | 0 | ETH     | ``spec`` | any         | QUEUE   |
> |   |         +----------+-------------+         |
> |   |         | ``mask`` | any         |         |
> +---+---------+----------+-------------+         |
> | 1 | IPV4,   | ``spec`` | any         |         |
> |   | IPV6    +----------+-------------+         |
> |   |         | ``mask`` | any         |         |
> +---+---------+----------+-------------+         |
> | 2 | ANY     | ``spec`` | ``min`` = 0 |         |
> |   |         |          +-------------+         |
> |   |         |          | ``max`` = 0 |         |
> |   |         +----------+-------------+         |
> |   |         | ``mask`` | N/A         |         |
> +---+---------+----------+-------------+         |
> | 3 | VXLAN,  | ``spec`` | any         |         |
> |   | GENEVE, +----------+-------------+         |
> |   | TEREDO, | ``mask`` | any         |         |
> |   | NVGRE,  |          |             |         |
> |   | GRE,    |          |             |         |
> |   | ...     |          |             |         |
> +---+---------+----------+-------------+---------+
> 
> .. raw:: pdf
> 
>    PageBreak
> 
> ``FDIR`` to most item types → ``QUEUE``, ``DROP``, ``PASSTHRU``
> ---------------------------------------------------------------
> 
> `FDIR`_ is more complex than any other type; there are several methods to
> emulate its functionality. It is summarized for the most part in the table
> below.
> 
> A few features are intentionally not supported:
> 
> - The ability to configure the matching input set and masks for the entire
>   device; PMDs should take care of it automatically according to flow rules.
> 
> - Returning four or eight bytes of matched data when using flex bytes
>   filtering. Although a specific action could implement it, it conflicts
>   with the much more useful 32 bits tagging on devices that support it.
> 
> - Side effects on RSS processing of the entire device. Flow rules that
>   conflict with the current device configuration should not be
>   allowed. Similarly, device configuration should not be allowed when it
>   affects existing flow rules.
> 
> - Device modes of operation. "none" is unsupported since filtering cannot be
>   disabled as long as a flow rule is present.
> 
> - "MAC VLAN" or "tunnel" perfect matching modes should be automatically
> set
>   according to the created flow rules.
> 
> +----------------------------------------------+
> | FDIR                                         |
> +---------------------------------+------------+
> | Pattern                         | Actions    |
> +===+============+==========+=====+============+
> | 0 | ETH,       | ``spec`` | any | QUEUE,     |
> |   | RAW        +----------+-----+ DROP,      |
> |   |            | ``mask`` | any | PASSTHRU   |
> +---+------------+----------+-----+------------+
> | 1 | IPV4,      | ``spec`` | any | ID         |
> |   | IPV6       +----------+-----+ (optional) |
> |   |            | ``mask`` | any |            |
> +---+------------+----------+-----+            |
> | 2 | TCP,       | ``spec`` | any |            |
> |   | UDP,       +----------+-----+            |
> |   | SCTP       | ``mask`` | any |            |
> +---+------------+----------+-----+            |
> | 3 | VF,        | ``spec`` | any |            |
> |   | PF,        +----------+-----+            |
> |   | SIGNATURE  | ``mask`` | any |            |
> |   | (optional) |          |     |            |
> +---+------------+----------+-----+------------+
> 
> ``HASH``
> ~~~~~~~~
> 
> Hashing configuration is set per rule through the `SIGNATURE`_ item.
> 
> Since it is usually a global device setting, all flow rules created with
> this item may have to share the same specification.
> 
> ``L2_TUNNEL`` to ``VOID`` → ``VXLAN`` (or others)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> All packets are matched. This type alters incoming packets to encapsulate
> them in a chosen tunnel type, optionally redirecting them to a VF as well.
> 
> The destination pool for tag based forwarding can be emulated with other
> flow rules using `DUP`_ as the action.
> 
> +----------------------------------------+
> | L2_TUNNEL                              |
> +---------------------------+------------+
> | Pattern                   | Actions    |
> +===+======+==========+=====+============+
> | 0 | VOID | ``spec`` | N/A | VXLAN,     |
> |   |      |          |     | GENEVE,    |
> |   |      |          |     | ...        |
> |   |      +----------+-----+------------+
> |   |      | ``mask`` | N/A | VF         |
> |   |      |          |     | (optional) |
> +---+------+----------+-----+------------+
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-07  7:14 ` Lu, Wenzhuo
@ 2016-07-07 10:26   ` Adrien Mazarguil
  2016-07-19  8:11     ` Lu, Wenzhuo
  0 siblings, 1 reply; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-07 10:26 UTC (permalink / raw)
  To: Lu, Wenzhuo
  Cc: dev, Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Lu Wenzhuo,

Thanks for your feedback, I'm replying below as well.

On Thu, Jul 07, 2016 at 07:14:18AM +0000, Lu, Wenzhuo wrote:
> Hi Adrien,
> I have some questions, please see inline, thanks.
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Wednesday, July 6, 2016 2:17 AM
> > To: dev@dpdk.org
> > Cc: Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody; Ajit Khaparde;
> > Rahul Lakkireddy; Lu, Wenzhuo; Jan Medala; John Daley; Chen, Jing D; Ananyev,
> > Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> > Guarch, Pablo; Olga Shern
> > Subject: [RFC] Generic flow director/filtering/classification API
> > 
> > 
> > Requirements for a new API:
> > 
> > - Flexible and extensible without causing API/ABI problems for existing
> >   applications.
> > - Should be unambiguous and easy to use.
> > - Support existing filtering features and actions listed in `Filter types`_.
> > - Support packet alteration.
> > - In case of overlapping filters, their priority should be well documented.
> Does that mean we don't guarantee the consistency of priority? The priority can be different on different NICs, so the behavior of the actions can be different. Right?

No, the intent is precisely to define what happens in order to get a
consistent result across different devices, and document cases with
undefined behavior. There must be no room left for interpretation.

For example, the API must describe what happens when two overlapping filters
(e.g. one matching an Ethernet header, another one matching an IP header)
match a given packet at a given priority level.

It is documented in section 4.1.1 (priorities) as "undefined behavior".
Applications remain free to do it and deal with consequences, at least they
know they cannot expect a consistent outcome, unless they use different
priority levels for both rules, see also 4.4.5 (flow rules priority).

> Seems users still need to be aware of some details of the HW? Do we need to add negotiation for the priority?

Priorities as defined in this document may not be directly mappable to HW
capabilities (e.g. HW does not support enough priorities, or that some
corner case make them not work as described), in which case the PMD may
choose to simulate priorities (again 4.4.5), as long as the end result
follows the specification.

So users do not have to be aware of HW details; the PMD does, and must perform
the needed workarounds to suit their expectations. Users may only be
impacted by errors while attempting to create rules that are either
unsupported or would cause them (or existing rules) to diverge from the
spec.

> > Flow rules can have several distinct actions (such as counting,
> > encapsulating, decapsulating before redirecting packets to a particular
> > queue, etc.), instead of relying on several rules to achieve this and having
> > applications deal with hardware implementation details regarding their
> > order.
> I think normally HW doesn't support several actions in one rule. If a rule has several actions, it seems HW has to split it into several rules. The order can still be a problem.

Note that, except for a very small subset of pattern items and actions,
supporting multiple actions for a given rule is not mandatory, and can be
emulated as you said by having to split them into several rules each with
its own priority if possible (see 3.3 "high level design").

Also, a rule "action" as defined in this API can be just about anything, for
example combining a queue redirection with 32-bit tagging. FDIR supports
many cases of what can be described as several actions, see 5.7 "FDIR to
most item types → QUEUE, DROP, PASSTHRU".

If you were thinking about having two queue targets for a given rule, then
I agree with you - that is why a rule cannot have more than a single action
of a given type (see 4.1.5 actions), to avoid such abuse from applications.

Applications must use several pass-through rules with different priority
levels if they want to perform a given action several times on a given
packet. Again, PMDs support is not mandatory as pass-through is optional.

> > ``ETH``
> > ^^^^^^^
> > 
> > Matches an Ethernet header.
> > 
> > - ``dst``: destination MAC.
> > - ``src``: source MAC.
> > - ``type``: EtherType.
> > - ``tags``: number of 802.1Q/ad tags defined.
> > - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> > 
> >  - ``tpid``: Tag protocol identifier.
> >  - ``tci``: Tag control information.
> "ETH" means all the parameters, dst, src, type... need to be matched? The same question for IPv4, IPv6 ...

Yes, it's basically the description of all Ethernet header fields including
VLAN tags (same for other protocols). Please see the linked draft header
file which should make all of this easier to understand:

 https://raw.githubusercontent.com/6WIND/rte_flow/master/rte_flow.h

> > ``UDP``
> > ^^^^^^^
> > 
> > Matches a UDP header.
> > 
> > - ``sport``: source port.
> > - ``dport``: destination port.
> > - ``length``: UDP length.
> > - ``checksum``: UDP checksum.
> Why checksum? Do we need to filter the packets by checksum value?

Well, I've decided to include all protocol header fields for completeness
(so the ABI does not need to be broken later when they become necessary, or
require another pattern item), not that all of them make sense in a pattern.

In this specific case, all PMDs I know of must reject a pattern
specification with a nonzero mask for the checksum field, because none of
them support it.

> > ``VOID`` (action)
> > ^^^^^^^^^^^^^^^^^
> > 
> > Used as a placeholder for convenience. It is ignored and simply discarded by
> > PMDs.
> I don't understand why we need VOID. If it's about the format, why not guarantee it in the rte layer?

I'm not sure to understand your question about rte layer, but this type is
fully managed by the PMD and is not supposed to be translated to a hardware
action.

I think it may come handy in some cases (like the VOID pattern item), so it
is defined just in case. Should be relatively trivial to support.

Applications may find a use for it when they want to statically define
templates for flow rules, when they need room for some reason.

> > Behavior
> > --------
> > 
> > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> >   returned).
> > 
> > - There is no provision for reentrancy/multi-thread safety, although nothing
> >   should prevent different devices from being configured at the same
> >   time. PMDs may protect their control path functions accordingly.
> > 
> > - Stopping the data path (TX/RX) should not be necessary when managing flow
> >   rules. If this cannot be achieved naturally or with workarounds (such as
> >   temporarily replacing the burst function pointers), an appropriate error
> >   code must be returned (``EBUSY``).
> PMDs cannot stop the data path without adding locks. So I think if some rules cannot be applied without stopping RX/TX, the PMD has to return failure,
> or let the APP stop the data path.

Agreed, that is the intent. If the PMD cannot touch flow rules for some
reason even after trying really hard, then it just returns EBUSY.

Perhaps we should write down that applications may get a different outcome
after stopping the data path if they get EBUSY?

> > - PMDs, not applications, are responsible for maintaining flow rules
> >   configuration when stopping and restarting a port or performing other
> >   actions which may affect them. They can only be destroyed explicitly.
> I don't understand "They can only be destroyed explicitly."

This part says that as long as an application has not called
rte_flow_destroy() on a flow rule, it never disappears, whatever happens to
the port (stopped, restarted). The application is not responsible for
re-creating rules after that.

Note that according to the specification, this may translate to not being
able to stop a port as long as a flow rule is present, depending on how nice
the PMD intends to be with applications. Implementation can be done in small
steps with minimal amount of code on the PMD side.

> If a new rule conflicts with an old one, what should we do? Return failure?

That should not happen. If say 100 rules have been created with various
priorities and the port is happily running with them, stopping the port may
require the PMD to destroy them, it then has to re-create all 100 of them
exactly as they were automatically when restarting the port.

If re-creating them is not possible for some reason, the port cannot be
restarted as long as rules that cannot be added back haven't been destroyed
by the application. Frankly, this should not happen.

To manage this case, I suggest preventing applications from doing things
that conflict with existing flow rules while the port is stopped (just like
when it is not stopped, as described in 5.7 "FDIR to most item types").

> > ``ANY`` pattern item
> > ~~~~~~~~~~~~~~~~~~~~
> > 
> > This pattern item stands for anything, which can be difficult to translate
> > to something hardware would understand, particularly if followed by more
> > specific types.
> > 
> > Consider the following pattern:
> > 
> > +---+--------------------------------+
> > | 0 | ETHER                          |
> > +---+--------------------------------+
> > | 1 | ANY (``min`` = 1, ``max`` = 1) |
> > +---+--------------------------------+
> > | 2 | TCP                            |
> > +---+--------------------------------+
> > 
> > Knowing that TCP does not make sense with something other than IPv4 and IPv6
> > as L3, such a pattern may be translated to two flow rules instead:
> > 
> > +---+--------------------+
> > | 0 | ETHER              |
> > +---+--------------------+
> > | 1 | IPV4 (zeroed mask) |
> > +---+--------------------+
> > | 2 | TCP                |
> > +---+--------------------+
> > 
> > +---+--------------------+
> > | 0 | ETHER              |
> > +---+--------------------+
> > | 1 | IPV6 (zeroed mask) |
> > +---+--------------------+
> > | 2 | TCP                |
> > +---+--------------------+
> > 
> > Note that as soon as a ANY rule covers several layers, this approach may
> > yield a large number of hidden flow rules. It is thus suggested to only
> > support the most common scenarios (anything as L2 and/or L3).
> I think "any" may make things confusing.  How about if the NIC doesn't support IPv6? Should we return fail for this rule?

In a sense you are right, ANY relies on HW capabilities so you cannot know
that it won't match unsupported protocols. The above example would be
somewhat useless for a conscious application which should really have
created two separate flow rules (and gotten an error on the IPv6 one).

So an ANY flow rule only able to match v4 packets won't return an error.

ANY can be useful to match outer packets when only a tunnel header and the
inner packet are meaningful to the application. HW that does not recognize
the outer packet is not able to recognize the inner one anyway.

This section only says that PMDs should do their best to make HW match what
they can when faced with ANY.

Also once again, ANY support is not mandatory.

> > Flow rules priority
> > ~~~~~~~~~~~~~~~~~~~
> > 
> > While it would naturally make sense, flow rules cannot be assumed to be
> > processed by hardware in the same order as their creation for several
> > reasons:
> > 
> > - They may be managed internally as a tree or a hash table instead of a
> >   list.
> > - Removing a flow rule before adding another one can either put the new rule
> >   at the end of the list or reuse a freed entry.
> > - Duplication may occur when packets are matched by several rules.
> > 
> > For overlapping rules (particularly in order to use the `PASSTHRU`_ action)
> > predictable behavior is only guaranteed by using different priority levels.
> > 
> > Priority levels are not necessarily implemented in hardware, or may be
> > severely limited (e.g. a single priority bit).
> > 
> > For these reasons, priority levels may be implemented purely in software by
> > PMDs.
> > 
> > - For devices expecting flow rules to be added in the correct order, PMDs
> >   may destroy and re-create existing rules after adding a new one with
> >   a higher priority.
> > 
> > - A configurable number of dummy or empty rules can be created at
> >   initialization time to save high priority slots for later.
> > 
> > - In order to save priority levels, PMDs may evaluate whether rules are
> >   likely to collide and adjust their priority accordingly.
> If there are 3 rules, r1, r2 and r3, whose priority is r1 > r2 > r3, but the PMD can only support r1 > r3 > r2, or doesn't support r2, should the PMD apply r1 and r3 or not support them at all?

Remember that the API lets applications create only one rule at a time. If
all 3 rules are not supported together but individually are, the answer
depends on what the application does:

1. r1 OK, r2 FAIL => application chooses to stop here, thus only r1 works as
  expected (may roll back and remove r1 as a result).

2. r1 OK, r2 FAIL, r3 OK => application chooses to ignore the fact r2 failed
  and added r3 anyway, so it should end up with r1 > r3.

Applications should do as described in 1; they need to check for errors if
they want consistency.
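
For illustration, the pattern described in 1 could look like the following
sketch (the helper name and the rollback policy are illustrative only, and
the patterns/actions are assumed to have been built beforehand):

  static int
  create_r1_r2_r3(uint8_t port_id,
                  const struct rte_flow_pattern *pattern[3],
                  const struct rte_flow_actions *actions[3],
                  struct rte_flow *flows[3])
  {
      unsigned int i;

      for (i = 0; i != 3; ++i) {
          flows[i] = rte_flow_create(port_id, pattern[i], actions[i]);
          if (flows[i] != NULL)
              continue;
          /* Roll back already-created rules to keep a consistent state. */
          while (i--)
              rte_flow_destroy(port_id, flows[i]);
          return -rte_errno;
      }
      return 0;
  }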

This document describes only the basic functions, but may be extended later
with methods to add several flow rules at once, so rules that depend on
others can be added together and a single failure is returned without the
need for a rollback at the application level.

> A generic question: is the parsing supposed to be done by the rte layer or the PMD?

Actually, a bit of both. EAL will certainly at least provide helpers to
assist PMDs. This specification defines only the public-facing API for now,
but our goal is really to have something that is not too difficult to
implement both for applications and PMDs.

These helpers can be defined later with the first implementation.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [dpdk-dev] [RFC] Generic flow director/filtering/classification API
  2016-07-05 18:16 Adrien Mazarguil
@ 2016-07-07  7:14 ` Lu, Wenzhuo
  2016-07-07 10:26   ` Adrien Mazarguil
  2016-07-07 23:15 ` Chandran, Sugesh
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Lu, Wenzhuo @ 2016-07-07  7:14 UTC (permalink / raw)
  To: Adrien Mazarguil, dev
  Cc: Thomas Monjalon, Zhang, Helin, Wu, Jingjing, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Jan Medala, John Daley, Chen,
	Jing D, Ananyev, Konstantin, Matej Vido, Alejandro Lucero,
	Sony Chacko, Jerin Jacob, De Lara Guarch, Pablo, Olga Shern

Hi Adrien,
I have some questions, please see inline, thanks.

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, July 6, 2016 2:17 AM
> To: dev@dpdk.org
> Cc: Thomas Monjalon; Zhang, Helin; Wu, Jingjing; Rasesh Mody; Ajit Khaparde;
> Rahul Lakkireddy; Lu, Wenzhuo; Jan Medala; John Daley; Chen, Jing D; Ananyev,
> Konstantin; Matej Vido; Alejandro Lucero; Sony Chacko; Jerin Jacob; De Lara
> Guarch, Pablo; Olga Shern
> Subject: [RFC] Generic flow director/filtering/classification API
> 
> 
> Requirements for a new API:
> 
> - Flexible and extensible without causing API/ABI problems for existing
>   applications.
> - Should be unambiguous and easy to use.
> - Support existing filtering features and actions listed in `Filter types`_.
> - Support packet alteration.
> - In case of overlapping filters, their priority should be well documented.
Does that mean we don't guarantee the consistency of priority? The priority can be different on different NICs, so the behavior of the actions can be different. Right?
Seems users still need to be aware of some details of the HW? Do we need to add negotiation for the priority?

> 
> Flow rules can have several distinct actions (such as counting,
> encapsulating, decapsulating before redirecting packets to a particular
> queue, etc.), instead of relying on several rules to achieve this and having
> applications deal with hardware implementation details regarding their
> order.
I think normally HW doesn't support several actions in one rule. If a rule has several actions, it seems HW has to split it into several rules. The order can still be a problem.

> 
> ``ETH``
> ^^^^^^^
> 
> Matches an Ethernet header.
> 
> - ``dst``: destination MAC.
> - ``src``: source MAC.
> - ``type``: EtherType.
> - ``tags``: number of 802.1Q/ad tags defined.
> - ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:
> 
>  - ``tpid``: Tag protocol identifier.
>  - ``tci``: Tag control information.
"ETH" means all the parameters, dst, src, type... need to be matched? The same question for IPv4, IPv6 ...

> 
> ``UDP``
> ^^^^^^^
> 
> Matches a UDP header.
> 
> - ``sport``: source port.
> - ``dport``: destination port.
> - ``length``: UDP length.
> - ``checksum``: UDP checksum.
Why checksum? Do we need to filter the packets by checksum value?

> 
> ``VOID`` (action)
> ^^^^^^^^^^^^^^^^^
> 
> Used as a placeholder for convenience. It is ignored and simply discarded by
> PMDs.
I don't understand why we need VOID. If it's about the format, why not guarantee it in the rte layer?

> 
> Behavior
> --------
> 
> - API operations are synchronous and blocking (``EAGAIN`` cannot be
>   returned).
> 
> - There is no provision for reentrancy/multi-thread safety, although nothing
>   should prevent different devices from being configured at the same
>   time. PMDs may protect their control path functions accordingly.
> 
> - Stopping the data path (TX/RX) should not be necessary when managing flow
>   rules. If this cannot be achieved naturally or with workarounds (such as
>   temporarily replacing the burst function pointers), an appropriate error
>   code must be returned (``EBUSY``).
PMDs cannot stop the data path without adding locks. So I think if some rules cannot be applied without stopping RX/TX, the PMD has to return failure,
or let the APP stop the data path.

> 
> - PMDs, not applications, are responsible for maintaining flow rules
>   configuration when stopping and restarting a port or performing other
>   actions which may affect them. They can only be destroyed explicitly.
I don't understand "They can only be destroyed explicitly." If a new rule conflicts with an old one, what should we do? Return failure?

> 
> ``ANY`` pattern item
> ~~~~~~~~~~~~~~~~~~~~
> 
> This pattern item stands for anything, which can be difficult to translate
> to something hardware would understand, particularly if followed by more
> specific types.
> 
> Consider the following pattern:
> 
> +---+--------------------------------+
> | 0 | ETHER                          |
> +---+--------------------------------+
> | 1 | ANY (``min`` = 1, ``max`` = 1) |
> +---+--------------------------------+
> | 2 | TCP                            |
> +---+--------------------------------+
> 
> Knowing that TCP does not make sense with something other than IPv4 and IPv6
> as L3, such a pattern may be translated to two flow rules instead:
> 
> +---+--------------------+
> | 0 | ETHER              |
> +---+--------------------+
> | 1 | IPV4 (zeroed mask) |
> +---+--------------------+
> | 2 | TCP                |
> +---+--------------------+
> 
> +---+--------------------+
> | 0 | ETHER              |
> +---+--------------------+
> | 1 | IPV6 (zeroed mask) |
> +---+--------------------+
> | 2 | TCP                |
> +---+--------------------+
> 
> Note that as soon as a ANY rule covers several layers, this approach may
> yield a large number of hidden flow rules. It is thus suggested to only
> support the most common scenarios (anything as L2 and/or L3).
I think "any" may make things confusing.  How about if the NIC doesn't support IPv6? Should we return fail for this rule?

> 
> Flow rules priority
> ~~~~~~~~~~~~~~~~~~~
> 
> While it would naturally make sense, flow rules cannot be assumed to be
> processed by hardware in the same order as their creation for several
> reasons:
> 
> - They may be managed internally as a tree or a hash table instead of a
>   list.
> - Removing a flow rule before adding another one can either put the new rule
>   at the end of the list or reuse a freed entry.
> - Duplication may occur when packets are matched by several rules.
> 
> For overlapping rules (particularly in order to use the `PASSTHRU`_ action)
> predictable behavior is only guaranteed by using different priority levels.
> 
> Priority levels are not necessarily implemented in hardware, or may be
> severely limited (e.g. a single priority bit).
> 
> For these reasons, priority levels may be implemented purely in software by
> PMDs.
> 
> - For devices expecting flow rules to be added in the correct order, PMDs
>   may destroy and re-create existing rules after adding a new one with
>   a higher priority.
> 
> - A configurable number of dummy or empty rules can be created at
>   initialization time to save high priority slots for later.
> 
> - In order to save priority levels, PMDs may evaluate whether rules are
>   likely to collide and adjust their priority accordingly.
If there are 3 rules, r1, r2 and r3, whose priority is r1 > r2 > r3, but the PMD can only support r1 > r3 > r2, or doesn't support r2, should the PMD apply r1 and r3 or not support them at all?

A generic question: is the parsing supposed to be done by the rte layer or the PMD?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [dpdk-dev] [RFC] Generic flow director/filtering/classification API
@ 2016-07-05 18:16 Adrien Mazarguil
  2016-07-07  7:14 ` Lu, Wenzhuo
                   ` (4 more replies)
  0 siblings, 5 replies; 54+ messages in thread
From: Adrien Mazarguil @ 2016-07-05 18:16 UTC (permalink / raw)
  To: dev
  Cc: Thomas Monjalon, Helin Zhang, Jingjing Wu, Rasesh Mody,
	Ajit Khaparde, Rahul Lakkireddy, Wenzhuo Lu, Jan Medala,
	John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
	Alejandro Lucero, Sony Chacko, Jerin Jacob, Pablo de Lara,
	Olga Shern

Hi All,

First, forgive me for this large message, I know our mailboxes already
suffer quite a bit from the amount of traffic on this ML.

This is not exactly yet another thread about how flow director should be
extended, rather about a brand new API to handle filtering and
classification for incoming packets in the most PMD-generic and
application-friendly fashion we can come up with. Reasons described below.

I think this topic is important enough to include both the users of this API
as well as PMD maintainers. So far I have CC'ed librte_ether (especially
rte_eth_ctrl.h contributors), testpmd and PMD maintainers (with and without
a .filter_ctrl implementation), but if you know application maintainers
other than testpmd who use FDIR or might be interested in this discussion,
feel free to add them.

The issues we found with the current approach are already summarized in the
following document, but here is a quick summary for TL;DR folks:

- PMDs do not expose a common set of filter types and even when they do,
  their behavior more or less differs.

- Applications need to determine and adapt to device-specific limitations
  and quirks on their own, without help from PMDs.

- Writing an application that creates flow rules targeting all devices
  supported by DPDK is thus difficult, if not impossible.

- The current API has too many unspecified areas (particularly regarding
  side effects of flow rules) that make PMD implementation tricky.

This RFC API handles everything currently supported by .filter_ctrl, the
idea being to reimplement all of these to make them fully usable by
applications in a more generic and well defined fashion. It has a very small
set of mandatory features and an easy method to let applications probe for
supported capabilities.

The only downside is more work for the software control side of PMDs because
they have to adapt to the API instead of the reverse. I think helpers can be
added to EAL to assist with this.

HTML version:

 https://rawgit.com/6WIND/rte_flow/master/rte_flow.html

PDF version:

 https://rawgit.com/6WIND/rte_flow/master/rte_flow.pdf

Related draft header file (for reference while reading the specification):

 https://raw.githubusercontent.com/6WIND/rte_flow/master/rte_flow.h

Git tree for completeness (latest .rst version can be retrieved from here):

 https://github.com/6WIND/rte_flow

What follows is the ReST source of the above, for inline comments and
discussion. I intend to update that specification accordingly.

========================
Generic filter interface
========================

.. footer::

   v0.6

.. contents::
.. sectnum::
.. raw:: pdf

   PageBreak

Overview
========

DPDK provides several competing interfaces added over time to perform packet
matching and related actions such as filtering and classification.

They must be extended to implement the features supported by newer devices
in order to expose them to applications, however the current design has
several drawbacks:

- Complicated filter combinations which have not been hard-coded cannot be
  expressed.
- Prone to API/ABI breakage when new features must be added to an existing
  filter type, which frequently happens.

From an application point of view:

- Having disparate interfaces, all optional and lacking in features does not
  make this API easy to use.
- Seemingly arbitrary built-in limitations of filter types based on the
  device they were initially designed for.
- Undefined relationship between different filter types.
- High complexity, considerable undocumented and/or undefined behavior.

Considering the growing number of devices supported by DPDK, adding a new
filter type each time a new feature must be implemented is not sustainable
in the long term. Applications not written to target a specific device
cannot really benefit from such an API.

For these reasons, this document defines an extensible unified API that
encompasses and supersedes these legacy filter types.

.. raw:: pdf

   PageBreak

Current API
===========

Rationale
---------

Several competing (and mostly overlapping) filtering APIs are present in
DPDK due to its nature as a thin layer between hardware and
software.

Each subsequent interface has been added to better match the capabilities
and limitations of the latest supported device, which usually happened to
need an incompatible configuration approach. Because of this, many ended up
device-centric and not usable by applications that were not written for that
particular device.

This document is not the first attempt to address this proliferation issue,
in fact a lot of work has already been done both to create a more generic
interface while somewhat keeping compatibility with legacy ones through a
common call interface (``rte_eth_dev_filter_ctrl()`` with the
``.filter_ctrl`` PMD callback in ``rte_ethdev.h``).

Today, these previously incompatible interfaces are known as filter types
(``RTE_ETH_FILTER_*`` from ``enum rte_filter_type`` in ``rte_eth_ctrl.h``).

However, while trivial to extend with new types, this only shifted the
underlying problem: applications still need to be written for one kind of
filter type which, as described in the following sections, is not
necessarily implemented by all PMDs that support filtering.

.. raw:: pdf

   PageBreak

Filter types
------------

This section summarizes the capabilities of each filter type.

Although the following list is exhaustive, the description of individual
types may contain inaccuracies due to the lack of documentation or usage
examples.

Note: names are prefixed with ``RTE_ETH_FILTER_``.

``MACVLAN``
~~~~~~~~~~~

Matching:

- L2 source/destination addresses.
- Optional 802.1Q VLAN ID.
- Masking individual fields on a rule basis is not supported.

Action:

- Packets are redirected either to a given VF device using its ID or to the
  PF.

``ETHERTYPE``
~~~~~~~~~~~~~

Matching:

- L2 source/destination addresses (optional).
- Ethertype (no VLAN ID?).
- Masking individual fields on a rule basis is not supported.

Action:

- Receive packets on a given queue.
- Drop packets.

``FLEXIBLE``
~~~~~~~~~~~~

Matching:

- At most 128 consecutive bytes anywhere in packets.
- Masking is supported with byte granularity.
- Priorities are supported (relative to this filter type, undefined
  otherwise).

Action:

- Receive packets on a given queue.

``SYN``
~~~~~~~

Matching:

- TCP SYN packets only.
- One high priority bit can be set to give the highest possible priority to
  this type when other filters with different types are configured.

Action:

- Receive packets on a given queue.

``NTUPLE``
~~~~~~~~~~

Matching:

- Source/destination IPv4 addresses (optional in 2-tuple mode).
- Source/destination TCP/UDP port (mandatory in 2 and 5-tuple modes).
- L4 protocol (2 and 5-tuple modes).
- Masking individual fields is supported.
- TCP flags.
- Up to 7 levels of priority relative to this filter type, undefined
  otherwise.
- No IPv6.

Action:

- Receive packets on a given queue.

``TUNNEL``
~~~~~~~~~~

Matching:

- Outer L2 source/destination addresses.
- Inner L2 source/destination addresses.
- Inner VLAN ID.
- IPv4/IPv6 source (destination?) address.
- Tunnel type to match (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE, 802.1BR
  E-Tag).
- Tenant ID for tunneling protocols that have one.
- Any combination of the above can be specified.
- Masking individual fields on a rule basis is not supported.

Action:

- Receive packets on a given queue.

.. raw:: pdf

   PageBreak

``FDIR``
~~~~~~~~

Queries:

- Device capabilities and limitations.
- Device statistics about configured filters (resource usage, collisions).
- Device configuration (matching input set and masks).

Matching:

- Device mode of operation: none (to disable filtering), signature
  (hash-based dispatching from masked fields) or perfect (either MAC VLAN or
  tunnel).
- L2 Ethertype.
- Outer L2 destination address (MAC VLAN mode).
- Inner L2 destination address, tunnel type (NVGRE, VXLAN) and tunnel ID
  (tunnel mode).
- IPv4 source/destination addresses, ToS, TTL and protocol fields.
- IPv6 source/destination addresses, TC, protocol and hop limits fields.
- UDP source/destination IPv4/IPv6 and ports.
- TCP source/destination IPv4/IPv6 and ports.
- SCTP source/destination IPv4/IPv6, ports and verification tag field.
- Note: only one protocol type at a time (either only L2 Ethertype, basic
  IPv6, IPv4+UDP, IPv4+TCP and so on).
- VLAN TCI (extended API).
- At most 16 bytes to match in payload (extended API). A global device
  look-up table specifies for each possible protocol layer (unknown, raw,
  L2, L3, L4) the offset to use for each byte (they do not need to be
  contiguous) and the related bitmask.
- Whether packet is addressed to PF or VF, in that case its ID can be
  matched as well (extended API).
- Masking most of the above fields is supported, but simultaneously affects
  all filters configured on a device.
- Input set can be modified in a similar fashion for a given device to
  ignore individual fields of filters (i.e. do not match the destination
  address in an IPv4 filter, refer to **RTE_ETH_INPUT_SET_**
  macros). Configuring this also affects RSS processing on **i40e**.
- Filters can also provide 32 bits of arbitrary data to return as part of
  matched packets.

Action:

- **RTE_ETH_FDIR_ACCEPT**: receive (accept) packet on a given queue.
- **RTE_ETH_FDIR_REJECT**: drop packet immediately.
- **RTE_ETH_FDIR_PASSTHRU**: similar to accept for the last filter in list,
  otherwise process it with subsequent filters.
- For accepted packets and if requested by filter, either 32 bits of
  arbitrary data and four bytes of matched payload (only in case of flex
  bytes matching), or eight bytes of matched payload (flex also) are added
  to meta data.

.. raw:: pdf

   PageBreak

``HASH``
~~~~~~~~

Not an actual filter type. Provides and retrieves the global device
configuration (per port or entire NIC) for hash functions and their
properties.

Hash function selection: "default" (keep current), XOR or Toeplitz.

This function can be configured per flow type (**RTE_ETH_FLOW_**
definitions), supported types are:

- Unknown.
- Raw.
- Fragmented or non-fragmented IPv4.
- Non-fragmented IPv4 with L4 (TCP, UDP, SCTP or other).
- Fragmented or non-fragmented IPv6.
- Non-fragmented IPv6 with L4 (TCP, UDP, SCTP or other).
- L2 payload.
- IPv6 with extensions.
- IPv6 with L4 (TCP, UDP) and extensions.

``L2_TUNNEL``
~~~~~~~~~~~~~

Matching:

- All packets received on a given port.

Action:

- Add tunnel encapsulation (VXLAN, GENEVE, TEREDO, NVGRE, IP over GRE,
  802.1BR E-Tag) using the provided Ethertype and tunnel ID (only E-Tag
  is implemented at the moment).
- VF ID to use for tag insertion (currently unused).
- Destination pool for tag based forwarding (pools are IDs that can be
  assigned to ports; duplication occurs if the same ID is shared by several
  ports of the same NIC).

.. raw:: pdf

   PageBreak

Driver support
--------------

======== ======= ========= ======== === ====== ====== ==== ==== =========
Driver   MACVLAN ETHERTYPE FLEXIBLE SYN NTUPLE TUNNEL FDIR HASH L2_TUNNEL
======== ======= ========= ======== === ====== ====== ==== ==== =========
bnx2x
cxgbe
e1000            yes       yes      yes yes
ena
enic                                                  yes
fm10k
i40e     yes     yes                           yes    yes  yes
ixgbe            yes                yes yes           yes       yes
mlx4
mlx5                                                  yes
szedata2
======== ======= ========= ======== === ====== ====== ==== ==== =========

Flow director
-------------

Flow director (FDIR) is the name of the most capable filter type, which
covers most features offered by others. As such, it is the most widespread
in PMDs that support filtering (i.e. all of them besides **e1000**).

It is also the only type that allows an arbitrary 32 bits value provided by
applications to be attached to a filter and returned with matching packets
instead of relying on the destination queue to recognize flows.

Unfortunately, even FDIR requires applications to be aware of low-level
capabilities and limitations (most of which come directly from **ixgbe** and
**i40e**):

- Bitmasks are set globally per device (port?), not per filter.
- Configuration state is not expected to be saved by the driver, and
  stopping/restarting a port requires the application to perform it again
  (API documentation is also unclear about this).
- Monolithic approach with ABI issues as soon as a new kind of flow or
  combination needs to be supported.
- Cryptic global statistics/counters.
- Unclear about how priorities are managed; filters seem to be arranged as a
  linked list in hardware (possibly related to configuration order).

Packet alteration
-----------------

One interesting feature is that the L2 tunnel filter type implements the
ability to alter incoming packets through a filter (in this case to
encapsulate them), thus the **mlx5** flow encap/decap features are not a
foreign concept.

.. raw:: pdf

   PageBreak

Proposed API
============

Terminology
-----------

- **Filtering API**: overall framework affecting the fate of selected
  packets, covers everything described in this document.
- **Matching pattern**: properties to look for in received packets, a
  combination of any number of items.
- **Pattern item**: part of a pattern that either matches packet data
  (protocol header, payload or derived information), or specifies properties
  of the pattern itself.
- **Actions**: what needs to be done when a packet matches a pattern.
- **Flow rule**: this is the result of combining a *matching pattern* with
  *actions*.
- **Filter rule**: a less generic term than *flow rule*, can otherwise be
  used interchangeably.
- **Hit**: a flow rule is said to be *hit* when processing a matching
  packet.

Requirements
------------

As described in the previous section, there is a growing need for a common
method to configure filtering and related actions in a hardware independent
fashion.

The filtering API should not disallow any filter combination by design and
must remain as simple as possible to use. It can simply be defined as a
method to perform one or several actions on selected packets.

PMDs are aware of the capabilities of the device they manage and should be
responsible for preventing unsupported or conflicting combinations.

This approach is fundamentally different as it places most of the burden on
the software side of the PMD instead of having device capabilities directly
mapped to API functions, then expecting applications to work around ensuing
compatibility issues.

Requirements for a new API:

- Flexible and extensible without causing API/ABI problems for existing
  applications.
- Should be unambiguous and easy to use.
- Support existing filtering features and actions listed in `Filter types`_.
- Support packet alteration.
- In case of overlapping filters, their priority should be well documented.
- Support filter queries (for example to retrieve counters).

.. raw:: pdf

   PageBreak

High level design
-----------------

The chosen approach to make filtering as generic as possible is to express
matching patterns through lists of items instead of the flat structures used
in DPDK today, enabling combinations that are not predefined and thus more
versatile.

Flow rules can have several distinct actions (such as counting,
encapsulating, decapsulating before redirecting packets to a particular
queue, etc.), instead of relying on several rules to achieve this and having
applications deal with hardware implementation details regarding their
order.

Support for different priority levels on a rule basis is provided, for
example in order to force a more specific rule to come before a more generic
one for packets matched by both. However, hardware support for more than a
single priority level cannot be guaranteed. When supported, the number of
available priority levels is usually low, which is why they can also be
implemented in software by PMDs (e.g. to simulate missing priority levels by
reordering rules).

In order to remain as hardware agnostic as possible, by default all rules
are considered to have the same priority, which means that the order between
overlapping rules (when a packet is matched by several filters) is
undefined; packet duplication may even occur as a result.

PMDs may refuse to create overlapping rules at a given priority level when
they can be detected (e.g. if a pattern matches an existing filter).

Thus predictable results for a given priority level can only be achieved
with non-overlapping rules, using perfect matching on all protocol layers.

Support for multiple actions per rule may be implemented internally on top
of non-default hardware priorities, as a result both features may not be
simultaneously available to applications.

Considering that allowed pattern/actions combinations cannot be known in
advance and would result in an unpractically large number of capabilities to
expose, a method is provided to validate a given rule from the current
device configuration state without actually adding it (akin to a "dry run"
mode).

This enables applications to check if the rule types they need are supported
at initialization time, before starting their data path. This method can be
used anytime, its only requirement being that the resources needed by a rule
must exist (e.g. a target RX queue must be configured first).

Each defined rule is associated with an opaque handle managed by the PMD,
applications are responsible for keeping it. These can be used for queries
and rules management, such as retrieving counters or other data and
destroying them.

Handles must be destroyed before releasing associated resources such as
queues.

Integration
-----------

To avoid ABI breakage, this new interface will be implemented through the
existing filtering control framework (``rte_eth_dev_filter_ctrl()``) using
**RTE_ETH_FILTER_GENERIC** as a new filter type.

However a public front-end API described in `Rules management`_ will
be added as the preferred method to use it.

Once discussions with the community have converged to a definite API, legacy
filter types should be deprecated and a deadline defined to remove their
support entirely.

PMDs will have to be gradually converted to **RTE_ETH_FILTER_GENERIC** or
drop filtering support entirely. Less maintained PMDs for older hardware may
lose support at this point.

The notion of filter type will then be deprecated and subsequently dropped
to avoid confusion between both frameworks.

Implementation details
======================

Flow rule
---------

A flow rule is the combination of a matching pattern with a list of actions,
and is the basis of this API.

Priorities
~~~~~~~~~~

A priority can be assigned to a matching pattern.

The default priority level is 0 and is also the highest. Support for more
than a single priority level in hardware is not guaranteed.

If a packet is matched by several filters at a given priority level, the
outcome is undefined. It can take any path and can even be duplicated.

Matching pattern
~~~~~~~~~~~~~~~~

A matching pattern comprises any number of items of various types.

Items are arranged in a list to form a matching pattern for packets. They
fall in two categories:

- Protocol matching (ANY, RAW, ETH, IPV4, IPV6, ICMP, UDP, TCP, VXLAN and so
  on), usually associated with a specification structure. These must be
  stacked in the same order as the protocol layers to match, starting from
  L2.

- Affecting how the pattern is processed (END, VOID, INVERT, PF, VF,
  SIGNATURE and so on), often without a specification structure. Since they
  are meta data that does not match packet contents, these can be specified
  anywhere within item lists without affecting the protocol matching items.

Most item specifications can be optionally paired with a mask to narrow the
specific fields or bits to be matched.

- Items are defined with ``struct rte_flow_item``.
- Patterns are defined with ``struct rte_flow_pattern``.

Example of an item specification matching an Ethernet header:

+--------------------------------------------+
| Ethernet                                   |
+==========+=========+=======================+
| ``spec`` | ``src`` | ``00:01:02:03:04:05`` |
|          +---------+-----------------------+
|          | ``dst`` | ``00:2a:66:00:01:02`` |
+----------+---------+-----------------------+
| ``mask`` | ``src`` | ``00:ff:ff:ff:00:00`` |
|          +---------+-----------------------+
|          | ``dst`` | ``00:00:00:00:00:ff`` |
+----------+---------+-----------------------+

Non-masked bits stand for any value; Ethernet headers with the following
properties are thus matched:

- ``src``: ``??:01:02:03:??:??``
- ``dst``: ``??:??:??:??:??:02``

Except for meta types that do not need one, ``spec`` must be a valid pointer
to a structure of the related item type. A ``mask`` of the same type can be
provided to tell which bits in ``spec`` are to be matched.

A mask is normally only needed for ``spec`` fields matching packet data,
ignored otherwise. See individual item types for more information.

A ``NULL`` mask pointer is allowed and is similar to matching with a full
mask (all ones) on the ``spec`` fields supported by hardware; the remaining
fields are ignored (all zeroes), and there is thus no error checking for
unsupported fields.
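
As a purely illustrative sketch, the above example could be expressed as
follows in C. The ``struct rte_flow_item_eth`` layout and the
``RTE_FLOW_ITEM_TYPE_*`` naming scheme are assumptions based on this
document, not final definitions:

::

 /* Hypothetical sketch -- struct rte_flow_item_eth layout assumed. */
 static const struct rte_flow_item_eth eth_spec = {
         .src = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05 },
         .dst = { 0x00, 0x2a, 0x66, 0x00, 0x01, 0x02 },
 };
 static const struct rte_flow_item_eth eth_mask = {
         .src = { 0x00, 0xff, 0xff, 0xff, 0x00, 0x00 },
         .dst = { 0x00, 0x00, 0x00, 0x00, 0x00, 0xff },
 };
 static const struct rte_flow_item eth_item = {
         .type = RTE_FLOW_ITEM_TYPE_ETH,
         .spec = &eth_spec,
         .mask = &eth_mask,
 };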

Matching pattern items for packet data must be naturally stacked (ordered
from lowest to highest protocol layer), as in the following examples:

+--------------+
| TCPv4 as L4  |
+===+==========+
| 0 | Ethernet |
+---+----------+
| 1 | IPv4     |
+---+----------+
| 2 | TCP      |
+---+----------+

+----------------+
| TCPv6 in VXLAN |
+===+============+
| 0 | Ethernet   |
+---+------------+
| 1 | IPv4       |
+---+------------+
| 2 | UDP        |
+---+------------+
| 3 | VXLAN      |
+---+------------+
| 4 | Ethernet   |
+---+------------+
| 5 | IPv6       |
+---+------------+
| 6 | TCP        |
+---+------------+

+-----------------------------+
| TCPv4 as L4 with meta items |
+===+=========================+
| 0 | VOID                    |
+---+-------------------------+
| 1 | Ethernet                |
+---+-------------------------+
| 2 | VOID                    |
+---+-------------------------+
| 3 | IPv4                    |
+---+-------------------------+
| 4 | TCP                     |
+---+-------------------------+
| 5 | VOID                    |
+---+-------------------------+
| 6 | VOID                    |
+---+-------------------------+

The above example shows how meta items do not affect packet data matching
items, as long as those remain stacked properly. The resulting matching
pattern is identical to "TCPv4 as L4".

+----------------+
| UDPv6 anywhere |
+===+============+
| 0 | IPv6       |
+---+------------+
| 1 | UDP        |
+---+------------+

If supported by the PMD, omitting one or several protocol layers at the
bottom of the stack as in the above example (missing an Ethernet
specification) enables hardware to look anywhere in packets.

It is unspecified whether the payload of supported encapsulations
(e.g. VXLAN inner packet) is matched by such a pattern, which may apply to
inner, outer or both packets.

+---------------------+
| Invalid, missing L3 |
+===+=================+
| 0 | Ethernet        |
+---+-----------------+
| 1 | UDP             |
+---+-----------------+

The above pattern is invalid due to a missing L3 specification between L2
and L4; omitting protocol layers is only allowed at the bottom and at the
top of the stack.
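
As an illustration, a "TCPv4 as L4" item list might be assembled as in the
following sketch. It assumes items are stored in an ``END``-terminated
array, that ``struct rte_flow_item`` holds ``type``, ``spec`` and ``mask``
in that order, and that zeroed specs paired with all-zero masks stand for
"match any" headers:

::

 /* Sketch: items stacked from lowest to highest protocol layer. */
 static const struct rte_flow_item_eth  any_eth;  /* zero-initialized */
 static const struct rte_flow_item_ipv4 any_ipv4;
 static const struct rte_flow_item_tcp  any_tcp;

 static const struct rte_flow_item tcpv4_items[] = {
         { RTE_FLOW_ITEM_TYPE_ETH,  &any_eth,  &any_eth },
         { RTE_FLOW_ITEM_TYPE_IPV4, &any_ipv4, &any_ipv4 },
         { RTE_FLOW_ITEM_TYPE_TCP,  &any_tcp,  &any_tcp },
         { RTE_FLOW_ITEM_TYPE_END,  NULL,      NULL },
 };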

Meta item types
~~~~~~~~~~~~~~~

These do not match packet data but affect how the pattern is processed, most
of them do not need a specification structure. This particularity allows
them to be specified anywhere without affecting other item types.

``END``
^^^^^^^

End marker for item lists. Prevents further processing of items, thereby
ending the pattern.

- Its numeric value is **0** for convenience.
- PMD support is mandatory.
- Both ``spec`` and ``mask`` are ignored.

+--------------------+
| END                |
+==========+=========+
| ``spec`` | ignored |
+----------+---------+
| ``mask`` | ignored |
+----------+---------+

``VOID``
^^^^^^^^

Used as a placeholder for convenience. It is ignored and simply discarded by
PMDs.

- PMD support is mandatory.
- Both ``spec`` and ``mask`` are ignored.

+--------------------+
| VOID               |
+==========+=========+
| ``spec`` | ignored |
+----------+---------+
| ``mask`` | ignored |
+----------+---------+

One usage example for this type is generating rules that share a common
prefix quickly without reallocating memory, simply by updating item types
(see the sketch after this table):

+------------------------+
| TCP, UDP or ICMP as L4 |
+===+====================+
| 0 | Ethernet           |
+---+--------------------+
| 1 | IPv4               |
+---+------+------+------+
| 2 | UDP  | VOID | VOID |
+---+------+------+------+
| 3 | VOID | TCP  | VOID |
+---+------+------+------+
| 4 | VOID | VOID | ICMP |
+---+------+------+------+
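
A minimal sketch of this technique, assuming the same hypothetical layout
as in previous examples and a non-const ``items[]`` array holding the
pattern above (specs omitted for brevity):

::

 /* Sketch: switch the matched L4 protocol from UDP to TCP without
  * reallocating the shared Ethernet/IPv4 prefix. */
 items[2].type = RTE_FLOW_ITEM_TYPE_VOID; /* was UDP */
 items[3].type = RTE_FLOW_ITEM_TYPE_TCP;  /* was VOID */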

.. raw:: pdf

   PageBreak

``INVERT``
^^^^^^^^^^

Inverted matching, i.e. process packets that do not match the pattern.

- Both ``spec`` and ``mask`` are ignored.

+--------------------+
| INVERT             |
+==========+=========+
| ``spec`` | ignored |
+----------+---------+
| ``mask`` | ignored |
+----------+---------+

Usage example in order to match non-TCPv4 packets only:

+--------------------+
| Anything but TCPv4 |
+===+================+
| 0 | INVERT         |
+---+----------------+
| 1 | Ethernet       |
+---+----------------+
| 2 | IPv4           |
+---+----------------+
| 3 | TCP            |
+---+----------------+

``PF``
^^^^^^

Matches packets addressed to the physical function of the device.

- Both ``spec`` and ``mask`` are ignored.

+--------------------+
| PF                 |
+==========+=========+
| ``spec`` | ignored |
+----------+---------+
| ``mask`` | ignored |
+----------+---------+

``VF``
^^^^^^

Matches packets addressed to the given virtual function ID of the device.

- Only ``spec`` needs to be defined, ``mask`` is ignored.

+----------------------------------------+
| VF                                     |
+==========+=========+===================+
| ``spec`` | ``vf``  | destination VF ID |
+----------+---------+-------------------+
| ``mask`` | ignored                     |
+----------+-----------------------------+

``SIGNATURE``
^^^^^^^^^^^^^

Requests hash-based signature dispatching for this rule.

Considering this is a global setting on devices that support it, all
subsequent filter rules may have to be created with it as well.

- Only ``spec`` needs to be defined, ``mask`` is ignored.

+--------------------+
| SIGNATURE          |
+==========+=========+
| ``spec`` | TBD     |
+----------+---------+
| ``mask`` | ignored |
+----------+---------+

.. raw:: pdf

   PageBreak

Data matching item types
~~~~~~~~~~~~~~~~~~~~~~~~

Most of these are basically protocol header definitions with associated
bitmasks. They must be specified (stacked) from lowest to highest protocol
layer.

The following list is not exhaustive as new protocols will be added in the
future.

``ANY``
^^^^^^^

Matches any protocol in place of the current layer; a single ANY may also
stand for several protocol layers.

This is usually specified as the first pattern item when looking for a
protocol anywhere in a packet.

- A maximum value of **0** requests matching any number of protocol layers
  above or equal to the minimum value; a maximum value lower than the
  minimum one is otherwise invalid.
- Only ``spec`` needs to be defined, ``mask`` is ignored.

+-----------------------------------------------------------------------+
| ANY                                                                   |
+==========+=========+==================================================+
| ``spec`` | ``min`` | minimum number of layers covered                 |
|          +---------+--------------------------------------------------+
|          | ``max`` | maximum number of layers covered, 0 for infinity |
+----------+---------+--------------------------------------------------+
| ``mask`` | ignored                                                    |
+----------+------------------------------------------------------------+

Example for VXLAN TCP payload matching regardless of outer L3 (IPv4 or IPv6)
and L4 (UDP) both matched by the first ANY specification, and inner L3 (IPv4
or IPv6) matched by the second ANY specification:

+----------------------------------+
| TCP in VXLAN with wildcards      |
+===+==============================+
| 0 | Ethernet                     |
+---+-----+----------+---------+---+
| 1 | ANY | ``spec`` | ``min`` | 2 |
|   |     |          +---------+---+
|   |     |          | ``max`` | 2 |
+---+-----+----------+---------+---+
| 2 | VXLAN                        |
+---+------------------------------+
| 3 | Ethernet                     |
+---+-----+----------+---------+---+
| 4 | ANY | ``spec`` | ``min`` | 1 |
|   |     |          +---------+---+
|   |     |          | ``max`` | 1 |
+---+-----+----------+---------+---+
| 5 | TCP                          |
+---+------------------------------+

.. raw:: pdf

   PageBreak

``RAW``
^^^^^^^

Matches a string of a given length at a given offset (in bytes), or anywhere
in the payload of the current protocol layer (including L2 header if used as
the first item in the stack).

This does not increment the protocol layer count as it is not a protocol
definition. Subsequent RAW items modulate the first absolute one with
relative offsets.

- Using **-1** as the ``offset`` of the first RAW item makes its absolute
  offset not fixed, i.e. the pattern is searched everywhere.
- ``mask`` only affects the pattern.

+--------------------------------------------------------------+
| RAW                                                          |
+==========+=============+=====================================+
| ``spec`` | ``offset``  | absolute or relative pattern offset |
|          +-------------+-------------------------------------+
|          | ``length``  | pattern length                      |
|          +-------------+-------------------------------------+
|          | ``pattern`` | byte string of the above length     |
+----------+-------------+-------------------------------------+
| ``mask`` | ``offset``  | ignored                             |
|          +-------------+-------------------------------------+
|          | ``length``  | ignored                             |
|          +-------------+-------------------------------------+
|          | ``pattern`` | bitmask with the same byte length   |
+----------+-------------+-------------------------------------+

Example pattern looking for several strings at various offsets of a UDP
payload, using combined RAW items:

+------------------------------------------+
| UDP payload matching                     |
+===+======================================+
| 0 | Ethernet                             |
+---+--------------------------------------+
| 1 | IPv4                                 |
+---+--------------------------------------+
| 2 | UDP                                  |
+---+-----+----------+-------------+-------+
| 3 | RAW | ``spec`` | ``offset``  | -1    |
|   |     |          +-------------+-------+
|   |     |          | ``length``  | 3     |
|   |     |          +-------------+-------+
|   |     |          | ``pattern`` | "foo" |
+---+-----+----------+-------------+-------+
| 4 | RAW | ``spec`` | ``offset``  | 20    |
|   |     |          +-------------+-------+
|   |     |          | ``length``  | 3     |
|   |     |          +-------------+-------+
|   |     |          | ``pattern`` | "bar" |
+---+-----+----------+-------------+-------+
| 5 | RAW | ``spec`` | ``offset``  | -30   |
|   |     |          +-------------+-------+
|   |     |          | ``length``  | 3     |
|   |     |          +-------------+-------+
|   |     |          | ``pattern`` | "baz" |
+---+-----+----------+-------------+-------+

This translates to:

- Locate "foo" in UDP payload, remember its offset.
- Check "bar" at "foo"'s offset plus 20 bytes.
- Check "baz" at "foo"'s offset minus 30 bytes.

.. raw:: pdf

   PageBreak

``ETH``
^^^^^^^

Matches an Ethernet header.

- ``dst``: destination MAC.
- ``src``: source MAC.
- ``type``: EtherType.
- ``tags``: number of 802.1Q/ad tags defined.
- ``tag[]``: 802.1Q/ad tag definitions, innermost first. For each one:

 - ``tpid``: Tag protocol identifier.
 - ``tci``: Tag control information.

``IPV4``
^^^^^^^^

Matches an IPv4 header.

- ``src``: source IP address.
- ``dst``: destination IP address.
- ``tos``: ToS/DSCP field.
- ``ttl``: TTL field.
- ``proto``: protocol number for the next layer.

``IPV6``
^^^^^^^^

Matches an IPv6 header.

- ``src``: source IP address.
- ``dst``: destination IP address.
- ``tc``: traffic class field.
- ``nh``: Next header field (protocol).
- ``hop_limit``: hop limit field (TTL).

``ICMP``
^^^^^^^^

Matches an ICMP header.

- TBD.

``UDP``
^^^^^^^

Matches a UDP header.

- ``sport``: source port.
- ``dport``: destination port.
- ``length``: UDP length.
- ``checksum``: UDP checksum.

.. raw:: pdf

   PageBreak

``TCP``
^^^^^^^

Matches a TCP header.

- ``sport``: source port.
- ``dport``: destination port.
- All other TCP fields and bits.

``VXLAN``
^^^^^^^^^

Matches a VXLAN header.

- TBD.

.. raw:: pdf

   PageBreak

Actions
~~~~~~~

Each possible action is represented by a type. Some have associated
configuration structures. Several actions combined in a list can be assigned
to a flow rule. That list is not ordered.

At least one action must be defined in a filter rule in order to do
something with matched packets.

- Actions are defined with ``struct rte_flow_action``.
- A list of actions is defined with ``struct rte_flow_actions``.

They fall in three categories:

- Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
  processing matched packets by subsequent flow rules, unless overridden
  with PASSTHRU.

- Non terminating actions (PASSTHRU, DUP) that leave matched packets up for
  additional processing by subsequent flow rules.

- Other non terminating meta actions that do not affect the fate of packets
  (END, VOID, ID, COUNT).

When several actions are combined in a flow rule, they should all have
different types (e.g. dropping a packet twice is not possible). The VOID
type is an exception to this rule. For other repeated types, the defined
behavior is for PMDs to take into account only the last action of a given
type found in the list. PMDs still perform error checking on the entire
list.

*Note that PASSTHRU is the only action able to override a terminating rule.*

.. raw:: pdf

   PageBreak

Example of an action that redirects packets to queue index 10:

+----------------+
| QUEUE          |
+===========+====+
| ``queue`` | 10 |
+-----------+----+

Action list examples; their order is not significant, and applications must
consider all actions to be performed simultaneously:

+----------------+
| Count and drop |
+=======+========+
| COUNT |        |
+-------+--------+
| DROP  |        |
+-------+--------+

+--------------------------+
| Tag, count and redirect  |
+=======+===========+======+
| ID    | ``id``    | 0x2a |
+-------+-----------+------+
| COUNT |                  |
+-------+-----------+------+
| QUEUE | ``queue`` | 10   |
+-------+-----------+------+

+-----------------------+
| Redirect to queue 5   |
+=======+===============+
| DROP  |               |
+-------+-----------+---+
| QUEUE | ``queue`` | 5 |
+-------+-----------+---+

In the above example, considering both actions are performed simultaneously,
the end result is that only QUEUE has any effect.

+-----------------------+
| Redirect to queue 3   |
+=======+===========+===+
| QUEUE | ``queue`` | 5 |
+-------+-----------+---+
| VOID  |               |
+-------+-----------+---+
| QUEUE | ``queue`` | 3 |
+-------+-----------+---+

As previously described, only the last action of a given type found in the
list is taken into account. The above example also shows that VOID is
ignored.
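
For reference, the earlier "tag, count and redirect" example might be built
as in the following sketch; the ``struct rte_flow_action_id`` and ``struct
rte_flow_action_queue`` configuration structures and the
``RTE_FLOW_ACTION_TYPE_*`` naming scheme are assumptions, not definitions
taken from the draft header:

::

 /* Sketch: END-terminated action list, order is not significant. */
 static const struct rte_flow_action_id id_conf = { .id = 0x2a };
 static const struct rte_flow_action_queue queue_conf = { .queue = 10 };

 static const struct rte_flow_action action_list[] = {
         { RTE_FLOW_ACTION_TYPE_ID,    &id_conf },
         { RTE_FLOW_ACTION_TYPE_COUNT, NULL },
         { RTE_FLOW_ACTION_TYPE_QUEUE, &queue_conf },
         { RTE_FLOW_ACTION_TYPE_END,   NULL },
 };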

.. raw:: pdf

   PageBreak

Action types
~~~~~~~~~~~~

Common action types are described in this section. Like pattern item types,
this list is not exhaustive as new actions will be added in the future.

``END`` (action)
^^^^^^^^^^^^^^^^

End marker for action lists. Prevents further processing of actions, thereby
ending the list.

- Its numeric value is **0** for convenience.
- PMD support is mandatory.
- No configurable property.

+---------------+
| END           |
+===============+
| no properties |
+---------------+

``VOID`` (action)
^^^^^^^^^^^^^^^^^

Used as a placeholder for convenience. It is ignored and simply discarded by
PMDs.

- PMD support is mandatory.
- No configurable property.

+---------------+
| VOID          |
+===============+
| no properties |
+---------------+

``PASSTHRU``
^^^^^^^^^^^^

Leaves packets up for additional processing by subsequent flow rules. This
is the default when a rule does not contain a terminating action, but can be
specified to force a rule to become non-terminating.

- No configurable property.

+---------------+
| PASSTHRU      |
+===============+
| no properties |
+---------------+

Example to copy a packet to a queue and continue processing by subsequent
flow rules:

+--------------------------+
| Copy to queue 8          |
+==========+===============+
| PASSTHRU |               |
+----------+-----------+---+
| QUEUE    | ``queue`` | 8 |
+----------+-----------+---+

``ID``
^^^^^^

Attaches a 32 bit value to packets.

+----------------------------------------------+
| ID                                           |
+========+=====================================+
| ``id`` | 32 bit value to return with packets |
+--------+-------------------------------------+

.. raw:: pdf

   PageBreak

``QUEUE``
^^^^^^^^^

Assigns packets to a given queue index.

- Terminating by default.

+--------------------------------+
| QUEUE                          |
+===========+====================+
| ``queue`` | queue index to use |
+-----------+--------------------+

``DROP``
^^^^^^^^

Drop packets.

- No configurable property.
- Terminating by default.
- PASSTHRU overrides this action if both are specified.

+---------------+
| DROP          |
+===============+
| no properties |
+---------------+

``COUNT``
^^^^^^^^^

Enables a hit counter for this rule.

This counter can be retrieved and reset through ``rte_flow_query()``, see
``struct rte_flow_query_count``.

- Counters can be retrieved with ``rte_flow_query()``.
- No configurable property.

+---------------+
| COUNT         |
+===============+
| no properties |
+---------------+

Query structure to retrieve and reset the flow rule hits counter:

+------------------------------------------------+
| COUNT query                                    |
+===========+=====+==============================+
| ``reset`` | in  | reset counter after query    |
+-----------+-----+------------------------------+
| ``hits``  | out | number of hits for this flow |
+-----------+-----+------------------------------+

``DUP``
^^^^^^^

Duplicates packets to a given queue index.

This is normally combined with QUEUE; however, when used alone, it is
actually similar to QUEUE + PASSTHRU.

- Non-terminating by default.

+------------------------------------------------+
| DUP                                            |
+===========+====================================+
| ``queue`` | queue index to duplicate packet to |
+-----------+------------------------------------+

.. raw:: pdf

   PageBreak

``RSS``
^^^^^^^

Similar to QUEUE, except RSS is additionally performed on packets to spread
them among several queues according to the provided parameters.

- Terminating by default.

+---------------------------------------------+
| RSS                                         |
+==============+==============================+
| ``rss_conf`` | RSS parameters               |
+--------------+------------------------------+
| ``queues``   | number of entries in queue[] |
+--------------+------------------------------+
| ``queue[]``  | queue indices to use         |
+--------------+------------------------------+

``PF`` (action)
^^^^^^^^^^^^^^^

Redirects packets to the physical function (PF) of the current device.

- No configurable property.
- Terminating by default.

+---------------+
| PF            |
+===============+
| no properties |
+---------------+

``VF`` (action)
^^^^^^^^^^^^^^^

Redirects packets to the virtual function (VF) of the current device with
the specified ID.

- Terminating by default.

+---------------------------------------+
| VF                                    |
+========+==============================+
| ``id`` | VF ID to redirect packets to |
+--------+------------------------------+

Planned types
~~~~~~~~~~~~~

Other action types are planned but not defined yet. These actions will add
the ability to alter matching packets in several ways, such as performing
encapsulation/decapsulation of tunnel headers on specific flows.

.. raw:: pdf

   PageBreak

Rules management
----------------

A simple API with only four functions is provided to fully manage flows.

Each created flow rule is associated with an opaque, PMD-specific handle
pointer. The application is responsible for keeping it until the rule is
destroyed.

Flow rules are defined with ``struct rte_flow``.

Validation
~~~~~~~~~~

Given that expressing a definite set of device capabilities with this API is
not practical, a dedicated function is provided to check if a flow rule is
supported and can be created.

::

 int
 rte_flow_validate(uint8_t port_id,
                   const struct rte_flow_pattern *pattern,
                   const struct rte_flow_actions *actions);

While this function has no effect on the target device, the flow rule is
validated against its current configuration state and the returned value
should be considered valid by the caller for that state only.

The returned value is guaranteed to remain valid only as long as no
successful calls to rte_flow_create() or rte_flow_destroy() are made in the
meantime and no device parameter affecting flow rules in any way is
modified, due to possible collisions or resource limitations (although in
such cases ``EINVAL`` should not be returned).

Arguments:

- ``port_id``: port identifier of Ethernet device.
- ``pattern``: pattern specification to check.
- ``actions``: actions associated with the flow definition.

Return value:

- **0** if flow rule is valid and can be created. A negative errno value
  otherwise (``rte_errno`` is also set); the following errors are defined:
- ``-EINVAL``: unknown or invalid rule specification.
- ``-ENOTSUP``: valid but unsupported rule specification (e.g. partial masks
  are unsupported).
- ``-EEXIST``: collision with an existing rule.
- ``-ENOMEM``: not enough resources.
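
A minimal usage sketch, assuming ``struct rte_flow_pattern`` and ``struct
rte_flow_actions`` simply wrap the hypothetical item and action arrays from
previous sketches:

::

 /* Sketch: probe for rule support at initialization time. */
 static const struct rte_flow_pattern tcpv4_pattern = {
         .items = tcpv4_items, /* from an earlier sketch */
 };
 static const struct rte_flow_actions actions = {
         .actions = action_list, /* from an earlier sketch */
 };

 if (rte_flow_validate(port_id, &tcpv4_pattern, &actions) != 0)
         rte_exit(EXIT_FAILURE, "required flow rule not supported: %s\n",
                  strerror(rte_errno));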

.. raw:: pdf

   PageBreak

Creation
~~~~~~~~

Creating a flow rule is similar to validating one, except the rule is
actually created.

::

 struct rte_flow *
 rte_flow_create(uint8_t port_id,
                 const struct rte_flow_pattern *pattern,
                 const struct rte_flow_actions *actions);

Arguments:

- ``port_id``: port identifier of Ethernet device.
- ``pattern``: pattern specification to add.
- ``actions``: actions associated with the flow definition.

Return value:

A valid flow pointer in case of success, NULL otherwise; ``rte_errno`` is
then set to the positive version of one of the error codes defined for
``rte_flow_validate()``.
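
Continuing the previous sketch, creation only differs in that the returned
handle must be kept for later queries and destruction:

::

 /* Sketch: create the rule and keep its opaque handle. */
 struct rte_flow *flow;

 flow = rte_flow_create(port_id, &tcpv4_pattern, &actions);
 if (flow == NULL)
         rte_exit(EXIT_FAILURE, "flow rule creation failed: %s\n",
                  strerror(rte_errno));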

Destruction
~~~~~~~~~~~

Flow rule destruction is not automatic, and a queue should not be released
if any rules are still attached to it. Applications must take care of
performing this step before releasing resources.

::

 int
 rte_flow_destroy(uint8_t port_id,
                  struct rte_flow *flow);


Failure to destroy a flow rule may occur when other flow rules depend on it,
and destroying it would result in an inconsistent state.

This function is only guaranteed to succeed if flow rules are destroyed in
reverse order of their creation.

Arguments:

- ``port_id``: port identifier of Ethernet device.
- ``flow``: flow rule to destroy.

Return value:

- **0** on success, a negative errno value otherwise and ``rte_errno`` is
  set.
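
A teardown sketch honoring the reverse-order guarantee, assuming the
application kept its handles in a hypothetical ``flows[]`` array of
``nb_flows`` entries, in creation order:

::

 /* Sketch: destroy rules in reverse order of creation, before the
  * queues they reference are released. */
 while (nb_flows > 0) {
         --nb_flows;
         if (rte_flow_destroy(port_id, flows[nb_flows]) != 0)
                 printf("cannot destroy flow #%u: %s\n",
                        nb_flows, strerror(rte_errno));
 }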

.. raw:: pdf

   PageBreak

Query
~~~~~

Query an existing flow rule.

This function allows retrieving flow-specific data such as counters. Data
is gathered by special actions which must be present in the flow rule
definition.

::

 int
 rte_flow_query(uint8_t port_id,
                struct rte_flow *flow,
                enum rte_flow_action_type action,
                void *data);

Arguments:

- ``port_id``: port identifier of Ethernet device.
- ``flow``: flow rule to query.
- ``action``: action type to query.
- ``data``: pointer to storage for the associated query data type.

Return value:

- **0** on success, a negative errno value otherwise and ``rte_errno`` is
  set.
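
A sketch reading back the counter of a rule created with a `COUNT`_ action;
the field names come from the query structure table in that section, while
their exact types are assumed:

::

 /* Sketch: retrieve and reset the rule's hit counter. */
 struct rte_flow_query_count count = { .reset = 1 };

 if (rte_flow_query(port_id, flow, RTE_FLOW_ACTION_TYPE_COUNT,
                    &count) == 0)
         printf("hits: %llu\n", (unsigned long long)count.hits);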

.. raw:: pdf

   PageBreak

Behavior
--------

- API operations are synchronous and blocking (``EAGAIN`` cannot be
  returned).

- There is no provision for reentrancy/multi-thread safety, although nothing
  should prevent different devices from being configured at the same
  time. PMDs may protect their control path functions accordingly.

- Stopping the data path (TX/RX) should not be necessary when managing flow
  rules. If this cannot be achieved naturally or with workarounds (such as
  temporarily replacing the burst function pointers), an appropriate error
  code must be returned (``EBUSY``).

- PMDs, not applications, are responsible for maintaining flow rules
  configuration when stopping and restarting a port or performing other
  actions which may affect them. They can only be destroyed explicitly.

.. raw:: pdf

   PageBreak

Compatibility
-------------

No known hardware implementation supports all the features described in this
document.

Unsupported features or combinations are not expected to be fully emulated
in software by PMDs for performance reasons. Partially supported features
may be completed in software as long as hardware performs most of the work
(such as queue redirection and packet recognition).

However PMDs are expected to do their best to satisfy application requests
by working around hardware limitations as long as doing so does not affect
the behavior of existing flow rules.

The following sections provide a few examples of such cases, they are based
on limitations built into the previous APIs.

Global bitmasks
~~~~~~~~~~~~~~~

Each flow rule comes with its own, per-layer bitmasks, while hardware may
support only a single, device-wide bitmask for a given layer type, so that
two IPv4 rules cannot use different bitmasks.

The expected behavior in this case is that PMDs automatically configure
global bitmasks according to the needs of the first created flow rule.

Subsequent rules are allowed only if their bitmasks match those; the
``EEXIST`` error code should be returned otherwise.

Unsupported layer types
~~~~~~~~~~~~~~~~~~~~~~~

Many protocols can be simulated by crafting patterns with the `RAW`_ type.

PMDs can rely on this capability to simulate support for protocols with
fixed headers not directly recognized by hardware.

``ANY`` pattern item
~~~~~~~~~~~~~~~~~~~~

This pattern item stands for anything, which can be difficult to translate
to something hardware would understand, particularly if followed by more
specific types.

Consider the following pattern:

+---+--------------------------------+
| 0 | ETHER                          |
+---+--------------------------------+
| 1 | ANY (``min`` = 1, ``max`` = 1) |
+---+--------------------------------+
| 2 | TCP                            |
+---+--------------------------------+

Knowing that TCP does not make sense with anything other than IPv4 or IPv6
as L3, such a pattern may be translated to two flow rules instead:

+---+--------------------+
| 0 | ETHER              |
+---+--------------------+
| 1 | IPV4 (zeroed mask) |
+---+--------------------+
| 2 | TCP                |
+---+--------------------+

+---+--------------------+
| 0 | ETHER              |
+---+--------------------+
| 1 | IPV6 (zeroed mask) |
+---+--------------------+
| 2 | TCP                |
+---+--------------------+

Note that as soon as an ANY rule covers several layers, this approach may
yield a large number of hidden flow rules. It is thus suggested to only
support the most common scenarios (anything as L2 and/or L3).

.. raw:: pdf

   PageBreak

Unsupported actions
~~~~~~~~~~~~~~~~~~~

- When combined with a `QUEUE`_ action, packet counting (`COUNT`_) and
  tagging (`ID`_) may be implemented in software as long as the target queue
  is used by a single rule.

- A rule specifying both `DUP`_ + `QUEUE`_ may be translated to two hidden
  rules combining `QUEUE`_ and `PASSTHRU`_.

- When a single target queue is provided, `RSS`_ can also be implemented
  through `QUEUE`_.

Flow rules priority
~~~~~~~~~~~~~~~~~~~

While it would naturally make sense, flow rules cannot be assumed to be
processed by hardware in the same order as their creation for several
reasons:

- They may be managed internally as a tree or a hash table instead of a
  list.
- Removing a flow rule before adding another one can either put the new rule
  at the end of the list or reuse a freed entry.
- Duplication may occur when packets are matched by several rules.

For overlapping rules (particularly in order to use the `PASSTHRU`_ action)
predictable behavior is only guaranteed by using different priority levels.

Priority levels are not necessarily implemented in hardware, or may be
severely limited (e.g. a single priority bit).

For these reasons, priority levels may be implemented purely in software by
PMDs.

- For devices expecting flow rules to be added in the correct order, PMDs
  may destroy and re-create existing rules after adding a new one with
  a higher priority.

- A configurable number of dummy or empty rules can be created at
  initialization time to save high priority slots for later.

- In order to save priority levels, PMDs may evaluate whether rules are
  likely to collide and adjust their priority accordingly.

.. raw:: pdf

   PageBreak

API migration
=============

Exhaustive list of deprecated filter types and how to convert them to
generic flow rules.

``MACVLAN`` to ``ETH`` → ``VF``, ``PF``
---------------------------------------

`MACVLAN`_ can be translated to a basic `ETH`_ flow rule with a `VF
(action)`_ or `PF (action)`_ terminating action.

+------------------------------------+
| MACVLAN                            |
+--------------------------+---------+
| Pattern                  | Actions |
+===+=====+==========+=====+=========+
| 0 | ETH | ``spec`` | any | VF,     |
|   |     +----------+-----+ PF      |
|   |     | ``mask`` | any |         |
+---+-----+----------+-----+---------+

``ETHERTYPE`` to ``ETH`` → ``QUEUE``, ``DROP``
----------------------------------------------

`ETHERTYPE`_ is basically an `ETH`_ flow rule with `QUEUE`_ or `DROP`_ as
a terminating action.

+------------------------------------+
| ETHERTYPE                          |
+--------------------------+---------+
| Pattern                  | Actions |
+===+=====+==========+=====+=========+
| 0 | ETH | ``spec`` | any | QUEUE,  |
|   |     +----------+-----+ DROP    |
|   |     | ``mask`` | any |         |
+---+-----+----------+-----+---------+

``FLEXIBLE`` to ``RAW`` → ``QUEUE``
-----------------------------------

`FLEXIBLE`_ can be translated to one `RAW`_ pattern with `QUEUE`_ as the
terminating action and a defined priority level.

+------------------------------------+
| FLEXIBLE                           |
+--------------------------+---------+
| Pattern                  | Actions |
+===+=====+==========+=====+=========+
| 0 | RAW | ``spec`` | any | QUEUE   |
|   |     +----------+-----+         |
|   |     | ``mask`` | any |         |
+---+-----+----------+-----+---------+

``SYN`` to ``TCP`` → ``QUEUE``
------------------------------

`SYN`_ is a `TCP`_ rule with only the ``syn`` bit enabled and masked, and
`QUEUE`_ as the terminating action.

Priority level can be set to simulate the high priority bit.

+---------------------------------------------+
| SYN                                         |
+-----------------------------------+---------+
| Pattern                           | Actions |
+===+======+==========+=============+=========+
| 0 | ETH  | ``spec`` | N/A         | QUEUE   |
|   |      +----------+-------------+         |
|   |      | ``mask`` | empty       |         |
+---+------+----------+-------------+         |
| 1 | IPV4 | ``spec`` | N/A         |         |
|   |      +----------+-------------+         |
|   |      | ``mask`` | empty       |         |
+---+------+----------+-------------+         |
| 2 | TCP  | ``spec`` | ``syn`` = 1 |         |
|   |      +----------+-------------+         |
|   |      | ``mask`` | ``syn`` = 1 |         |
+---+------+----------+-------------+---------+

``NTUPLE`` to ``IPV4``, ``TCP``, ``UDP`` → ``QUEUE``
----------------------------------------------------

`NTUPLE`_ is similar to specifying an empty L2, `IPV4`_ as L3 with `TCP`_ or
`UDP`_ as L4 and `QUEUE`_ as the terminating action.

A priority level can be specified as well.

+---------------------------------------+
| NTUPLE                                |
+-----------------------------+---------+
| Pattern                     | Actions |
+===+======+==========+=======+=========+
| 0 | ETH  | ``spec`` | N/A   | QUEUE   |
|   |      +----------+-------+         |
|   |      | ``mask`` | empty |         |
+---+------+----------+-------+         |
| 1 | IPV4 | ``spec`` | any   |         |
|   |      +----------+-------+         |
|   |      | ``mask`` | any   |         |
+---+------+----------+-------+         |
| 2 | TCP, | ``spec`` | any   |         |
|   | UDP  +----------+-------+         |
|   |      | ``mask`` | any   |         |
+---+------+----------+-------+---------+

``TUNNEL`` to ``ETH``, ``IPV4``, ``IPV6``, ``VXLAN`` (or other) → ``QUEUE``
---------------------------------------------------------------------------

`TUNNEL`_ matches common IPv4 and IPv6 L3/L4-based tunnel types.

In the following table, `ANY`_ is used to cover the optional L4.

+------------------------------------------------+
| TUNNEL                                         |
+--------------------------------------+---------+
| Pattern                              | Actions |
+===+=========+==========+=============+=========+
| 0 | ETH     | ``spec`` | any         | QUEUE   |
|   |         +----------+-------------+         |
|   |         | ``mask`` | any         |         |
+---+---------+----------+-------------+         |
| 1 | IPV4,   | ``spec`` | any         |         |
|   | IPV6    +----------+-------------+         |
|   |         | ``mask`` | any         |         |
+---+---------+----------+-------------+         |
| 2 | ANY     | ``spec`` | ``min`` = 0 |         |
|   |         |          +-------------+         |
|   |         |          | ``max`` = 0 |         |
|   |         +----------+-------------+         |
|   |         | ``mask`` | N/A         |         |
+---+---------+----------+-------------+         |
| 3 | VXLAN,  | ``spec`` | any         |         |
|   | GENEVE, +----------+-------------+         |
|   | TEREDO, | ``mask`` | any         |         |
|   | NVGRE,  |          |             |         |
|   | GRE,    |          |             |         |
|   | ...     |          |             |         |
+---+---------+----------+-------------+---------+

.. raw:: pdf

   PageBreak

``FDIR`` to most item types → ``QUEUE``, ``DROP``, ``PASSTHRU``
---------------------------------------------------------------

`FDIR`_ is more complex than any other type; there are several methods to
emulate its functionality. It is summarized for the most part in the table
below.

A few features are intentionally not supported:

- The ability to configure the matching input set and masks for the entire
  device, PMDs should take care of it automatically according to flow rules.

- Returning four or eight bytes of matched data when using flex bytes
  filtering. Although a specific action could implement it, it conflicts
  with the much more useful 32 bits tagging on devices that support it.

- Side effects on RSS processing of the entire device. Flow rules that
  conflict with the current device configuration should not be
  allowed. Similarly, device configuration should not be allowed when it
  affects existing flow rules.

- Device modes of operation. "none" is unsupported since filtering cannot be
  disabled as long as a flow rule is present.

- "MAC VLAN" or "tunnel" perfect matching modes should be automatically set
  according to the created flow rules.

+----------------------------------------------+
| FDIR                                         |
+---------------------------------+------------+
| Pattern                         | Actions    |
+===+============+==========+=====+============+
| 0 | ETH,       | ``spec`` | any | QUEUE,     |
|   | RAW        +----------+-----+ DROP,      |
|   |            | ``mask`` | any | PASSTHRU   |
+---+------------+----------+-----+------------+
| 1 | IPV4,      | ``spec`` | any | ID         |
|   | IPV6       +----------+-----+ (optional) |
|   |            | ``mask`` | any |            |
+---+------------+----------+-----+            |
| 2 | TCP,       | ``spec`` | any |            |
|   | UDP,       +----------+-----+            |
|   | SCTP       | ``mask`` | any |            |
+---+------------+----------+-----+            |
| 3 | VF,        | ``spec`` | any |            |
|   | PF,        +----------+-----+            |
|   | SIGNATURE  | ``mask`` | any |            |
|   | (optional) |          |     |            |
+---+------------+----------+-----+------------+

``HASH``
--------

Hashing configuration is set per rule through the `SIGNATURE`_ item.

Since it is usually a global device setting, all flow rules created with
this item may have to share the same specification.

``L2_TUNNEL`` to ``VOID`` → ``VXLAN`` (or others)
-------------------------------------------------

All packets are matched. This type alters incoming packets to encapsulate
them in a chosen tunnel type, optionally redirecting them to a VF as well.

The destination pool for tag based forwarding can be emulated with other
flow rules using `DUP`_ as the action.

+----------------------------------------+
| L2_TUNNEL                              |
+---------------------------+------------+
| Pattern                   | Actions    |
+===+======+==========+=====+============+
| 0 | VOID | ``spec`` | N/A | VXLAN,     |
|   |      |          |     | GENEVE,    |
|   |      |          |     | ...        |
|   |      +----------+-----+------------+
|   |      | ``mask`` | N/A | VF         |
|   |      |          |     | (optional) |
+---+------+----------+-----+------------+

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 54+ messages in thread
