DPDK patches and discussions
* [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
@ 2019-05-26 10:18 Matan Azrad
  2019-06-06 10:24 ` Jerin Jacob Kollanukkaran
                   ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Matan Azrad @ 2019-05-26 10:18 UTC (permalink / raw)
  To: Adrien Mazarguil, dev

One reason to destroy a flow is that no packet has matched it for
"timeout" time, for example when TCP/UDP sessions are suddenly closed.

Currently, DPDK has no mechanism for flow aging, and applications use
their own ways to detect and destroy aged-out flows.

This RFC introduces flow aging APIs to offload the flow aging task from
the application to the port.

Design:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flow
  contexts from the port.

With this design, each PMD can implement aging in the way that best suits
the device offloads supported by its HW.

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 lib/librte_ethdev/rte_ethdev.h |  1 +
 lib/librte_ethdev/rte_flow.h   | 56 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 1f35e1d..6fc1531 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in the port */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 63f84fc..757e65f 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -1650,6 +1650,12 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_mac.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_MAC_DST,
+	/**
+	 * Report as aged-out if timeout passed without any matching on the flow.
+	 *
+	 * See struct rte_flow_action_age.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2131,6 +2137,22 @@ struct rte_flow_action_set_mac {
 	uint8_t mac_addr[ETHER_ADDR_LEN];
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report as aged-out if timeout passed without any matching on the flow.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint16_t timeout; /**< Time in seconds. */
+	void *context; /**< The user flow context. */
+};
+
 /*
  * Definition of a single action.
  *
@@ -2686,6 +2708,40 @@ struct rte_flow_desc {
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * RTE_ETH_EVENT_FLOW_AGED is triggered when a port detects aged-out flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously when the user wants it.
+ * Callback synchronization is the user's responsibility.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in, out] flows
+ *   An allocated array to receive the aged-out flow handles.
+ *   NULL indicates the flow handles should not be reported.
+ * @param[in, out] contexts
+ *   An allocated array to receive the aged-out flow contexts.
+ *   NULL indicates the flow contexts should not be reported.
+ * @param[in] n
+ *   The number of allocated entries in @p flows and @p contexts, if present.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   0 if there are no aged-out flows, a positive value giving the number
+ *   of aged-out flows reported in @p flows and/or @p contexts, or a
+ *   negative errno value otherwise and rte_errno is set.
+ *
+ * @see rte_flow_action_age
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, struct rte_flow *flows[],
+			void *contexts[], int n, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
-- 
1.8.3.1
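
The application side of the proposed API can be sketched as a batch drain loop. This is a minimal sketch, not DPDK code: `age_conf` is a local stand-in with the same fields as the proposed `rte_flow_action_age`, while `session`, `mock_age_out`, `mock_get_aged_contexts` and `drain_aged_sessions` are hypothetical names, and the mock only imitates the return convention documented for `rte_flow_get_aged_flows` so that the sketch compiles without DPDK headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in with the same fields as the proposed rte_flow_action_age. */
struct age_conf {
	uint16_t timeout; /* seconds without a matching packet */
	void *context;    /* opaque per-flow application context */
};

/* Hypothetical application object used as the flow context. */
struct session {
	int id;
	int expired;
};

/* Mock port: contexts of flows that aged out and were not read yet. */
#define MOCK_MAX_AGED 64
static void *mock_aged[MOCK_MAX_AGED];
static int mock_aged_count;

/* Mock of the port detecting that the flow carrying this action aged out. */
static void
mock_age_out(const struct age_conf *conf)
{
	if (mock_aged_count < MOCK_MAX_AGED)
		mock_aged[mock_aged_count++] = conf->context;
}

/*
 * Mock following the documented return convention: the number of contexts
 * written (0 when none are pending). Reporting each aged flow only once is
 * an assumption of this sketch.
 */
static int
mock_get_aged_contexts(void *contexts[], int n)
{
	int out = 0;

	while (out < n && mock_aged_count > 0)
		contexts[out++] = mock_aged[--mock_aged_count];
	return out;
}

/* Drain pending aged-out flows in batches; returns the number handled. */
static int
drain_aged_sessions(void)
{
	void *batch[8];
	int total = 0, n, i;

	do {
		n = mock_get_aged_contexts(batch, 8);
		if (n < 0)
			return n; /* the real API would report a negative errno */
		for (i = 0; i < n; i++) {
			struct session *s = batch[i];

			s->expired = 1; /* a real application would destroy the flow */
		}
		total += n;
	} while (n == 8);
	return total;
}
```

Looping while the batch comes back full lets a small, fixed-size array drain any number of aged flows; the same loop could run from the event callback or from a periodic poll.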



* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-05-26 10:18 [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging Matan Azrad
@ 2019-06-06 10:24 ` Jerin Jacob Kollanukkaran
  2019-06-06 10:51   ` Matan Azrad
  2020-03-16 10:22 ` [dpdk-dev] [PATCH v2] " BillZhou
  2020-03-16 12:52 ` BillZhou
  2 siblings, 1 reply; 50+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-06-06 10:24 UTC (permalink / raw)
  To: Matan Azrad, Adrien Mazarguil, dev

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> Sent: Sunday, May 26, 2019 3:48 PM
> To: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> 
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is no any dpdk mechanism for flow aging and the
> applications use there own ways to detect and destroy aged-out flows.
> 
> This RFC introduces flow aging APIs to offload the flow aging task from the
> application to the port.
> 
> Design:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
> and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> 
> By this design each PMD can use its best way to do the aging with the device
> offloads supported by its HW.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  lib/librte_ethdev/rte_ethdev.h |  1 +
>  lib/librte_ethdev/rte_flow.h   | 56
> ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 1f35e1d..6fc1531 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in
> the port

Is this event supported in HW? Or are you planning to implement it with an
alarm or timer? Just asking because, if no HW supports the interrupt, then
the synchronous rte_flow_get_aged_flows() API alone would be enough.



* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-06-06 10:24 ` Jerin Jacob Kollanukkaran
@ 2019-06-06 10:51   ` Matan Azrad
  2019-06-06 12:15     ` Jerin Jacob Kollanukkaran
  0 siblings, 1 reply; 50+ messages in thread
From: Matan Azrad @ 2019-06-06 10:51 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, Adrien Mazarguil, dev

Hi Jerin

From: Jerin Jacob 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> > Sent: Sunday, May 26, 2019 3:48 PM
> > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> >
> > One of the reasons to destroy a flow is the fact that no packet
> > matches the flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is no any dpdk mechanism for flow aging and the
> > applications use there own ways to detect and destroy aged-out flows.
> >
> > This RFC introduces flow aging APIs to offload the flow aging task
> > from the application to the port.
> >
> > Design:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
> > and
> >   the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> >   that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >   contexts from the port.
> >
> > By this design each PMD can use its best way to do the aging with the
> > device offloads supported by its HW.
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  lib/librte_ethdev/rte_ethdev.h |  1 +
> >  lib/librte_ethdev/rte_flow.h   | 56
> > ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 57 insertions(+)
> >
> > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > b/lib/librte_ethdev/rte_ethdev.h index 1f35e1d..6fc1531 100644
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
> >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in
> > the port
> Does this event supported in HW?
It depends on the PMD implementation and HW capability.

> Or Are planning to implement with alarm
> or timer.
Again, it depends on the PMD implementation.

> Just asking because, if none of the HW supports the interrupt then
> only rte_flow_get_aged_flows sync API be enough()
Why?

According to the above design, the event is the way for the PMD to notify the application as soon as it has aged flows.
So, if the PMD uses an alarm/timer or any other way to support the aging action, in some cases it is better to notify the user asynchronously than to have the application poll.
The idea is to let the application decide what is better for its usage.
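
Since the RFC leaves callback synchronization to the user, one possible pattern is for the event callback to only record that aged flows exist and for the application's own thread to do the query. The following is a minimal sketch using C11 atomics; `on_flow_aged_event` and `should_query_aged_flows` are hypothetical names, and registering the handler with the ethdev callback API is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set from the event callback context, consumed by the application thread. */
static atomic_bool aged_pending;

/*
 * Hypothetical handler body for RTE_ETH_EVENT_FLOW_AGED. It only records
 * that aged flows exist; the query itself is deferred out of the callback.
 */
static void
on_flow_aged_event(void)
{
	atomic_store(&aged_pending, true);
}

/*
 * Called from the application's own loop. Returns true when the caller
 * should query the port for aged-out flows. The flag is cleared with an
 * atomic exchange so an event arriving at the same time is not lost.
 */
static bool
should_query_aged_flows(void)
{
	return atomic_exchange(&aged_pending, false);
}
```

Several events that arrive before the loop runs collapse into a single query, which is harmless because the query returns all pending aged flows.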

For the mlx5 case,
the plan is to raise this event from the HW interrupt handler (same as the link event).

Matan.


* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-06-06 10:51   ` Matan Azrad
@ 2019-06-06 12:15     ` Jerin Jacob Kollanukkaran
  2019-06-18  5:56       ` Matan Azrad
  2020-03-16 16:13       ` Stephen Hemminger
  0 siblings, 2 replies; 50+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-06-06 12:15 UTC (permalink / raw)
  To: Matan Azrad, Adrien Mazarguil, dev

> -----Original Message-----
> From: Matan Azrad <matan@mellanox.com>
> Sent: Thursday, June 6, 2019 4:22 PM
> To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien Mazarguil
> <adrien.mazarguil@6wind.com>; dev@dpdk.org
> Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> 
> Hi Jerin

Hi Matan,

> 
> From: Jerin Jacob
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> > > Sent: Sunday, May 26, 2019 3:48 PM
> > > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> > >
> > > One of the reasons to destroy a flow is the fact that no packet
> > > matches the flow for "timeout" time.
> > > For example, when TCP\UDP sessions are suddenly closed.
> > >
> > > Currently, there is no any dpdk mechanism for flow aging and the
> > > applications use there own ways to detect and destroy aged-out flows.
> > >
> > > This RFC introduces flow aging APIs to offload the flow aging task
> > > from the application to the port.
> > >
> > > Design:
> > > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> timeout
> > > and
> > >   the application flow context for each flow.
> > > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> > report
> > >   that there are new aged-out flows.
> > > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out
> flows
> > >   contexts from the port.
> > >
> > > By this design each PMD can use its best way to do the aging with
> > > the device offloads supported by its HW.
> > >
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > >  lib/librte_ethdev/rte_ethdev.h |  1 +
> > >  lib/librte_ethdev/rte_flow.h   | 56
> > > ++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 57 insertions(+)
> > >
> > > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > > b/lib/librte_ethdev/rte_ethdev.h index 1f35e1d..6fc1531 100644
> > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
> > >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> > >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in
> > > the port
> > Does this event supported in HW?
> It depends in the PMD implementation and HW capability.
> 
> > Or Are planning to implement with alarm or timer.
> Again, depends in the PMD implementation.
> 
> > Just asking because, if none of the HW supports the interrupt then
> > only rte_flow_get_aged_flows sync API be enough()
> Why?

If no HW supports it, then the application/common code can poll periodically.
If mlx5 HW supports it, then it is fine to have the interrupt.
But I think we need a means to express that a HW/implementation does not support it,
as there may be the following reasons why drivers choose not to take the timer/alarm path:
1) Some EAL ports do not support timer/alarm, for example the FreeBSD DPDK port.
2) If we need to support a few thousand rules, a timer/alarm implementation will be heavy.
So an option to express that the event is unsupported would be fine.
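
The fallback being asked for could look roughly like this on the application side. This is only a sketch: the RFC defines no capability flag, so `port_aging_caps`, `aged_event_supported`, `choose_aging_mode` and `poll_due` are all hypothetical names used for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum aging_mode { AGING_MODE_EVENT, AGING_MODE_POLL };

/* Hypothetical per-port capability; no such flag exists in this RFC. */
struct port_aging_caps {
	bool aged_event_supported;
};

/* Use the event when the port can raise it, otherwise fall back to polling. */
static enum aging_mode
choose_aging_mode(const struct port_aging_caps *caps)
{
	return caps->aged_event_supported ? AGING_MODE_EVENT : AGING_MODE_POLL;
}

/* In poll mode, true once at least `period` seconds passed since last_poll. */
static bool
poll_due(enum aging_mode mode, uint64_t now, uint64_t last_poll,
	 uint64_t period)
{
	return mode == AGING_MODE_POLL && now - last_poll >= period;
}
```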

> 
> According to the above design this is the way for the PMD to notify the
> application when it has some aged flows ASAP.
> So, if the PMD uses an alarm\timer or any other way to support aging action
> it is better in part of the cases to notify the user asynchronically instead of
> doing polling by the application.
> The idea is to let the application to decide what is better for its usage.
> 
> For mlx5 case,
> The plan is to raise this event from an HW interrupt handling(same as link
> event).

Good to know.

> 
> Matan.
> 
> 
> 
> 
> 
> 
> 



* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-06-06 12:15     ` Jerin Jacob Kollanukkaran
@ 2019-06-18  5:56       ` Matan Azrad
  2019-06-24  6:26         ` Jerin Jacob Kollanukkaran
  2020-03-16 16:13       ` Stephen Hemminger
  1 sibling, 1 reply; 50+ messages in thread
From: Matan Azrad @ 2019-06-18  5:56 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, Adrien Mazarguil, dev

Hi Jerin

From: Jerin Jacob
> Sent: Thursday, June 6, 2019 3:16 PM
> To: Matan Azrad <matan@mellanox.com>; Adrien Mazarguil
> <adrien.mazarguil@6wind.com>; dev@dpdk.org
> Subject: RE: [PATCH] [RFC] ethdev: support flow aging
> 
> > -----Original Message-----
> > From: Matan Azrad <matan@mellanox.com>
> > Sent: Thursday, June 6, 2019 4:22 PM
> > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien Mazarguil
> > <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> >
> > Hi Jerin
> 
> Hi Matan,
> 
> >
> > From: Jerin Jacob
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> > > > Sent: Sunday, May 26, 2019 3:48 PM
> > > > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > > Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> > > >
> > > > One of the reasons to destroy a flow is the fact that no packet
> > > > matches the flow for "timeout" time.
> > > > For example, when TCP\UDP sessions are suddenly closed.
> > > >
> > > > Currently, there is no any dpdk mechanism for flow aging and the
> > > > applications use there own ways to detect and destroy aged-out flows.
> > > >
> > > > This RFC introduces flow aging APIs to offload the flow aging task
> > > > from the application to the port.
> > > >
> > > > Design:
> > > > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> > timeout
> > > > and
> > > >   the application flow context for each flow.
> > > > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> > > report
> > > >   that there are new aged-out flows.
> > > > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out
> > flows
> > > >   contexts from the port.
> > > >
> > > > By this design each PMD can use its best way to do the aging with
> > > > the device offloads supported by its HW.
> > > >
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > ---
> > > >  lib/librte_ethdev/rte_ethdev.h |  1 +
> > > >  lib/librte_ethdev/rte_flow.h   | 56
> > > > ++++++++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 57 insertions(+)
> > > >
> > > > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > > > b/lib/librte_ethdev/rte_ethdev.h index 1f35e1d..6fc1531 100644
> > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
> > > >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> > > >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > > >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in
> > > > the port
> > > Does this event supported in HW?
> > It depends in the PMD implementation and HW capability.
> >
> > > Or Are planning to implement with alarm or timer.
> > Again, depends in the PMD implementation.
> >
> > > Just asking because, if none of the HW supports the interrupt then
> > > only rte_flow_get_aged_flows sync API be enough()
> > Why?
> 
> If none of the HW supports it then application/common code can periodically
> polls it.
> If mlx5 hw supports it then it fine to have interrupt.

Actually, MLX5 HW doesn't fully support aging, but the HW can help do it better.
The PMD knows best how to do aging with its HW, even if the HW does not fully support aging.
This may make the application meaningfully more efficient.

> But I think, we need to have means to express a HW/Implementation does
> not support its As there may following reasons why drivers choose to not
> take timer/alarm path
> 1) Some EAL port does not support timer/alarm example: FreeBSD DPDK port
	OK, but why not support it for the other cases (non-FreeBSD ports)?

> 2) If we need to support a few killo rules then timer/alarm implementation
> will be heavy

Not sure; it depends on the HW capability.

> So an option to express un supported event would be fine.

Can you explain more about your intention here (2)?

> > According to the above design this is the way for the PMD to notify
> > the application when it has some aged flows ASAP.
> > So, if the PMD uses an alarm\timer or any other way to support aging
> > action it is better in part of the cases to notify the user
> > asynchronically instead of doing polling by the application.
> > The idea is to let the application to decide what is better for its usage.
> >
> > For mlx5 case,
> > The plan is to raise this event from an HW interrupt handling(same as
> > link event).
> 
> Good to know.

The MLX5 plan is still to use a timer/alarm together with an interrupt mechanism to support aging:
 The HW help here is the ability to query a batch of flow counters asynchronously, with the response carrying the new counter values delivered by an interrupt.

The timer/alarm will call a DevX operation to read a batch of counters asynchronously (a fast command).
The interrupt handler will catch the response and check the timeout for each flow
(no need to copy the counters from HW memory; the values are already in PMD memory). If there is a newly aged flow, it raises the event.
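
The per-flow timeout check done on each counter response can be sketched as follows. This is not the mlx5 implementation: `flow_age_state` and `check_aged_batch` are hypothetical names, time is assumed to be in seconds, and resetting the aged mark when traffic resumes is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow aging state kept by a PMD. */
struct flow_age_state {
	uint64_t last_hits;   /* hit counter seen in the previous batch */
	uint64_t last_active; /* time (seconds) the counter last changed */
	uint16_t timeout;     /* aging timeout in seconds */
	bool aged;            /* already counted as aged-out */
};

/*
 * Process one batch of freshly read counters; hits[i] belongs to flows[i].
 * Returns the number of flows that newly aged out in this batch, so the
 * caller can decide whether to raise the aged event.
 */
static size_t
check_aged_batch(struct flow_age_state *flows, const uint64_t *hits,
		 size_t n, uint64_t now)
{
	size_t i, newly_aged = 0;

	for (i = 0; i < n; i++) {
		struct flow_age_state *f = &flows[i];

		if (hits[i] != f->last_hits) {
			/* Traffic matched since the last batch: restart the clock. */
			f->last_hits = hits[i];
			f->last_active = now;
			f->aged = false;
		} else if (!f->aged && now - f->last_active >= f->timeout) {
			f->aged = true;
			newly_aged++;
		}
	}
	return newly_aged;
}
```

A nonzero return value is the point where a PMD following this plan would raise RTE_ETH_EVENT_FLOW_AGED; the detection latency is bounded by the timer period plus the timeout.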





* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-06-18  5:56       ` Matan Azrad
@ 2019-06-24  6:26         ` Jerin Jacob Kollanukkaran
  2019-06-27  8:26           ` Matan Azrad
  0 siblings, 1 reply; 50+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-06-24  6:26 UTC (permalink / raw)
  To: Matan Azrad, Adrien Mazarguil, dev


> -----Original Message-----
> From: Matan Azrad <matan@mellanox.com>
> Sent: Tuesday, June 18, 2019 11:27 AM
> To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien Mazarguil
> <adrien.mazarguil@6wind.com>; dev@dpdk.org
> Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> 
> Hi Jerin

Hi Matan,

> 
> From: Jerin Jacob
> > Sent: Thursday, June 6, 2019 3:16 PM
> > To: Matan Azrad <matan@mellanox.com>; Adrien Mazarguil
> > <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > Subject: RE: [PATCH] [RFC] ethdev: support flow aging
> >
> > > -----Original Message-----
> > > From: Matan Azrad <matan@mellanox.com>
> > > Sent: Thursday, June 6, 2019 4:22 PM
> > > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien Mazarguil
> > > <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> > >
> > > Hi Jerin
> >
> > Hi Matan,
> >
> > >
> > > From: Jerin Jacob
> > > > > -----Original Message-----
> > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> > > > > Sent: Sunday, May 26, 2019 3:48 PM
> > > > > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > > > Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> > > > >
> > > > > One of the reasons to destroy a flow is the fact that no packet
> > > > > matches the flow for "timeout" time.
> > > > > For example, when TCP\UDP sessions are suddenly closed.
> > > > >
> > > > > Currently, there is no any dpdk mechanism for flow aging and the
> > > > > applications use there own ways to detect and destroy aged-out
> flows.
> > > > >
> > > > > This RFC introduces flow aging APIs to offload the flow aging
> > > > > task from the application to the port.
> > > > >
> > > > > Design:
> > > > > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> > > timeout
> > > > > and
> > > > >   the application flow context for each flow.
> > > > > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver
> to
> > > > report
> > > > >   that there are new aged-out flows.
> > > > > - A new rte_flow API: rte_flow_get_aged_flows to get the
> > > > > aged-out
> > > flows
> > > > >   contexts from the port.
> > > > >
> > > > > By this design each PMD can use its best way to do the aging
> > > > > with the device offloads supported by its HW.
> > > > >
> > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > ---
> > > > >  lib/librte_ethdev/rte_ethdev.h |  1 +
> > > > >  lib/librte_ethdev/rte_flow.h   | 56
> > > > > ++++++++++++++++++++++++++++++++++++++++++
> > > > >  2 files changed, 57 insertions(+)
> > > > >
> > > > > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > > > > b/lib/librte_ethdev/rte_ethdev.h index 1f35e1d..6fc1531 100644
> > > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > > @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
> > > > >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> > > > >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > > > >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > > > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows
> detected in
> > > > > the port
> > > > Does this event supported in HW?
> > > It depends in the PMD implementation and HW capability.
> > >
> > > > Or Are planning to implement with alarm or timer.
> > > Again, depends in the PMD implementation.
> > >
> > > > Just asking because, if none of the HW supports the interrupt then
> > > > only rte_flow_get_aged_flows sync API be enough()
> > > Why?
> >
> > If none of the HW supports it then application/common code can
> > periodically polls it.
> > If mlx5 hw supports it then it fine to have interrupt.
> 
> Actually MLX5 doesn't support aging fully by HW but the HW can help to do it
> better.
> Look, the PMD is the best one to know what is the best way to do aging by its
> HW even if aging is not fully supported by it.
> And it may add a meaningful efficiency to the application.
> 
> > But I think, we need to have means to express a HW/Implementation does
> > not support its As there may following reasons why drivers choose to
> > not take timer/alarm path
> > 1) Some EAL port does not support timer/alarm example: FreeBSD DPDK
> > port
> 	OK, but why not to support it for other cases (no FreeBSD port)?
> 
> > 2) If we need to support a few killo rules then timer/alarm
> > implementation will be heavy
> 
> Not sure, Depend in the HW ability.

Yes, when the HW does not support it at all.

> 
> > So an option to express un supported event would be fine.
> 
> Can you explain more what is your intension here (2)?

To address the case where the HW and/or the OS (like FreeBSD) does not support it at all. In such a case,
expressing that the event is unsupported would help the application handle aging in a synchronous manner.

> 
> > > According to the above design this is the way for the PMD to notify
> > > the application when it has some aged flows ASAP.
> > > So, if the PMD uses an alarm\timer or any other way to support aging
> > > action it is better in part of the cases to notify the user
> > > asynchronically instead of doing polling by the application.
> > > The idea is to let the application to decide what is better for its usage.
> > >
> > > For mlx5 case,
> > > The plan is to raise this event from an HW interrupt handling(same
> > > as link event).
> >
> > Good to know.
> 
> The MLX5 plan is still to use timer/alarm and interrupt mechanism to support
> aging:
>  The HW help here is the ability to query batch of flows counters
> asynchronically, so getting the response of the new counters values by an
> interrupt.
> 
> The timer\alarm will call to devX operation to read batch of counters
> asynchronically - fast command.
> The interrupt handler to catch the response and to check timeout for each
> flow (no need to copy the counters from the HW memory - the values are in
> the PMD memory) - if there is a new aged flow - raise the event.
> 
> 



* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-06-24  6:26         ` Jerin Jacob Kollanukkaran
@ 2019-06-27  8:26           ` Matan Azrad
  0 siblings, 0 replies; 50+ messages in thread
From: Matan Azrad @ 2019-06-27  8:26 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, Adrien Mazarguil, dev

Hi all

Thanks, Jerin, for your comments.
It looks like we agree that the feature is relevant at least for mlx5...

Does anyone else have more comments?


From: Jerin Jacob Kollanukkaran 
> > -----Original Message-----
> > From: Matan Azrad <matan@mellanox.com>
> > Sent: Tuesday, June 18, 2019 11:27 AM
> > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien Mazarguil
> > <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> >
> > Hi Jerin
> 
> Hi Matan,
> 
> >
> > From: Jerin Jacob
> > > Sent: Thursday, June 6, 2019 3:16 PM
> > > To: Matan Azrad <matan@mellanox.com>; Adrien Mazarguil
> > > <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > Subject: RE: [PATCH] [RFC] ethdev: support flow aging
> > >
> > > > -----Original Message-----
> > > > From: Matan Azrad <matan@mellanox.com>
> > > > Sent: Thursday, June 6, 2019 4:22 PM
> > > > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien
> > > > Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > > Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> > > >
> > > > Hi Jerin
> > >
> > > Hi Matan,
> > >
> > > >
> > > > From: Jerin Jacob
> > > > > > -----Original Message-----
> > > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> > > > > > Sent: Sunday, May 26, 2019 3:48 PM
> > > > > > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>;
> > > > > > dev@dpdk.org
> > > > > > Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> > > > > >
> > > > > > One of the reasons to destroy a flow is the fact that no
> > > > > > packet matches the flow for "timeout" time.
> > > > > > For example, when TCP\UDP sessions are suddenly closed.
> > > > > >
> > > > > > Currently, there is no any dpdk mechanism for flow aging and
> > > > > > the applications use there own ways to detect and destroy
> > > > > > aged-out
> > flows.
> > > > > >
> > > > > > This RFC introduces flow aging APIs to offload the flow aging
> > > > > > task from the application to the port.
> > > > > >
> > > > > > Design:
> > > > > > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> > > > timeout
> > > > > > and
> > > > > >   the application flow context for each flow.
> > > > > > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver
> > to
> > > > > report
> > > > > >   that there are new aged-out flows.
> > > > > > - A new rte_flow API: rte_flow_get_aged_flows to get the
> > > > > > aged-out
> > > > flows
> > > > > >   contexts from the port.
> > > > > >
> > > > > > By this design each PMD can use its best way to do the aging
> > > > > > with the device offloads supported by its HW.
> > > > > >
> > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > > ---
> > > > > >  lib/librte_ethdev/rte_ethdev.h |  1 +
> > > > > >  lib/librte_ethdev/rte_flow.h   | 56
> > > > > > ++++++++++++++++++++++++++++++++++++++++++
> > > > > >  2 files changed, 57 insertions(+)
> > > > > >
> > > > > > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > > > > > b/lib/librte_ethdev/rte_ethdev.h index 1f35e1d..6fc1531 100644
> > > > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > > > @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
> > > > > >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> > > > > >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > > > > >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > > > > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows
> > detected in
> > > > > > the port
> > > > > Does this event supported in HW?
> > > > It depends in the PMD implementation and HW capability.
> > > >
> > > > > Or Are planning to implement with alarm or timer.
> > > > Again, depends in the PMD implementation.
> > > >
> > > > > Just asking because, if none of the HW supports the interrupt
> > > > > then only rte_flow_get_aged_flows sync API be enough()
> > > > Why?
> > >
> > > If none of the HW supports it then application/common code can
> > > periodically polls it.
> > > If mlx5 hw supports it then it fine to have interrupt.
> >
> > Actually MLX5 doesn't support aging fully by HW but the HW can help to
> > do it better.
> > Look, the PMD is the best one to know what is the best way to do aging
> > by its HW even if aging is not fully supported by it.
> > And it may add a meaningful efficiency to the application.
> >
> > > But I think, we need to have means to express a HW/Implementation
> > > does not support its As there may following reasons why drivers
> > > choose to not take timer/alarm path
> > > 1) Some EAL port does not support timer/alarm example: FreeBSD DPDK
> > > port
> > 	OK, but why not to support it for other cases (no FreeBSD port)?
> >
> > > 2) If we need to support a few killo rules then timer/alarm
> > > implementation will be heavy
> >
> > Not sure, Depend in the HW ability.
> 
> Yes when HW does not support at all.
> 
> >
> > > So an option to express un supported event would be fine.
> >
> > Can you explain more what is your intension here (2)?
> 
> To address the case where HW and/or OS(Like FreeBSD) does not support at
> all . In such case, Expressing the unsupported would help application to
> handle in synchronous manner.
> 
> >
> > > > According to the above design, this is the way for the PMD to
> > > > notify the application ASAP when it has some aged flows.
> > > > So, if the PMD uses an alarm/timer or any other way to support the
> > > > aging action, it is better in some cases to notify the user
> > > > asynchronously instead of having the application poll.
> > > > The idea is to let the application decide what is better for its usage.
> > > >
> > > > For the mlx5 case, the plan is to raise this event from the HW
> > > > interrupt handler (same as the link event).
> > >
> > > Good to know.
> >
> > The MLX5 plan is still to use a timer/alarm and an interrupt mechanism
> > to support aging:
> >  The HW help here is the ability to query a batch of flow counters
> > asynchronously, getting the response with the new counter values via
> > an interrupt.
> >
> > The timer/alarm will call a devX operation to read a batch of counters
> > asynchronously - a fast command.
> > The interrupt handler catches the response and checks the timeout for
> > each flow (no need to copy the counters from the HW memory - the
> > values are in the PMD memory) - if there is a new aged flow, it raises
> > the event.
> >
> >
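
The timer-plus-counter scheme described above can be sketched in plain C.
This is only an illustration under stated assumptions, not the mlx5 PMD
code: struct aged_flow, age_check() and the fixed tick length are
hypothetical, and the counter values are passed in directly where a real
driver would receive them from an asynchronous batch query.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow aging state kept in driver memory. */
struct aged_flow {
	uint64_t last_hits;   /* hit counter seen at the previous check */
	uint32_t idle_sec;    /* seconds elapsed without a new hit */
	uint32_t timeout_sec; /* timeout requested through the AGE action */
	void *context;        /* user context to report when aged out */
	int reported;         /* non-zero once the flow has been reported */
};

/*
 * Run one periodic check over a batch of flows.
 * hits[i] is the freshly read counter of flows[i], elapsed_sec the time
 * since the previous check. Returns the number of flows that aged out in
 * this pass; a driver would raise the aged event when this is non-zero.
 */
unsigned int
age_check(struct aged_flow *flows, const uint64_t *hits, size_t n,
	  uint32_t elapsed_sec)
{
	unsigned int newly_aged = 0;
	size_t i;

	assert(n == 0 || (flows != NULL && hits != NULL));
	for (i = 0; i < n; i++) {
		struct aged_flow *f = &flows[i];

		if (hits[i] != f->last_hits) {
			/* Traffic matched the flow: restart its idle timer. */
			f->last_hits = hits[i];
			f->idle_sec = 0;
			f->reported = 0;
			continue;
		}
		f->idle_sec += elapsed_sec;
		if (!f->reported && f->idle_sec >= f->timeout_sec) {
			f->reported = 1;
			newly_aged++;
		}
	}
	return newly_aged;
}
```

Each flow is counted once when it crosses its timeout, so the event is
raised only for new aged-out flows, and a later hit re-arms the flow.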



* [dpdk-dev] [PATCH v2] [RFC] ethdev: support flow aging
  2019-05-26 10:18 [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging Matan Azrad
  2019-06-06 10:24 ` Jerin Jacob Kollanukkaran
@ 2020-03-16 10:22 ` BillZhou
  2020-03-16 12:52 ` BillZhou
  2 siblings, 0 replies; 50+ messages in thread
From: BillZhou @ 2020-03-16 10:22 UTC (permalink / raw)
  To: adrien.mazarguil, matan; +Cc: dev

One reason to destroy a flow is that no packet has matched it for a
"timeout" period, for example when TCP/UDP sessions are closed
abruptly.

Currently, there is no DPDK mechanism for flow aging, and applications
use their own ways to detect and destroy aged-out flows.

This RFC introduces flow aging APIs to offload the flow aging task from
the application to the port.

Design:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
  contexts from the port.

With this design, each PMD can choose the best way to do the aging with
the device offloads supported by its HW.

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
v2: Remove the "struct rte_flow *flows[]" parameter from the
rte_flow_get_aged_flows API.
---
 lib/librte_ethdev/rte_ethdev.h |  1 +
 lib/librte_ethdev/rte_flow.h   | 56 ++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d1a593ad11..03135a7138 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in the port */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 5625dc4917..1fc05bf56c 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -2051,6 +2051,14 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_dscp.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
+
+	/**
+	 * Report as aged-out if timeout passed without any matching on the
+	 * flow.
+	 *
+	 * See struct rte_flow_action_age.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2633,6 +2641,22 @@ struct rte_flow_action {
 	const void *conf; /**< Pointer to action configuration object. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report as aged-out if timeout passed without any matching on the flow.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint16_t timeout; /**< Time in seconds. */
+	void *context; /**< The user flow context. */
+};
+
 /**
  * Opaque type returned after successfully creating a flow.
  *
@@ -3224,6 +3248,38 @@ rte_flow_conv(enum rte_flow_conv_op op,
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * RTE_ETH_EVENT_FLOW_AGED is triggered when a port detects aged-out flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously when the user wants it.
+ * The callback synchronization is the user's responsibility.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in,out] contexts
+ *   An allocated array to receive the aged-out flow contexts taken from the
+ *   input age action config; if @p contexts is NULL, return the aged-out
+ *   flows. NULL indicates the flow contexts should not be reported.
+ * @param[in] nb_context
+ *   The number of allocated entries in @p contexts, if it exists.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   0 if there are no aged-out contexts or flows, a positive value giving
+ *   the number of aged-out contexts or flows reported to @p contexts, or a
+ *   negative errno value otherwise, in which case rte_errno is set.
+ *
+ * @see rte_flow_action_age
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, void *contexts[],
+			int nb_context, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.21.0
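
To illustrate the return convention documented for
rte_flow_get_aged_flows, the sketch below models the port-side list of
aged-out contexts with a small fixed-size queue. Everything in it is
hypothetical (struct aged_queue, aged_queue_put(), aged_queue_get(), the
queue size); treating a NULL array as a request for the count only is one
possible reading of the comment in the patch, not confirmed driver
behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define AGED_QUEUE_SIZE 8 /* arbitrary capacity for the illustration */

/* Hypothetical FIFO of user contexts belonging to aged-out flows. */
struct aged_queue {
	void *ctx[AGED_QUEUE_SIZE];
	int count;
};

/* Record one aged-out context; returns 0, or -ENOSPC when full. */
int
aged_queue_put(struct aged_queue *q, void *context)
{
	assert(q != NULL);
	if (q->count == AGED_QUEUE_SIZE)
		return -ENOSPC;
	q->ctx[q->count++] = context;
	return 0;
}

/*
 * Copy up to nb_context pending contexts into contexts[] and remove them
 * from the queue. With contexts == NULL only the pending count is returned
 * and nothing is removed. Returns a negative errno value on bad arguments.
 */
int
aged_queue_get(struct aged_queue *q, void *contexts[], int nb_context)
{
	int n, i;

	if (q == NULL || nb_context < 0)
		return -EINVAL;
	if (contexts == NULL)
		return q->count;
	n = q->count < nb_context ? q->count : nb_context;
	for (i = 0; i < n; i++)
		contexts[i] = q->ctx[i];
	/* Shift the entries that did not fit to the front of the queue. */
	for (i = n; i < q->count; i++)
		q->ctx[i - n] = q->ctx[i];
	q->count -= n;
	return n;
}
```

A caller with a small array can simply call the getter again until it
returns 0, which matches the "0 means nothing aged out" convention.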



* [dpdk-dev] [PATCH v2] [RFC] ethdev: support flow aging
  2019-05-26 10:18 [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging Matan Azrad
  2019-06-06 10:24 ` Jerin Jacob Kollanukkaran
  2020-03-16 10:22 ` [dpdk-dev] [PATCH v2] " BillZhou
@ 2020-03-16 12:52 ` BillZhou
  2020-03-20  6:59   ` Jerin Jacob
                     ` (3 more replies)
  2 siblings, 4 replies; 50+ messages in thread
From: BillZhou @ 2020-03-16 12:52 UTC (permalink / raw)
  To: adrien.mazarguil, matan; +Cc: dev

One reason to destroy a flow is that no packet has matched it for a
"timeout" period, for example when TCP/UDP sessions are closed
abruptly.

Currently, there is no DPDK mechanism for flow aging, and applications
use their own ways to detect and destroy aged-out flows.

This RFC introduces flow aging APIs to offload the flow aging task from
the application to the port.

Design:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
  contexts from the port.

With this design, each PMD can choose the best way to do the aging with
the device offloads supported by its HW.

Signed-off-by: BillZhou <dongz@mellanox.com>
---
v2: Remove the "struct rte_flow *flows[]" parameter from the
rte_flow_get_aged_flows API.
---
 lib/librte_ethdev/rte_ethdev.h |  1 +
 lib/librte_ethdev/rte_flow.h   | 56 ++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d1a593ad11..03135a7138 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in the port */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 5625dc4917..1fc05bf56c 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -2051,6 +2051,14 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_dscp.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
+
+	/**
+	 * Report as aged-out if timeout passed without any matching on the
+	 * flow.
+	 *
+	 * See struct rte_flow_action_age.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2633,6 +2641,22 @@ struct rte_flow_action {
 	const void *conf; /**< Pointer to action configuration object. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report as aged-out if timeout passed without any matching on the flow.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint16_t timeout; /**< Time in seconds. */
+	void *context; /**< The user flow context. */
+};
+
 /**
  * Opaque type returned after successfully creating a flow.
  *
@@ -3224,6 +3248,38 @@ rte_flow_conv(enum rte_flow_conv_op op,
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * RTE_ETH_EVENT_FLOW_AGED is triggered when a port detects aged-out flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously when the user wants it.
+ * The callback synchronization is the user's responsibility.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in,out] contexts
+ *   An allocated array to receive the aged-out flow contexts taken from the
+ *   input age action config; if @p contexts is NULL, return the aged-out
+ *   flows. NULL indicates the flow contexts should not be reported.
+ * @param[in] nb_context
+ *   The number of allocated entries in @p contexts, if it exists.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   0 if there are no aged-out contexts or flows, a positive value giving
+ *   the number of aged-out contexts or flows reported to @p contexts, or a
+ *   negative errno value otherwise, in which case rte_errno is set.
+ *
+ * @see rte_flow_action_age
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, void *contexts[],
+			int nb_context, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.21.0



* Re: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
  2019-06-06 12:15     ` Jerin Jacob Kollanukkaran
  2019-06-18  5:56       ` Matan Azrad
@ 2020-03-16 16:13       ` Stephen Hemminger
  1 sibling, 0 replies; 50+ messages in thread
From: Stephen Hemminger @ 2020-03-16 16:13 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran; +Cc: Matan Azrad, Adrien Mazarguil, dev

On Thu, 6 Jun 2019 12:15:50 +0000
Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:

> > -----Original Message-----
> > From: Matan Azrad <matan@mellanox.com>
> > Sent: Thursday, June 6, 2019 4:22 PM
> > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Adrien Mazarguil
> > <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > Subject: [EXT] RE: [PATCH] [RFC] ethdev: support flow aging
> > 
> > Hi Jerin  
> 
> Hi Matan,
> 
> > 
> > From: Jerin Jacob  
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> > > > Sent: Sunday, May 26, 2019 3:48 PM
> > > > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org
> > > > Subject: [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging
> > > >
> > > > One of the reasons to destroy a flow is the fact that no packet
> > > > matches the flow for "timeout" time.
> > > > For example, when TCP\UDP sessions are suddenly closed.
> > > >
> > > > Currently, there is no any dpdk mechanism for flow aging and the
> > > > applications use there own ways to detect and destroy aged-out flows.
> > > >
> > > > This RFC introduces flow aging APIs to offload the flow aging task
> > > > from the application to the port.
> > > >
> > > > Design:
> > > > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the  
> > timeout  
> > > > and
> > > >   the application flow context for each flow.
> > > > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to  
> > > report  
> > > >   that there are new aged-out flows.
> > > > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out  
> > flows  
> > > >   contexts from the port.
> > > >
> > > > By this design each PMD can use its best way to do the aging with
> > > > the device offloads supported by its HW.
> > > >
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > ---
> > > >  lib/librte_ethdev/rte_ethdev.h |  1 +
> > > >  lib/librte_ethdev/rte_flow.h   | 56
> > > > ++++++++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 57 insertions(+)
> > > >
> > > > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > > > b/lib/librte_ethdev/rte_ethdev.h index 1f35e1d..6fc1531 100644
> > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > @@ -2771,6 +2771,7 @@ enum rte_eth_event_type {
> > > >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> > > >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > > >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in
> > > > the port  
> > > Is this event supported in HW?
> > It depends on the PMD implementation and HW capability.
> >
> > > Or are you planning to implement it with an alarm or timer?
> > Again, it depends on the PMD implementation.
> >
> > > Just asking because, if none of the HW supports the interrupt, then
> > > the synchronous rte_flow_get_aged_flows() API alone would be enough.
> > Why?
> 
> If none of the HW supports it then the application/common code can poll it periodically.
> If mlx5 HW supports it then it is fine to have the interrupt.
> But I think we need a means to express that a HW/implementation does not support it,
> as there may be the following reasons why drivers choose not to take the timer/alarm path:
> 1) Some EAL ports do not support timer/alarm, for example the FreeBSD DPDK port
> 2) If we need to support a few thousand rules then a timer/alarm implementation will be heavy
> So an option to express an unsupported event would be fine.

This API needs to be defined in a way that makes it possible to write
an application that works on multiple types of hardware. This is often hard
to do with DPDK because too often APIs are added that are convenient for the
driver writer.

There must be only one way that flow aging notifications happen, and they must
only occur in a specific context. Is this in a normal DPDK thread, in an
interrupt thread, or in an alarm thread? Choose one and make all drivers do the
same thing.
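
One way an application could keep all flow handling in a single,
well-defined context, whatever thread the driver raises the event from,
is to let the event callback do nothing but set a flag and to drain the
aged flows from a regular worker loop. The sketch below assumes C11
atomics; on_flow_aged(), worker_poll() and drain_aged_flows() are
hypothetical stand-ins, not DPDK functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set from the notification context, consumed by the worker thread. */
static atomic_bool aged_pending;

/* Number of drain passes executed, kept only to make the sketch observable. */
static unsigned int drain_calls;

/* Hypothetical event callback: may run in an interrupt or alarm thread. */
void
on_flow_aged(void)
{
	atomic_store_explicit(&aged_pending, true, memory_order_release);
}

/* Hypothetical stand-in for querying and destroying the aged flows. */
static void
drain_aged_flows(void)
{
	drain_calls++;
}

/*
 * Called from the worker's main loop. Returns true when a drain pass ran.
 * The exchange clears the flag before draining, so an event arriving during
 * the pass is not lost: it simply triggers another pass on the next poll.
 */
bool
worker_poll(void)
{
	if (!atomic_exchange_explicit(&aged_pending, false,
				      memory_order_acq_rel))
		return false;
	drain_aged_flows();
	return true;
}
```

With this split the flow API is only ever called from the worker, so the
application does not depend on which context a given driver uses for the
notification.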


* Re: [dpdk-dev] [PATCH v2] [RFC] ethdev: support flow aging
  2020-03-16 12:52 ` BillZhou
@ 2020-03-20  6:59   ` Jerin Jacob
  2020-03-24 10:18   ` Andrew Rybchenko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Jerin Jacob @ 2020-03-20  6:59 UTC (permalink / raw)
  To: BillZhou; +Cc: Adrien Mazarguil, Matan Azrad, dpdk-dev

On Mon, Mar 16, 2020 at 6:22 PM BillZhou <dongz@mellanox.com> wrote:
>
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
>
> Currently, there is no any dpdk mechanism for flow aging and the
> applications use there own ways to detect and destroy aged-out flows.
>
> This RFC introduces flow aging APIs to offload the flow aging task from
> the application to the port.
>
> Design:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
>
> By this design each PMD can use its best way to do the aging with the
> device offloads supported by its HW.
>
> Signed-off-by: BillZhou <dongz@mellanox.com>
> ---
> v2:For API rte_flow_get_aged_flows, delete "struct rte_flow *flows[]"
> this parameter.
> ---
>  lib/librte_ethdev/rte_ethdev.h |  1 +
>  lib/librte_ethdev/rte_flow.h   | 56 ++++++++++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
>
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index d1a593ad11..03135a7138 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>         RTE_ETH_EVENT_NEW,      /**< port is probed */
>         RTE_ETH_EVENT_DESTROY,  /**< port is released */
>         RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +       RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in the port */
>         RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };
>
> diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
> index 5625dc4917..1fc05bf56c 100644
> --- a/lib/librte_ethdev/rte_flow.h
> +++ b/lib/librte_ethdev/rte_flow.h
> @@ -2051,6 +2051,14 @@ enum rte_flow_action_type {
>          * See struct rte_flow_action_set_dscp.
>          */
>         RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
> +
> +       /**
> +        * Report as aged-out if timeout passed without any matching on the
> +        * flow.
> +        *
> +        * See struct rte_flow_action_age.
> +        */
> +       RTE_FLOW_ACTION_TYPE_AGE,
>  };
>
>  /**
> @@ -2633,6 +2641,22 @@ struct rte_flow_action {
>         const void *conf; /**< Pointer to action configuration object. */
>  };
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this structure may change without prior notice
> + *
> + * RTE_FLOW_ACTION_TYPE_AGE
> + *
> + * Report as aged-out if timeout passed without any matching on the flow.
> + *
> + * The flow context and the flow handle will be reported by the
> + * rte_flow_get_aged_flows API.
> + */
> +struct rte_flow_action_age {
> +       uint16_t timeout; /**< Time in seconds. */
> +       void *context; /**< The user flow context. */
> +};
> +
>  /**
>   * Opaque type returned after successfully creating a flow.
>   *
> @@ -3224,6 +3248,38 @@ rte_flow_conv(enum rte_flow_conv_op op,
>               const void *src,
>               struct rte_flow_error *error);
>
> +/**
> + * Get aged-out flows of a given port.
> + *
> + * RTE_ETH_EVENT_FLOW_AGED is triggered when a port detects aged-out flows.
> + * This function can be called to get the aged flows usynchronously from the

s/usynchronously/ asynchronously


> + * event callback or synchronously when the user wants it.
> + * The callback synchronization is on the user responsibility.
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[in/out] contexts
> + *   An allocated array to get the aged-out flows contexts from input age
> + *   action config, if input contexts is null, return the aged-out flows.
> + *   NULL indicates the flow contexts should not be reported.
> + * @param[in] nb_context

By default, everything is [in]. No need to mention [in] explicitly.

> + *   The allocated array entries number of @p contexts if exist.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL. Initialized in case of
> + *   error only.
> + *
> + * @return
> + *   0 in case there are not any aged-out contexts or flows, otherwise if
> + *   positive is the number of the reported aged-out contexts or flows to
> + *   @p contexts, a negative errno value otherwise and rte_errno is set.
> + *
> + * @see rte_flow_action_age

RTE_ETH_EVENT_FLOW_AGED can be added in @see

Other than the above nits,

This RFC looks good to me.

> + */
> +__rte_experimental
> +int
> +rte_flow_get_aged_flows(uint16_t port_id, void *contexts[],
> +                       int nb_context, struct rte_flow_error *error);
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --
> 2.21.0
>


* Re: [dpdk-dev] [PATCH v2] [RFC] ethdev: support flow aging
  2020-03-16 12:52 ` BillZhou
  2020-03-20  6:59   ` Jerin Jacob
@ 2020-03-24 10:18   ` Andrew Rybchenko
  2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
  2020-04-13 14:53   ` [dpdk-dev] [PATCH 0/2] " Dong Zhou
  3 siblings, 0 replies; 50+ messages in thread
From: Andrew Rybchenko @ 2020-03-24 10:18 UTC (permalink / raw)
  To: BillZhou, adrien.mazarguil, matan; +Cc: dev

On 3/16/20 3:52 PM, BillZhou wrote:
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is no any dpdk mechanism for flow aging and the
> applications use there own ways to detect and destroy aged-out flows.
> 
> This RFC introduces flow aging APIs to offload the flow aging task from
> the application to the port.
> 
> Design:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> 
> By this design each PMD can use its best way to do the aging with the
> device offloads supported by its HW.
> 
> Signed-off-by: BillZhou <dongz@mellanox.com>

LGTM

> ---
> v2:For API rte_flow_get_aged_flows, delete "struct rte_flow *flows[]"
> this parameter.
> ---
>  lib/librte_ethdev/rte_ethdev.h |  1 +
>  lib/librte_ethdev/rte_flow.h   | 56 ++++++++++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index d1a593ad11..03135a7138 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows detected in the port */
>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };
>  
> diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
> index 5625dc4917..1fc05bf56c 100644
> --- a/lib/librte_ethdev/rte_flow.h
> +++ b/lib/librte_ethdev/rte_flow.h
> @@ -2051,6 +2051,14 @@ enum rte_flow_action_type {
>  	 * See struct rte_flow_action_set_dscp.
>  	 */
>  	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
> +
> +	/**
> +	 * Report as aged-out if timeout passed without any matching on the
> +	 * flow.
> +	 *
> +	 * See struct rte_flow_action_age.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_AGE,
>  };
>  
>  /**
> @@ -2633,6 +2641,22 @@ struct rte_flow_action {
>  	const void *conf; /**< Pointer to action configuration object. */
>  };
>  
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this structure may change without prior notice
> + *
> + * RTE_FLOW_ACTION_TYPE_AGE
> + *
> + * Report as aged-out if timeout passed without any matching on the flow.
> + *
> + * The flow context and the flow handle will be reported by the
> + * rte_flow_get_aged_flows API.
> + */
> +struct rte_flow_action_age {
> +	uint16_t timeout; /**< Time in seconds. */

Is it intentionally defined small? Maybe it is better to
use uint32_t? I just want to understand the rationale behind
the type choice.

> +	void *context; /**< The user flow context. */
> +};
> +
>  /**
>   * Opaque type returned after successfully creating a flow.
>   *
> @@ -3224,6 +3248,38 @@ rte_flow_conv(enum rte_flow_conv_op op,
>  	      const void *src,
>  	      struct rte_flow_error *error);
>  
> +/**
> + * Get aged-out flows of a given port.
> + *
> + * RTE_ETH_EVENT_FLOW_AGED is triggered when a port detects aged-out flows.
> + * This function can be called to get the aged flows usynchronously from the
> + * event callback or synchronously when the user wants it.
> + * The callback synchronization is on the user responsibility.
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[in/out] contexts
> + *   An allocated array to get the aged-out flows contexts from input age
> + *   action config, if input contexts is null, return the aged-out flows.
> + *   NULL indicates the flow contexts should not be reported.
> + * @param[in] nb_context
> + *   The allocated array entries number of @p contexts if exist.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL. Initialized in case of
> + *   error only.
> + *
> + * @return
> + *   0 in case there are not any aged-out contexts or flows, otherwise if
> + *   positive is the number of the reported aged-out contexts or flows to
> + *   @p contexts, a negative errno value otherwise and rte_errno is set.
> + *
> + * @see rte_flow_action_age
> + */
> +__rte_experimental
> +int
> +rte_flow_get_aged_flows(uint16_t port_id, void *contexts[],
> +			int nb_context, struct rte_flow_error *error);
> +
>  #ifdef __cplusplus
>  }
>  #endif
> 
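
For scale, the range of the timeout field can be worked out directly.
The helper names below are made up for the illustration; the 16-bit
width is the one in this RFC, and the 24-bit width is the one used by
the later non-RFC patch in this thread.

```c
#include <stdint.h>

/* Largest value representable by an unsigned field of the given width. */
uint32_t
max_timeout_sec(unsigned int bits)
{
	return bits >= 32 ? UINT32_MAX : (UINT32_C(1) << bits) - 1;
}

/* Whole hours covered by a timeout expressed in seconds. */
uint32_t
sec_to_hours(uint32_t sec)
{
	return sec / 3600;
}
```

A 16-bit field in seconds tops out at 65535 s, a little over 18 hours,
while 24 bits reach 16777215 s, roughly 194 days.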



* [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-03-16 12:52 ` BillZhou
  2020-03-20  6:59   ` Jerin Jacob
  2020-03-24 10:18   ` Andrew Rybchenko
@ 2020-04-10  9:46   ` BillZhou
  2020-04-10 10:14     ` Thomas Monjalon
                       ` (3 more replies)
  2020-04-13 14:53   ` [dpdk-dev] [PATCH 0/2] " Dong Zhou
  3 siblings, 4 replies; 50+ messages in thread
From: BillZhou @ 2020-04-10  9:46 UTC (permalink / raw)
  To: matan, dongz, orika, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, thomas, ferruh.yigit, arybchenko
  Cc: dev

One reason to destroy a flow is that no packet has matched it for a
"timeout" period, for example when TCP/UDP sessions are closed
abruptly.

Currently, there is no DPDK mechanism for flow aging, and applications
use their own ways to detect and destroy aged-out flows.

The flow aging implementation needs to include:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
  contexts from the port.
- Support for entering the flow aging action on the testpmd command line.

Signed-off-by: BillZhou <dongz@mellanox.com>
---
 app/test-pmd/cmdline_flow.c              | 26 ++++++++++
 doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
 doc/guides/rel_notes/release_20_05.rst   | 11 +++++
 lib/librte_ethdev/rte_ethdev.h           |  1 +
 lib/librte_ethdev/rte_ethdev_version.map |  3 ++
 lib/librte_ethdev/rte_flow.c             | 18 +++++++
 lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_flow_driver.h      |  6 +++
 8 files changed, 149 insertions(+)

diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
index 4877ac6c8a..9787dc5907 100644
--- a/app/test-pmd/cmdline_flow.c
+++ b/app/test-pmd/cmdline_flow.c
@@ -343,6 +343,8 @@ enum index {
 	ACTION_SET_IPV4_DSCP_VALUE,
 	ACTION_SET_IPV6_DSCP,
 	ACTION_SET_IPV6_DSCP_VALUE,
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
 };
 
 /** Maximum size for pattern in struct rte_flow_item_raw. */
@@ -1146,6 +1148,7 @@ static const enum index next_action[] = {
 	ACTION_SET_META,
 	ACTION_SET_IPV4_DSCP,
 	ACTION_SET_IPV6_DSCP,
+	ACTION_AGE,
 	ZERO,
 };
 
@@ -1371,6 +1374,13 @@ static const enum index action_set_ipv6_dscp[] = {
 	ZERO,
 };
 
+static const enum index action_age[] = {
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
+	ACTION_NEXT,
+	ZERO,
+};
+
 static int parse_set_raw_encap_decap(struct context *, const struct token *,
 				     const char *, unsigned int,
 				     void *, unsigned int);
@@ -3692,6 +3702,22 @@ static const struct token token_list[] = {
 			     (struct rte_flow_action_set_dscp, dscp)),
 		.call = parse_vc_conf,
 	},
+	[ACTION_AGE] = {
+		.name = "age",
+		.help = "report flow as aged-out after a timeout",
+		.next = NEXT(action_age),
+		.priv = PRIV_ACTION(AGE,
+			sizeof(struct rte_flow_action_age)),
+		.call = parse_vc,
+	},
+	[ACTION_AGE_TIMEOUT] = {
+		.name = "timeout",
+		.help = "flow age timeout value",
+		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
+					   timeout, 24)),
+		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
+		.call = parse_vc_conf,
+	},
 };
 
 /** Remove and return last entry from argument stack. */
diff --git a/doc/guides/prog_guide/rte_flow.rst b/doc/guides/prog_guide/rte_flow.rst
index 41c147913c..cf4368e1c4 100644
--- a/doc/guides/prog_guide/rte_flow.rst
+++ b/doc/guides/prog_guide/rte_flow.rst
@@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
    | ``dscp``  | DSCP in low 6 bits, rest ignore |
    +-----------+---------------------------------+
 
+Action: ``AGE``
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Set the ageing timeout configuration of a flow.
+
+The RTE_ETH_EVENT_FLOW_AGED event is reported if the
+timeout passes without any packet matching the flow.
+
+.. _table_rte_flow_action_age:
+
+.. table:: AGE
+
+   +--------------+---------------------------------+
+   | Field        | Value                           |
+   +==============+=================================+
+   | ``timeout``  | 24 bits timeout value           |
+   +--------------+---------------------------------+
+   | ``reserved`` | 8 bits reserved, must be zero   |
+   +--------------+---------------------------------+
+   | ``context``  | user input flow context         |
+   +--------------+---------------------------------+
+
 Negative types
 ~~~~~~~~~~~~~~
 
diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index 2596269da5..85658663bf 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -62,6 +62,7 @@ New Features
 
   * Added support for matching on IPv4 Time To Live and IPv6 Hop Limit.
   * Added support for creating Relaxed Ordering Memory Regions.
+  * Added support for flow aging mechanism based on counter.
 
 * **Updated the Intel ice driver.**
 
@@ -78,6 +79,16 @@ New Features
   * Hierarchial Scheduling with DWRR and SP.
   * Single rate - Two color, Two rate - Three color shaping.
 
+* **Added flow aging support.**
+
+  Added flow aging support to detect and report aged-out flows, including:
+
+  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and the
+    application flow context for each flow.
+  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to report that
+    there are new aged-out flows.
+  * Added new API: rte_flow_get_aged_flows to get the aged-out flows contexts
+    from the port.
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d1a593ad11..74c9d00f36 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows are detected */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 3f32fdecf7..fa4b5816be 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -230,4 +230,7 @@ EXPERIMENTAL {
 
 	# added in 20.02
 	rte_flow_dev_dump;
+
+	# added in 20.05
+	rte_flow_get_aged_flows;
 };
diff --git a/lib/librte_ethdev/rte_flow.c b/lib/librte_ethdev/rte_flow.c
index a5ac1c7fbd..3699edce49 100644
--- a/lib/librte_ethdev/rte_flow.c
+++ b/lib/librte_ethdev/rte_flow.c
@@ -172,6 +172,7 @@ static const struct rte_flow_desc_data rte_flow_desc_action[] = {
 	MK_FLOW_ACTION(SET_META, sizeof(struct rte_flow_action_set_meta)),
 	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct rte_flow_action_set_dscp)),
 	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct rte_flow_action_set_dscp)),
+	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
 };
 
 int
@@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file, struct rte_flow_error *error)
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 				  NULL, rte_strerror(ENOSYS));
 }
+
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+		    uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->get_aged_flows))
+		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
+				nb_contexts, error), error);
+	return rte_flow_error_set(error, ENOTSUP,
+				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+				  NULL, rte_strerror(ENOTSUP));
+}
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 7f3e08fad3..fab44f6c0b 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_dscp.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
+
+	/**
+	 * Report the flow as aged-out if the timeout passes without any
+	 * packet matching the flow.
+	 *
+	 * See struct rte_flow_action_age.
+	 * See function rte_flow_get_aged_flows.
+	 * See enum RTE_ETH_EVENT_FLOW_AGED.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
 	uint16_t index; /**< Queue index to use. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report the flow as aged-out if the timeout passes without any packet
+ * matching the flow. The RTE_ETH_EVENT_FLOW_AGED event is triggered when
+ * a port detects new aged-out flows.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint32_t timeout:24; /**< Time in seconds. */
+	uint32_t reserved:8; /**< Reserved, must be zero. */
+	void *context;
+		/**< The user flow context, NULL means the rte_flow pointer. */
+};
 
 /**
  * @warning
@@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * The RTE_ETH_EVENT_FLOW_AGED event is triggered when at least one new
+ * aged-out flow is detected after the last call to rte_flow_get_aged_flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously regardless of the event.
+ * It is not safe to call rte_flow_get_aged_flows concurrently with other
+ * flow functions from multiple threads.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in, out] contexts
+ *   The address of an array of pointers to the aged-out flow contexts.
+ * @param[in] nb_contexts
+ *   The number of pointers in the contexts array.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   If nb_contexts is 0, the total number of aged contexts.
+ *   If nb_contexts is not 0, the number of aged flows reported in the
+ *   contexts array. A negative errno value on error.
+ *
+ * @see rte_flow_action_age
+ * @see RTE_ETH_EVENT_FLOW_AGED
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_flow_driver.h b/lib/librte_ethdev/rte_flow_driver.h
index 51a9a57a0f..881cc469b7 100644
--- a/lib/librte_ethdev/rte_flow_driver.h
+++ b/lib/librte_ethdev/rte_flow_driver.h
@@ -101,6 +101,12 @@ struct rte_flow_ops {
 		(struct rte_eth_dev *dev,
 		 FILE *file,
 		 struct rte_flow_error *error);
+	/** See rte_flow_get_aged_flows() */
+	int (*get_aged_flows)
+		(struct rte_eth_dev *dev,
+		 void **context,
+		 uint32_t nb_contexts,
+		 struct rte_flow_error *err);
 };
 
 /**
-- 
2.21.0



* Re: [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
@ 2020-04-10 10:14     ` Thomas Monjalon
  2020-04-13  4:02       ` Bill Zhou
  2020-04-10 12:07     ` Andrew Rybchenko
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 50+ messages in thread
From: Thomas Monjalon @ 2020-04-10 10:14 UTC (permalink / raw)
  To: BillZhou
  Cc: matan, orika, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, ferruh.yigit, arybchenko, dev

10/04/2020 11:46, BillZhou:
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is no any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
> 
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> - Support input flow aging command line in Testpmd.
> 
> Signed-off-by: BillZhou <dongz@mellanox.com>

I think you should insert a space in your name: Bill Zhou.
I find strange to associate "Bill Zhou" with "dongz" in your email.
Are you sure you don't want to mention "Dong"?


> +  * Added support for flow Aging mechanism base on counter.

Aging -> aging
base -> based
counter -> hardware counter?





* Re: [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
  2020-04-10 10:14     ` Thomas Monjalon
@ 2020-04-10 12:07     ` Andrew Rybchenko
  2020-04-10 12:41       ` Jerin Jacob
  2020-04-12  9:13     ` Ori Kam
  2020-04-14  8:32     ` [dpdk-dev] [PATCH v2] " Dong Zhou
  3 siblings, 1 reply; 50+ messages in thread
From: Andrew Rybchenko @ 2020-04-10 12:07 UTC (permalink / raw)
  To: BillZhou, matan, orika, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, thomas,
	ferruh.yigit
  Cc: dev

On 4/10/20 12:46 PM, BillZhou wrote:
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
>
> Currently, there is no any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
>
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>    the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>    that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>    contexts from the port.
> - Support input flow aging command line in Testpmd.
>
> Signed-off-by: BillZhou <dongz@mellanox.com>

Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>



* Re: [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-04-10 12:07     ` Andrew Rybchenko
@ 2020-04-10 12:41       ` Jerin Jacob
  0 siblings, 0 replies; 50+ messages in thread
From: Jerin Jacob @ 2020-04-10 12:41 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: BillZhou, Matan Azrad, Ori Kam, Wenzhuo Lu, Jingjing Wu,
	Bernard Iremonger, John McNamara, Marko Kovacevic,
	Thomas Monjalon, Ferruh Yigit, dpdk-dev

On Fri, Apr 10, 2020 at 5:38 PM Andrew Rybchenko
<arybchenko@solarflare.com> wrote:
>
> On 4/10/20 12:46 PM, BillZhou wrote:
> > One of the reasons to destroy a flow is the fact that no packet matches the
> > flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is no any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> >
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
> >    the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
> >    that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >    contexts from the port.
> > - Support input flow aging command line in Testpmd.
> >
> > Signed-off-by: BillZhou <dongz@mellanox.com>
>
> Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>


>


* Re: [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
  2020-04-10 10:14     ` Thomas Monjalon
  2020-04-10 12:07     ` Andrew Rybchenko
@ 2020-04-12  9:13     ` Ori Kam
  2020-04-12  9:48       ` Matan Azrad
  2020-04-14  8:32     ` [dpdk-dev] [PATCH v2] " Dong Zhou
  3 siblings, 1 reply; 50+ messages in thread
From: Ori Kam @ 2020-04-12  9:13 UTC (permalink / raw)
  To: Bill Zhou, Matan Azrad, Bill Zhou, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, ferruh.yigit, arybchenko
  Cc: dev



> -----Original Message-----
> From: BillZhou <dongz@mellanox.com>
> Sent: Friday, April 10, 2020 12:47 PM
> To: Matan Azrad <matan@mellanox.com>; Bill Zhou <dongz@mellanox.com>;
> Ori Kam <orika@mellanox.com>; wenzhuo.lu@intel.com;
> jingjing.wu@intel.com; bernard.iremonger@intel.com;
> john.mcnamara@intel.com; marko.kovacevic@intel.com; Thomas Monjalon
> <thomas@monjalon.net>; ferruh.yigit@intel.com; arybchenko@solarflare.com
> Cc: dev@dpdk.org
> Subject: [PATCH] ethdev: support flow aging
> 
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is no any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
> 
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> - Support input flow aging command line in Testpmd.
> 
> Signed-off-by: BillZhou <dongz@mellanox.com>
> ---

Nice patch.
Acked-by: Ori Kam <orika@mellanox.com>
Thanks,
Ori


* Re: [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-04-12  9:13     ` Ori Kam
@ 2020-04-12  9:48       ` Matan Azrad
  0 siblings, 0 replies; 50+ messages in thread
From: Matan Azrad @ 2020-04-12  9:48 UTC (permalink / raw)
  To: Ori Kam, Bill Zhou, Bill Zhou, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, ferruh.yigit, arybchenko
  Cc: dev

From: Ori Kam <orika@mellanox.com>
> > -----Original Message-----
> > From: BillZhou <dongz@mellanox.com>
> > Sent: Friday, April 10, 2020 12:47 PM
> > To: Matan Azrad <matan@mellanox.com>; Bill Zhou
> <dongz@mellanox.com>;
> > Ori Kam <orika@mellanox.com>; wenzhuo.lu@intel.com;
> > jingjing.wu@intel.com; bernard.iremonger@intel.com;
> > john.mcnamara@intel.com; marko.kovacevic@intel.com; Thomas
> Monjalon
> > <thomas@monjalon.net>; ferruh.yigit@intel.com;
> > arybchenko@solarflare.com
> > Cc: dev@dpdk.org
> > Subject: [PATCH] ethdev: support flow aging
> >
> > One of the reasons to destroy a flow is the fact that no packet
> > matches the flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is no any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> >
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
> and
> >   the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> >   that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >   contexts from the port.
> > - Support input flow aging command line in Testpmd.
> >
> > Signed-off-by: BillZhou <dongz@mellanox.com>
> > ---
> 
> Nice patch.
> Acked-by: Ori Kam <orika@mellanox.com>
> Thanks,
> Ori
Acked-by: Matan Azrad <matan@mellanox.com>


* Re: [dpdk-dev] [PATCH] ethdev: support flow aging
  2020-04-10 10:14     ` Thomas Monjalon
@ 2020-04-13  4:02       ` Bill Zhou
  0 siblings, 0 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-13  4:02 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Matan Azrad, Ori Kam, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, ferruh.yigit, arybchenko, dev

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, April 10, 2020 6:14 PM
> To: Bill Zhou <dongz@mellanox.com>
> Cc: Matan Azrad <matan@mellanox.com>; Ori Kam <orika@mellanox.com>;
> wenzhuo.lu@intel.com; jingjing.wu@intel.com;
> bernard.iremonger@intel.com; john.mcnamara@intel.com;
> marko.kovacevic@intel.com; ferruh.yigit@intel.com;
> arybchenko@solarflare.com; dev@dpdk.org
> Subject: Re: [PATCH] ethdev: support flow aging
> 
> 10/04/2020 11:46, BillZhou:
> > One of the reasons to destroy a flow is the fact that no packet
> > matches the flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is no any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> >
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> timeout and
> >   the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> >   that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >   contexts from the port.
> > - Support input flow aging command line in Testpmd.
> >
> > Signed-off-by: BillZhou <dongz@mellanox.com>
> 
> I think you should insert a space in your name: Bill Zhou.
> I find strange to associate "Bill Zhou" with "dongz" in your email.
> Are you sure you don't want to mention "Dong"?

Thanks for pointing that out; it will be updated in later patches.
> 
> 
> > +  * Added support for flow Aging mechanism base on counter.
> 
> Aging -> aging
> base -> based
> counter -> hardware counter?

In the Mellanox mlx5 driver, flow aging is based on hardware counter updates.
This patch does not include that support, so I will remove this line.
> 



* [dpdk-dev] [PATCH 0/2] support flow aging
  2020-03-16 12:52 ` BillZhou
                     ` (2 preceding siblings ...)
  2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
@ 2020-04-13 14:53   ` Dong Zhou
  2020-04-13 14:53     ` [dpdk-dev] [PATCH 1/2] net/mlx5: modify ext-counter memory allocation Dong Zhou
                       ` (2 more replies)
  3 siblings, 3 replies; 50+ messages in thread
From: Dong Zhou @ 2020-04-13 14:53 UTC (permalink / raw)
  To: matan, dongz, orika, shahafs, viacheslavo, john.mcnamara,
	marko.kovacevic
  Cc: dev

These patches implement flow aging for the mlx5 driver. The first patch changes
how the additional per-counter memory is allocated, so that the additional
memory of every counter can be located with a simple offset. The second patch
implements the aging check and the age-out event callback mechanism for the
mlx5 driver.
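
From the application side, the event plus query flow can be sketched as
below. This is only an illustration, not driver or ethdev code: it
assumes a mock port (struct mock_port, mock_get_aged_flows and
drain_aged_flows are hypothetical names) that follows the return
convention documented for rte_flow_get_aged_flows, and it assumes that
reported contexts are consumed by the call, which the API text does not
state.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a port's list of aged-out flow contexts. */
struct mock_port {
	void *aged[8];    /* contexts of flows that aged out */
	uint32_t nb_aged; /* number of valid entries in aged[] */
};

/*
 * Mimics the documented return convention: with nb_contexts == 0 return
 * the total number of aged contexts, otherwise fill the array and return
 * how many entries were written. Consuming the reported entries is an
 * assumption of this mock.
 */
static int
mock_get_aged_flows(struct mock_port *port, void **contexts,
		    uint32_t nb_contexts)
{
	uint32_t i, n;

	if (nb_contexts == 0)
		return (int)port->nb_aged;
	if (contexts == NULL)
		return -EINVAL;
	n = nb_contexts < port->nb_aged ? nb_contexts : port->nb_aged;
	for (i = 0; i < n; i++)
		contexts[i] = port->aged[i];
	/* Shift the entries that were not reported to the front. */
	for (i = n; i < port->nb_aged; i++)
		port->aged[i - n] = port->aged[i];
	port->nb_aged -= n;
	return (int)n;
}

/* Example callback: only counts how many contexts it was handed. */
static int destroyed;

static void
count_destroy(void *ctx)
{
	(void)ctx;
	destroyed++;
}

/*
 * Two-step drain: ask for the count, allocate an array of that size,
 * then fetch the contexts and hand each one to the destroy callback.
 * Returns the number of contexts processed or a negative errno value.
 */
static int
drain_aged_flows(struct mock_port *port, void (*destroy)(void *ctx))
{
	int total = mock_get_aged_flows(port, NULL, 0);
	void **ctx;
	int got, i;

	if (total <= 0)
		return total;
	ctx = calloc((size_t)total, sizeof(*ctx));
	if (ctx == NULL)
		return -ENOMEM;
	got = mock_get_aged_flows(port, ctx, (uint32_t)total);
	for (i = 0; i < got; i++)
		destroy(ctx[i]);
	free(ctx);
	return got;
}
```

In a real application the drain step would typically run after the
RTE_ETH_EVENT_FLOW_AGED callback has signalled the thread that owns the
flow operations, since the query is not safe to run concurrently with
other flow functions.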


Dong Zhou (2):
  net/mlx5: modify ext-counter memory allocation
  net/mlx5: support flow aging

 doc/guides/rel_notes/release_20_05.rst |   1 +
 drivers/net/mlx5/mlx5.c                |  34 ++-
 drivers/net/mlx5/mlx5.h                |  59 ++++-
 drivers/net/mlx5/mlx5_flow.c           | 147 ++++++++++-
 drivers/net/mlx5/mlx5_flow.h           |  15 +-
 drivers/net/mlx5/mlx5_flow_dv.c        | 336 +++++++++++++++++++++----
 drivers/net/mlx5/mlx5_flow_verbs.c     |  16 +-
 7 files changed, 518 insertions(+), 90 deletions(-)

-- 
2.21.0



* [dpdk-dev] [PATCH 1/2] net/mlx5: modify ext-counter memory allocation
  2020-04-13 14:53   ` [dpdk-dev] [PATCH 0/2] " Dong Zhou
@ 2020-04-13 14:53     ` Dong Zhou
  2020-04-13 14:53     ` [dpdk-dev] [PATCH 2/2] net/mlx5: support flow aging Dong Zhou
  2020-04-24 10:45     ` [dpdk-dev] [PATCH v2 0/2] " Bill Zhou
  2 siblings, 0 replies; 50+ messages in thread
From: Dong Zhou @ 2020-04-13 14:53 UTC (permalink / raw)
  To: matan, dongz, orika, shahafs, viacheslavo, john.mcnamara,
	marko.kovacevic
  Cc: dev

Currently, a counter pool for non-batch counters needs memory for 512
ext-counters, which is allocated as one block placed after the memory
of the 512 basic counters. This makes it hard to get the ext-counter
pointer from the corresponding basic-counter pointer, and hard to add
other potential types of additional counter memory.

So, allocate each ext-counter together with its basic counter as a
single piece of memory. The same will apply to any further additional
type of counter memory. This way, one piece of memory contains all the
memory types of one counter, and each type is easy to reach by offset.
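
The layout described above can be sketched as follows. This is a
simplified illustration and not the driver code: struct pool, struct
counter, struct counter_ext and the helper names are hypothetical
stand-ins for the much larger mlx5 types, and the helpers mirror the
idea behind MLX5_POOL_GET_CNT, MLX5_CNT_TO_CNT_EXT and
MLX5_CNT_ARRAY_IDX as functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define COUNTERS_PER_POOL 512
#define POOL_TYPE_EXT (1u << 0)

/* Hypothetical, reduced versions of the driver structures. */
struct counter { uint64_t hits; uint64_t bytes; };
struct counter_ext { uint32_t ref_cnt; uint32_t id; };
struct pool {
	uint32_t type;     /* POOL_TYPE_* flags */
	uint32_t index;
	uint64_t reserved; /* keeps the records that follow 8-byte aligned */
	/* COUNTERS_PER_POOL records follow: [counter][counter_ext]... */
};

/* Size of one per-counter record; depends on the pool type. */
static size_t
cnt_len(const struct pool *p)
{
	return sizeof(struct counter) +
	       ((p->type & POOL_TYPE_EXT) ? sizeof(struct counter_ext) : 0);
}

/* Basic counter of record idx, located right behind the pool header. */
static struct counter *
pool_get_cnt(struct pool *p, uint32_t idx)
{
	return (struct counter *)((char *)(p + 1) + idx * cnt_len(p));
}

/* The ext part sits directly after its basic counter; no pool needed. */
static struct counter_ext *
cnt_to_ext(struct counter *cnt)
{
	return (struct counter_ext *)(cnt + 1);
}

/* Recover the record index from a counter pointer. */
static uint32_t
cnt_array_idx(struct pool *p, struct counter *cnt)
{
	size_t off = (size_t)((char *)cnt - (char *)(p + 1));

	return (uint32_t)(off / cnt_len(p));
}

/* Allocate the header and all records as one zeroed block. */
static struct pool *
pool_create(uint32_t type)
{
	struct pool hdr = { .type = type };
	struct pool *p;

	p = calloc(1, sizeof(*p) + COUNTERS_PER_POOL * cnt_len(&hdr));
	if (p != NULL)
		p->type = type;
	return p;
}
```

With this layout a further per-counter memory type only changes the
record length and the offsets inside the record, not the way a record
is found in the pool.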

Signed-off-by: Dong Zhou <dongz@mellanox.com>
---
 drivers/net/mlx5/mlx5.c            |  4 ++--
 drivers/net/mlx5/mlx5.h            | 21 ++++++++++++++++-----
 drivers/net/mlx5/mlx5_flow_dv.c    | 27 +++++++++++++++------------
 drivers/net/mlx5/mlx5_flow_verbs.c | 16 ++++++++--------
 4 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 293d316413..3d21cffbd0 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -390,10 +390,10 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 					(mlx5_devx_cmd_destroy(pool->min_dcs));
 			}
 			for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
-				if (pool->counters_raw[j].action)
+				if (MLX5_POOL_GET_CNT(pool, j)->action)
 					claim_zero
 					(mlx5_glue->destroy_flow_action
-					       (pool->counters_raw[j].action));
+					 (MLX5_POOL_GET_CNT(pool, j)->action));
 				if (!batch && MLX5_GET_POOL_CNT_EXT
 				    (pool, j)->dcs)
 					claim_zero(mlx5_devx_cmd_destroy
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index fccfe47341..2e8c745c06 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -240,6 +240,18 @@ struct mlx5_drop {
 #define MLX5_COUNTERS_PER_POOL 512
 #define MLX5_MAX_PENDING_QUERIES 4
 #define MLX5_CNT_CONTAINER_RESIZE 64
+#define CNT_SIZE (sizeof(struct mlx5_flow_counter))
+#define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
+
+#define CNT_POOL_TYPE_EXT	(1 << 0)
+#define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
+#define MLX5_CNT_LEN(pool) \
+	(CNT_SIZE + (IS_EXT_POOL((pool)) ? CNTEXT_SIZE : 0))
+#define MLX5_POOL_GET_CNT(pool, index) \
+	((struct mlx5_flow_counter *) \
+	((char *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
+#define MLX5_CNT_ARRAY_IDX(pool, cnt) \
+	((int)(((char *)(cnt) - (char *)((pool) + 1)) / MLX5_CNT_LEN((pool))))
 /*
  * The pool index and offset of counter in the pool array makes up the
  * counter index. In case the counter is from pool 0 and offset 0, it
@@ -248,11 +260,10 @@ struct mlx5_drop {
  */
 #define MLX5_MAKE_CNT_IDX(pi, offset) \
 	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
-#define MLX5_CNT_TO_CNT_EXT(pool, cnt) (&((struct mlx5_flow_counter_ext *) \
-			    ((pool) + 1))[((cnt) - (pool)->counters_raw)])
+#define MLX5_CNT_TO_CNT_EXT(cnt) \
+	((struct mlx5_flow_counter_ext *)((cnt) + 1))
 #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
-			      (&((struct mlx5_flow_counter_ext *) \
-			      ((pool) + 1))[offset])
+	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
 
 struct mlx5_flow_counter_pool;
 
@@ -305,10 +316,10 @@ struct mlx5_flow_counter_pool {
 	rte_atomic64_t start_query_gen; /* Query start round. */
 	rte_atomic64_t end_query_gen; /* Query end round. */
 	uint32_t index; /* Pool index in container. */
+	uint32_t type: 2;
 	rte_spinlock_t sl; /* The pool lock. */
 	struct mlx5_counter_stats_raw *raw;
 	struct mlx5_counter_stats_raw *raw_hw; /* The raw on HW working. */
-	struct mlx5_flow_counter counters_raw[MLX5_COUNTERS_PER_POOL];
 	/* The pool counters memory. */
 };
 
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 18ea577f8c..aa8a774f77 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -3854,7 +3854,7 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
 	MLX5_ASSERT(pool);
 	if (ppool)
 		*ppool = pool;
-	return &pool->counters_raw[idx % MLX5_COUNTERS_PER_POOL];
+	return MLX5_POOL_GET_CNT(pool, idx % MLX5_COUNTERS_PER_POOL);
 }
 
 /**
@@ -4062,7 +4062,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 		if (priv->counter_fallback)
 			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
 					0, pkts, bytes, 0, NULL, NULL, 0);
@@ -4078,7 +4078,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 		*pkts = 0;
 		*bytes = 0;
 	} else {
-		offset = cnt - &pool->counters_raw[0];
+		offset = MLX5_CNT_ARRAY_IDX(pool, cnt);
 		*pkts = rte_be_to_cpu_64(pool->raw->data[offset].hits);
 		*bytes = rte_be_to_cpu_64(pool->raw->data[offset].bytes);
 	}
@@ -4118,9 +4118,9 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 			return NULL;
 	}
 	size = sizeof(*pool);
+	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
 	if (!batch)
-		size += MLX5_COUNTERS_PER_POOL *
-			sizeof(struct mlx5_flow_counter_ext);
+		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
 	pool = rte_calloc(__func__, 1, size, 0);
 	if (!pool) {
 		rte_errno = ENOMEM;
@@ -4131,6 +4131,9 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 		pool->raw = cont->init_mem_mng->raws + n_valid %
 						     MLX5_CNT_CONTAINER_RESIZE;
 	pool->raw_hw = NULL;
+	pool->type = 0;
+	if (!batch)
+		pool->type |= CNT_POOL_TYPE_EXT;
 	rte_spinlock_init(&pool->sl);
 	/*
 	 * The generation of the new allocated counters in this pool is 0, 2 in
@@ -4202,7 +4205,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 					 (int64_t)(uintptr_t)dcs);
 		}
 		i = dcs->id % MLX5_COUNTERS_PER_POOL;
-		cnt = &pool->counters_raw[i];
+		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 		MLX5_GET_POOL_CNT_EXT(pool, i)->dcs = dcs;
 		*cnt_free = cnt;
@@ -4222,10 +4225,10 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 	}
 	pool = TAILQ_FIRST(&cont->pool_list);
 	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
-		cnt = &pool->counters_raw[i];
+		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 	}
-	*cnt_free = &pool->counters_raw[0];
+	*cnt_free = MLX5_POOL_GET_CNT(pool, 0);
 	return cont;
 }
 
@@ -4343,14 +4346,14 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		pool = TAILQ_FIRST(&cont->pool_list);
 	}
 	if (!batch)
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
 	/* Create a DV counter action only in the first time usage. */
 	if (!cnt_free->action) {
 		uint16_t offset;
 		struct mlx5_devx_obj *dcs;
 
 		if (batch) {
-			offset = cnt_free - &pool->counters_raw[0];
+			offset = MLX5_CNT_ARRAY_IDX(pool, cnt_free);
 			dcs = pool->min_dcs;
 		} else {
 			offset = 0;
@@ -4364,7 +4367,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		}
 	}
 	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
-				    (cnt_free - pool->counters_raw));
+				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
 	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
 	/* Update the counter reset values. */
 	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
@@ -4407,7 +4410,7 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 		if (cnt_ext && --cnt_ext->ref_cnt)
 			return;
 	}
diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
index ef4d7a3620..1a5c880221 100644
--- a/drivers/net/mlx5/mlx5_flow_verbs.c
+++ b/drivers/net/mlx5/mlx5_flow_verbs.c
@@ -64,7 +64,7 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
 	MLX5_ASSERT(pool);
 	if (ppool)
 		*ppool = pool;
-	return &pool->counters_raw[idx % MLX5_COUNTERS_PER_POOL];
+	return MLX5_POOL_GET_CNT(pool, idx % MLX5_COUNTERS_PER_POOL);
 }
 
 /**
@@ -207,16 +207,16 @@ flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
 		if (!pool)
 			return 0;
 		for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
-			cnt = &pool->counters_raw[i];
+			cnt = MLX5_POOL_GET_CNT(pool, i);
 			TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 		}
-		cnt = &pool->counters_raw[0];
+		cnt = MLX5_POOL_GET_CNT(pool, 0);
 		cont->pools[n_valid] = pool;
 		pool_idx = n_valid;
 		rte_atomic16_add(&cont->n_valid, 1);
 		TAILQ_INSERT_HEAD(&cont->pool_list, pool, next);
 	}
-	i = cnt - pool->counters_raw;
+	i = MLX5_CNT_ARRAY_IDX(pool, cnt);
 	cnt_ext = MLX5_GET_POOL_CNT_EXT(pool, i);
 	cnt_ext->id = id;
 	cnt_ext->shared = shared;
@@ -251,7 +251,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 
 	cnt = flow_verbs_counter_get_by_idx(dev, counter,
 					    &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	if (--cnt_ext->ref_cnt == 0) {
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
@@ -282,7 +282,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
 		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
 						(dev, flow->counter, &pool);
 		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
-							(pool, cnt);
+						(cnt);
 		struct rte_flow_query_count *qc = data;
 		uint64_t counters[2] = {0, 0};
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
@@ -1090,12 +1090,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
 	}
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	counter.counter_set_handle = cnt_ext->cs->handle;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	counter.counters = cnt_ext->cs;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #endif
-- 
2.21.0



* [dpdk-dev] [PATCH 2/2] net/mlx5: support flow aging
  2020-04-13 14:53   ` [dpdk-dev] [PATCH 0/2] " Dong Zhou
  2020-04-13 14:53     ` [dpdk-dev] [PATCH 1/2] net/mlx5: modify ext-counter memory allocation Dong Zhou
@ 2020-04-13 14:53     ` Dong Zhou
  2020-04-24 10:45     ` [dpdk-dev] [PATCH v2 0/2] " Bill Zhou
  2 siblings, 0 replies; 50+ messages in thread
From: Dong Zhou @ 2020-04-13 14:53 UTC (permalink / raw)
  To: matan, dongz, orika, shahafs, viacheslavo, john.mcnamara,
	marko.kovacevic
  Cc: dev

Currently, the mlx5 driver has no flow aging check or age-out event
callback mechanism; this patch implements them. It includes:
- Splitting the current counter containers into aged and non-aged
  containers to reduce memory consumption. An aged container allocates
  extra memory to save the aging parameters from the user configuration.
- An aging check and age-out event callback mechanism based on the
  current counters. When a flow is found to be aged-out, the
  RTE_ETH_EVENT_FLOW_AGED event is reported to the application.
- Implementing the new API rte_flow_get_aged_flows, which applications
  can use to get the aged flows.
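
The counter-based aging check can be pictured with the minimal sketch
below. It is not the mlx5 implementation: struct age_param here is a
hypothetical, reduced structure, age_check is a hypothetical helper,
and the choice to re-arm a flow once traffic is seen again is an
assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-counter aging state. */
struct age_param {
	uint32_t timeout;   /* seconds without traffic before aging out */
	uint32_t idle_sec;  /* seconds since the hit counter last moved */
	uint64_t last_hits; /* hit count seen at the previous check */
	bool aged;          /* already reported as aged-out */
	void *context;      /* user flow context to report */
};

/*
 * Called on every periodic counter query with the current hit count and
 * the time elapsed since the previous query. Returns true exactly once
 * when the flow newly crosses its timeout, which is the point where an
 * event would be raised.
 */
static bool
age_check(struct age_param *age, uint64_t hits, uint32_t elapsed_sec)
{
	if (hits != age->last_hits) {
		/* Traffic was seen: restart the idle period. */
		age->last_hits = hits;
		age->idle_sec = 0;
		age->aged = false;
		return false;
	}
	if (age->aged)
		return false;
	age->idle_sec += elapsed_sec;
	if (age->idle_sec < age->timeout)
		return false;
	age->aged = true;
	return true;
}
```

Flows for which the check returns true would be queued on an aged list
so that a later rte_flow_get_aged_flows call can report their contexts.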

Signed-off-by: Dong Zhou <dongz@mellanox.com>
---
 doc/guides/rel_notes/release_20_05.rst |   1 +
 drivers/net/mlx5/mlx5.c                |  30 ++-
 drivers/net/mlx5/mlx5.h                |  46 +++-
 drivers/net/mlx5/mlx5_flow.c           | 147 ++++++++++-
 drivers/net/mlx5/mlx5_flow.h           |  15 +-
 drivers/net/mlx5/mlx5_flow_dv.c        | 321 +++++++++++++++++++++----
 drivers/net/mlx5/mlx5_flow_verbs.c     |  14 +-
 7 files changed, 494 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index 6b3cd8cda7..51f79019c1 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -63,6 +63,7 @@ New Features
   * Added support for matching on IPv4 Time To Live and IPv6 Hop Limit.
   * Added support for creating Relaxed Ordering Memory Regions.
   * Added support for jumbo frame size (9K MTU) in Multi-Packet RQ mode.
+  * Added support for flow aging based on hardware counter.
 
 * **Updated the Intel ice driver.**
 
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 3d21cffbd0..bb99166511 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -331,11 +331,16 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
 static void
 mlx5_flow_counters_mng_init(struct mlx5_ibv_shared *sh)
 {
-	uint8_t i;
+	uint8_t i, age;
 
 	TAILQ_INIT(&sh->cmng.flow_counters);
-	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
-		TAILQ_INIT(&sh->cmng.ccont[i].pool_list);
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
+			TAILQ_INIT(&sh->cmng.ccont[i][age].pool_list);
+	}
+	sh->cmng.age = 0;
+	TAILQ_INIT(&sh->cmng.aged_counters);
+	rte_spinlock_init(&sh->cmng.aged_sl);
 }
 
 /**
@@ -365,7 +370,7 @@ static void
 mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 {
 	struct mlx5_counter_stats_mem_mng *mng;
-	uint8_t i;
+	uint8_t i, age = 0;
 	int j;
 	int retries = 1024;
 
@@ -376,13 +381,14 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 			break;
 		rte_pause();
 	}
+age_again:
 	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
 		struct mlx5_flow_counter_pool *pool;
 		uint32_t batch = !!(i % 2);
 
-		if (!sh->cmng.ccont[i].pools)
+		if (!sh->cmng.ccont[i][age].pools)
 			continue;
-		pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
+		pool = TAILQ_FIRST(&sh->cmng.ccont[i][age].pool_list);
 		while (pool) {
 			if (batch) {
 				if (pool->min_dcs)
@@ -400,12 +406,16 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 						  (MLX5_GET_POOL_CNT_EXT
 						  (pool, j)->dcs));
 			}
-			TAILQ_REMOVE(&sh->cmng.ccont[i].pool_list, pool,
-				     next);
+			TAILQ_REMOVE(&sh->cmng.ccont[i][age].pool_list,
+				pool, next);
 			rte_free(pool);
-			pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
+			pool = TAILQ_FIRST(&sh->cmng.ccont[i][age].pool_list);
 		}
-		rte_free(sh->cmng.ccont[i].pools);
+		rte_free(sh->cmng.ccont[i][age].pools);
+	}
+	if (!age) {
+		age = 1;
+		goto age_again;
 	}
 	mng = LIST_FIRST(&sh->cmng.mem_mngs);
 	while (mng) {
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 2e8c745c06..03a5b5a7c5 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -240,13 +240,21 @@ struct mlx5_drop {
 #define MLX5_COUNTERS_PER_POOL 512
 #define MLX5_MAX_PENDING_QUERIES 4
 #define MLX5_CNT_CONTAINER_RESIZE 64
+#define MLX5_CNT_AGE_OFFSET 0x80000000
 #define CNT_SIZE (sizeof(struct mlx5_flow_counter))
 #define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
+#define AGE_SIZE (sizeof(struct mlx5_age_param))
 
 #define CNT_POOL_TYPE_EXT	(1 << 0)
+#define CNT_POOL_TYPE_AGE	(1 << 1)
 #define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
+#define IS_AGE_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_AGE)
+#define MLX_CNT_IS_AGE(counter) ((counter) & MLX5_CNT_AGE_OFFSET ? 1 : 0)
+
 #define MLX5_CNT_LEN(pool) \
-	(CNT_SIZE + (IS_EXT_POOL((pool)) ? CNTEXT_SIZE : 0))
+	(CNT_SIZE + \
+	(IS_AGE_POOL((pool)) ? AGE_SIZE : 0) + \
+	(IS_EXT_POOL((pool)) ? CNTEXT_SIZE : 0))
 #define MLX5_POOL_GET_CNT(pool, index) \
 	((struct mlx5_flow_counter *) \
 	((char *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
@@ -260,13 +268,33 @@ struct mlx5_drop {
  */
 #define MLX5_MAKE_CNT_IDX(pi, offset) \
 	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
-#define MLX5_CNT_TO_CNT_EXT(cnt) \
-	((struct mlx5_flow_counter_ext *)((cnt) + 1))
+#define MLX5_CNT_TO_CNT_EXT(pool, cnt) \
+	((struct mlx5_flow_counter_ext *)\
+	((char *)((cnt) + 1) + \
+	(IS_AGE_POOL(pool) ? AGE_SIZE : 0)))
 #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
-	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
+	MLX5_CNT_TO_CNT_EXT(pool, MLX5_POOL_GET_CNT((pool), (offset)))
+#define MLX5_CNT_TO_AGE(cnt) \
+	((struct mlx5_age_param *)((cnt) + 1))
 
 struct mlx5_flow_counter_pool;
 
+/* Age status. */
+enum {
+	AGE_FREE,
+	AGE_CANDIDATE, /* Counter assigned to flows. */
+	AGE_TMOUT, /* Timeout, wait for aged flows query and destroy. */
+};
+
+/* Counter age parameter. */
+struct mlx5_age_param {
+	rte_atomic16_t state; /**< Age state. */
+	uint32_t timeout:15; /**< Age timeout in unit of 0.1sec. */
+	uint32_t expire:16; /**< Expire time(0.1sec) in the future. */
+	uint16_t port_id; /**< Port id of the counter. */
+	void *context; /**< Flow counter age context. */
+};
+
 struct flow_counter_stats {
 	uint64_t hits;
 	uint64_t bytes;
@@ -355,13 +383,15 @@ struct mlx5_pools_container {
 
 /* Counter global management structure. */
 struct mlx5_flow_counter_mng {
-	uint8_t mhi[2]; /* master \ host container index. */
-	struct mlx5_pools_container ccont[2 * 2];
-	/* 2 containers for single and for batch for double-buffer. */
+	uint8_t mhi[2][2]; /* master \ host container index. */
+	struct mlx5_pools_container ccont[2 * 2][2];
+	struct mlx5_counters aged_counters; /* Aged flow counter list. */
+	rte_spinlock_t aged_sl; /* Aged flow counter list lock. */
 	struct mlx5_counters flow_counters; /* Legacy flow counter list. */
 	uint8_t pending_queries;
 	uint8_t batch;
 	uint16_t pool_index;
+	uint8_t age;
 	uint8_t query_thread_on;
 	LIST_HEAD(mem_mngs, mlx5_counter_stats_mem_mng) mem_mngs;
 	LIST_HEAD(stat_raws, mlx5_counter_stats_raw) free_stat_raws;
@@ -792,6 +822,8 @@ int mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 		       bool clear, uint64_t *pkts, uint64_t *bytes);
 int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
+int mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
 
 /* mlx5_mp.c */
 void mlx5_mp_req_start_rxtx(struct rte_eth_dev *dev);
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index c44bc1f526..58d6b8a9c5 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -24,6 +24,7 @@
 #include <rte_ether.h>
 #include <rte_ethdev_driver.h>
 #include <rte_flow.h>
+#include <rte_cycles.h>
 #include <rte_flow_driver.h>
 #include <rte_malloc.h>
 #include <rte_ip.h>
@@ -242,6 +243,7 @@ static const struct rte_flow_ops mlx5_flow_ops = {
 	.isolate = mlx5_flow_isolate,
 	.query = mlx5_flow_query,
 	.dev_dump = mlx5_flow_dev_dump,
+	.get_aged_flows = mlx5_flow_get_aged_flows,
 };
 
 /* Convert FDIR request to Generic flow. */
@@ -5586,6 +5588,31 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 
 #define MLX5_POOL_QUERY_FREQ_US 1000000
 
+/**
+ * Get the number of all valid pools.
+ *
+ * @param[in] sh
+ *   Pointer to mlx5_ibv_shared object.
+ *
+ * @return
+ *   The number of all valid pools.
+ */
+static uint32_t
+mlx5_get_all_valid_pool_count(struct mlx5_ibv_shared *sh)
+{
+	uint8_t age, i;
+	uint32_t pools_n = 0;
+	struct mlx5_pools_container *cont;
+
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < 2; ++i) {
+			cont = MLX5_CNT_CONTAINER(sh, i, 0, age);
+			pools_n += rte_atomic16_read(&cont->n_valid);
+		}
+	}
+	return pools_n;
+}
+
 /**
  * Set the periodic procedure for triggering asynchronous batch queries for all
  * the counter pools.
@@ -5596,12 +5623,9 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 void
 mlx5_set_query_alarm(struct mlx5_ibv_shared *sh)
 {
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(sh, 0, 0);
-	uint32_t pools_n = rte_atomic16_read(&cont->n_valid);
-	uint32_t us;
+	uint32_t pools_n, us;
 
-	cont = MLX5_CNT_CONTAINER(sh, 1, 0);
-	pools_n += rte_atomic16_read(&cont->n_valid);
+	pools_n = mlx5_get_all_valid_pool_count(sh);
 	us = MLX5_POOL_QUERY_FREQ_US / pools_n;
 	DRV_LOG(DEBUG, "Set alarm for %u pools each %u us", pools_n, us);
 	if (rte_eal_alarm_set(us, mlx5_flow_query_alarm, sh)) {
@@ -5627,6 +5651,7 @@ mlx5_flow_query_alarm(void *arg)
 	uint16_t offset;
 	int ret;
 	uint8_t batch = sh->cmng.batch;
+	uint8_t age = sh->cmng.age;
 	uint16_t pool_index = sh->cmng.pool_index;
 	struct mlx5_pools_container *cont;
 	struct mlx5_pools_container *mcont;
@@ -5635,8 +5660,8 @@ mlx5_flow_query_alarm(void *arg)
 	if (sh->cmng.pending_queries >= MLX5_MAX_PENDING_QUERIES)
 		goto set_alarm;
 next_container:
-	cont = MLX5_CNT_CONTAINER(sh, batch, 1);
-	mcont = MLX5_CNT_CONTAINER(sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(sh, batch, 1, age);
+	mcont = MLX5_CNT_CONTAINER(sh, batch, 0, age);
 	/* Check if resize was done and need to flip a container. */
 	if (cont != mcont) {
 		if (cont->pools) {
@@ -5646,15 +5671,22 @@ mlx5_flow_query_alarm(void *arg)
 		}
 		rte_cio_wmb();
 		 /* Flip the host container. */
-		sh->cmng.mhi[batch] ^= (uint8_t)2;
+		sh->cmng.mhi[batch][age] ^= (uint8_t)2;
 		cont = mcont;
 	}
 	if (!cont->pools) {
 		/* 2 empty containers case is unexpected. */
-		if (unlikely(batch != sh->cmng.batch))
+		if (unlikely(batch != sh->cmng.batch) &&
+			unlikely(age != sh->cmng.age)) {
 			goto set_alarm;
+		}
 		batch ^= 0x1;
 		pool_index = 0;
+		if (batch == 0 && pool_index == 0) {
+			age ^= 0x1;
+			sh->cmng.batch = batch;
+			sh->cmng.age = age;
+		}
 		goto next_container;
 	}
 	pool = cont->pools[pool_index];
@@ -5697,13 +5729,65 @@ mlx5_flow_query_alarm(void *arg)
 	if (pool_index >= rte_atomic16_read(&cont->n_valid)) {
 		batch ^= 0x1;
 		pool_index = 0;
+		if (batch == 0 && pool_index == 0)
+			age ^= 0x1;
 	}
 set_alarm:
 	sh->cmng.batch = batch;
 	sh->cmng.pool_index = pool_index;
+	sh->cmng.age = age;
 	mlx5_set_query_alarm(sh);
 }
 
+static void
+mlx5_flow_aging_check(struct mlx5_ibv_shared *sh,
+		       struct mlx5_flow_counter_pool *pool)
+{
+	struct mlx5_flow_counter *cnt;
+	struct mlx5_age_param *age_param;
+	struct mlx5_counter_stats_raw *cur = pool->raw_hw;
+	struct mlx5_counter_stats_raw *prev = pool->raw;
+	uint16_t curr = rte_rdtsc() / (rte_get_tsc_hz() / 10);
+	uint64_t port_mask = 0;
+	uint32_t i;
+
+	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
+		cnt = MLX5_POOL_GET_CNT(pool, i);
+		age_param = MLX5_CNT_TO_AGE(cnt);
+		if (rte_atomic16_read(&age_param->state) != AGE_CANDIDATE)
+			continue;
+		if (cur->data[i].hits != prev->data[i].hits) {
+			age_param->expire = curr + age_param->timeout;
+			continue;
+		}
+		if ((uint16_t)(curr - age_param->expire) >= (UINT16_MAX / 2))
+			continue;
+		/**
+		 * Take the lock first. Otherwise, if a release
+		 * happens between setting the state to AGE_TMOUT
+		 * and the tailq insertion, the release procedure
+		 * may delete a non-existent tailq node.
+		 */
+		rte_spinlock_lock(&sh->cmng.aged_sl);
+		/* If the cmpset fails, a release has happened. */
+		if (rte_atomic16_cmpset((volatile uint16_t *)
+					&age_param->state,
+					AGE_CANDIDATE,
+					AGE_TMOUT) ==
+					AGE_CANDIDATE) {
+			TAILQ_INSERT_TAIL(&sh->cmng.aged_counters, cnt, next);
+			port_mask |= (1ull << age_param->port_id);
+		}
+		rte_spinlock_unlock(&sh->cmng.aged_sl);
+	}
+
+	for (i = 0; i < 64; i++) {
+		if (port_mask & (1ull << i))
+			_rte_eth_dev_callback_process(&rte_eth_devices[i],
+				RTE_ETH_EVENT_FLOW_AGED, NULL);
+	}
+}
+
 /**
  * Handler for the HW respond about ready values from an asynchronous batch
  * query. This function is probably called by the host thread.
@@ -5728,6 +5812,14 @@ mlx5_flow_async_pool_query_handle(struct mlx5_ibv_shared *sh,
 		raw_to_free = pool->raw_hw;
 	} else {
 		raw_to_free = pool->raw;
+		/**
+		 *  The registered flow aged callback in the age trigger
+		 *  function may take the pool spinlock in case of concurrent
+		 *  access to the aged flows tailq. So make the age trigger
+		 *  call outside the pool spinlock to avoid deadlock.
+		 */
+		if (IS_AGE_POOL(pool))
+			mlx5_flow_aging_check(sh, pool);
 		rte_spinlock_lock(&pool->sl);
 		pool->raw = pool->raw_hw;
 		rte_spinlock_unlock(&pool->sl);
@@ -5876,3 +5968,40 @@ mlx5_flow_dev_dump(struct rte_eth_dev *dev,
 	return mlx5_devx_cmd_flow_dump(sh->fdb_domain, sh->rx_domain,
 				       sh->tx_domain, file);
 }
+
+/**
+ * Get aged-out flows.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] contexts
+ *   The address of an array of pointers to the aged-out flows contexts.
+ * @param[in] nb_contexts
+ *   The length of context array pointers.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   The number of contexts retrieved on success, otherwise a negative errno
+ *   value. If nb_contexts is 0, return the number of all aged contexts.
+ *   If nb_contexts is not 0, return the number of aged flows reported
+ *   in the context array.
+ */
+int
+mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	const struct mlx5_flow_driver_ops *fops;
+	struct rte_flow_attr attr = { .transfer = 0 };
+
+	if (flow_get_drv_type(dev, &attr) == MLX5_FLOW_TYPE_DV) {
+		fops = flow_get_drv_ops(MLX5_FLOW_TYPE_DV);
+		return fops->get_aged_flows(dev, contexts, nb_contexts,
+						    error);
+	}
+	DRV_LOG(ERR,
+		"port %u get aged flows is not supported.",
+		 dev->data->port_id);
+	return -ENOTSUP;
+}
diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index daa1f84145..eb6dff204f 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -199,6 +199,7 @@ enum mlx5_feature_name {
 #define MLX5_FLOW_ACTION_METER (1ull << 31)
 #define MLX5_FLOW_ACTION_SET_IPV4_DSCP (1ull << 32)
 #define MLX5_FLOW_ACTION_SET_IPV6_DSCP (1ull << 33)
+#define MLX5_FLOW_ACTION_AGE (1ull << 34)
 
 #define MLX5_FLOW_FATE_ACTIONS \
 	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | \
@@ -788,6 +789,11 @@ typedef int (*mlx5_flow_counter_query_t)(struct rte_eth_dev *dev,
 					 uint32_t cnt,
 					 bool clear, uint64_t *pkts,
 					 uint64_t *bytes);
+typedef int (*mlx5_flow_get_aged_flows_t)
+					(struct rte_eth_dev *dev,
+					 void **context,
+					 uint32_t nb_contexts,
+					 struct rte_flow_error *error);
 struct mlx5_flow_driver_ops {
 	mlx5_flow_validate_t validate;
 	mlx5_flow_prepare_t prepare;
@@ -803,13 +809,14 @@ struct mlx5_flow_driver_ops {
 	mlx5_flow_counter_alloc_t counter_alloc;
 	mlx5_flow_counter_free_t counter_free;
 	mlx5_flow_counter_query_t counter_query;
+	mlx5_flow_get_aged_flows_t get_aged_flows;
 };
 
 
-#define MLX5_CNT_CONTAINER(sh, batch, thread) (&(sh)->cmng.ccont \
-	[(((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
-#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread) (&(sh)->cmng.ccont \
-	[(~((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
+#define MLX5_CNT_CONTAINER(sh, batch, thread, age) (&(sh)->cmng.ccont \
+	[(((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
+#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread, age) (&(sh)->cmng.ccont \
+	[(~((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
 
 /* mlx5_flow.c */
 
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index aa8a774f77..5ec6de08bd 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -24,6 +24,7 @@
 #include <rte_flow.h>
 #include <rte_flow_driver.h>
 #include <rte_malloc.h>
+#include <rte_cycles.h>
 #include <rte_ip.h>
 #include <rte_gre.h>
 #include <rte_vxlan.h>
@@ -3664,6 +3665,50 @@ mlx5_flow_validate_action_meter(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/**
+ * Validate the age action.
+ *
+ * @param[in] action_flags
+ *   Holds the actions detected until now.
+ * @param[in] action
+ *   Pointer to the age action.
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[out] error
+ *   Pointer to error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_dv_validate_action_age(uint64_t action_flags,
+			    const struct rte_flow_action *action,
+			    struct rte_eth_dev *dev,
+			    struct rte_flow_error *error)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	const struct rte_flow_action_age *age = action->conf;
+
+	if (!priv->config.devx)
+		return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					  NULL,
+					  "age action not supported");
+	if (!(action->conf))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "configuration cannot be null");
+	if (age->timeout >= UINT16_MAX / 2 / 10)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "Max age time: 3275 seconds");
+	if (action_flags & MLX5_FLOW_ACTION_AGE)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, NULL,
+					  "Duplicate age actions set");
+	return 0;
+}
+
 /**
  * Validate the modify-header IPv4 DSCP actions.
  *
@@ -3841,14 +3886,16 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont;
 	struct mlx5_flow_counter_pool *pool;
-	uint32_t batch = 0;
+	uint32_t batch = 0, age = 0;
 
 	idx--;
+	age = MLX_CNT_IS_AGE(idx);
+	idx = age ? idx - MLX5_CNT_AGE_OFFSET : idx;
 	if (idx >= MLX5_CNT_BATCH_OFFSET) {
 		idx -= MLX5_CNT_BATCH_OFFSET;
 		batch = 1;
 	}
-	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	MLX5_ASSERT(idx / MLX5_COUNTERS_PER_POOL < cont->n);
 	pool = cont->pools[idx / MLX5_COUNTERS_PER_POOL];
 	MLX5_ASSERT(pool);
@@ -3968,18 +4015,21 @@ flow_dv_create_counter_stat_mem_mng(struct rte_eth_dev *dev, int raws_n)
  *   Pointer to the Ethernet device structure.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for aging counters.
  *
  * @return
  *   The new container pointer on success, otherwise NULL and rte_errno is set.
  */
 static struct mlx5_pools_container *
-flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
+flow_dv_container_resize(struct rte_eth_dev *dev,
+				uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont =
-			MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+			MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	struct mlx5_pools_container *new_cont =
-			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0);
+			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0, age);
 	struct mlx5_counter_stats_mem_mng *mem_mng = NULL;
 	uint32_t resize = cont->n + MLX5_CNT_CONTAINER_RESIZE;
 	uint32_t mem_size = sizeof(struct mlx5_flow_counter_pool *) * resize;
@@ -3987,7 +4037,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
 
 	/* Fallback mode has no background thread. Skip the check. */
 	if (!priv->counter_fallback &&
-	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1)) {
+	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1, age)) {
 		/* The last resize still hasn't detected by the host thread. */
 		rte_errno = EAGAIN;
 		return NULL;
@@ -4030,7 +4080,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
 	new_cont->init_mem_mng = mem_mng;
 	rte_cio_wmb();
 	 /* Flip the master container. */
-	priv->sh->cmng.mhi[batch] ^= (uint8_t)1;
+	priv->sh->cmng.mhi[batch][age] ^= (uint8_t)1;
 	return new_cont;
 }
 
@@ -4062,7 +4112,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 		if (priv->counter_fallback)
 			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
 					0, pkts, bytes, 0, NULL, NULL, 0);
@@ -4103,17 +4153,17 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
  */
 static struct mlx5_pools_container *
 flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
-		    uint32_t batch)
+		    uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_flow_counter_pool *pool;
 	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
-							       0);
+							       0, age);
 	int16_t n_valid = rte_atomic16_read(&cont->n_valid);
 	uint32_t size;
 
 	if (cont->n == n_valid) {
-		cont = flow_dv_container_resize(dev, batch);
+		cont = flow_dv_container_resize(dev, batch, age);
 		if (!cont)
 			return NULL;
 	}
@@ -4121,6 +4171,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
 	if (!batch)
 		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
+	if (age)
+		size += MLX5_COUNTERS_PER_POOL * AGE_SIZE;
 	pool = rte_calloc(__func__, 1, size, 0);
 	if (!pool) {
 		rte_errno = ENOMEM;
@@ -4134,6 +4186,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 	pool->type = 0;
 	if (!batch)
 		pool->type |= CNT_POOL_TYPE_EXT;
+	if (age)
+		pool->type |= CNT_POOL_TYPE_AGE;
 	rte_spinlock_init(&pool->sl);
 	/*
 	 * The generation of the new allocated counters in this pool is 0, 2 in
@@ -4160,6 +4214,27 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 	return cont;
 }
 
+static void
+flow_dv_counter_update_min_dcs(struct rte_eth_dev *dev,
+			struct mlx5_flow_counter_pool *pool,
+			uint32_t batch, uint32_t age)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_flow_counter_pool *other;
+	struct mlx5_pools_container *cont;
+
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, (age ^ 0x1));
+	other = flow_dv_find_pool_by_id(cont, pool->min_dcs->id);
+	if (!other)
+		return;
+	if (pool->min_dcs->id < other->min_dcs->id) {
+		rte_atomic64_set(&other->a64_dcs,
+			rte_atomic64_read(&pool->a64_dcs));
+	} else {
+		rte_atomic64_set(&pool->a64_dcs,
+			rte_atomic64_read(&other->a64_dcs));
+	}
+}
 /**
  * Prepare a new counter and/or a new counter pool.
  *
@@ -4177,7 +4252,7 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 static struct mlx5_pools_container *
 flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			     struct mlx5_flow_counter **cnt_free,
-			     uint32_t batch)
+			     uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont;
@@ -4186,7 +4261,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 	struct mlx5_flow_counter *cnt;
 	uint32_t i;
 
-	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	if (!batch) {
 		/* bulk_bitmap must be 0 for single counter allocation. */
 		dcs = mlx5_devx_cmd_flow_counter_alloc(priv->sh->ctx, 0);
@@ -4194,7 +4269,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			return NULL;
 		pool = flow_dv_find_pool_by_id(cont, dcs->id);
 		if (!pool) {
-			cont = flow_dv_pool_create(dev, dcs, batch);
+			cont = flow_dv_pool_create(dev, dcs, batch, age);
 			if (!cont) {
 				mlx5_devx_cmd_destroy(dcs);
 				return NULL;
@@ -4204,6 +4279,8 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			rte_atomic64_set(&pool->a64_dcs,
 					 (int64_t)(uintptr_t)dcs);
 		}
+		flow_dv_counter_update_min_dcs(dev,
+						pool, batch, age);
 		i = dcs->id % MLX5_COUNTERS_PER_POOL;
 		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
@@ -4218,7 +4295,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 		rte_errno = ENODATA;
 		return NULL;
 	}
-	cont = flow_dv_pool_create(dev, dcs, batch);
+	cont = flow_dv_pool_create(dev, dcs, batch, age);
 	if (!cont) {
 		mlx5_devx_cmd_destroy(dcs);
 		return NULL;
@@ -4285,7 +4362,7 @@ flow_dv_counter_shared_search(struct mlx5_pools_container *cont, uint32_t id,
  */
 static uint32_t
 flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
-		      uint16_t group)
+		      uint16_t group, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_flow_counter_pool *pool = NULL;
@@ -4301,7 +4378,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	 */
 	uint32_t batch = (group && !shared && !priv->counter_fallback) ? 1 : 0;
 	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
-							       0);
+							       0, age);
 	uint32_t cnt_idx;
 
 	if (!priv->config.devx) {
@@ -4340,13 +4417,13 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		cnt_free = NULL;
 	}
 	if (!cnt_free) {
-		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch);
+		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch, age);
 		if (!cont)
 			return 0;
 		pool = TAILQ_FIRST(&cont->pool_list);
 	}
 	if (!batch)
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
 	/* Create a DV counter action only in the first time usage. */
 	if (!cnt_free->action) {
 		uint16_t offset;
@@ -4369,6 +4446,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
 				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
 	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
+	cnt_idx += age * MLX5_CNT_AGE_OFFSET;
 	/* Update the counter reset values. */
 	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
 				 &cnt_free->bytes))
@@ -4390,6 +4468,60 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	return cnt_idx;
 }
 
+/**
+ * Get age param from counter index.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] counter
+ *   Index to the counter handler.
+ */
+static struct mlx5_age_param*
+flow_dv_counter_idx_get_age(struct rte_eth_dev *dev,
+				uint32_t counter)
+{
+	struct mlx5_flow_counter *cnt;
+	struct mlx5_flow_counter_pool *pool = NULL;
+
+	flow_dv_counter_get_by_idx(dev, counter, &pool);
+	counter = (counter - 1) % MLX5_COUNTERS_PER_POOL;
+	cnt = MLX5_POOL_GET_CNT(pool, counter);
+	return MLX5_CNT_TO_AGE(cnt);
+}
+
+/**
+ * Remove a flow counter from aged counter list.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] counter
+ *   Index to the counter handler.
+ * @param[in] cnt
+ *   Pointer to the counter handler.
+ */
+static void
+flow_dv_counter_remove_from_age(struct rte_eth_dev *dev,
+				uint32_t counter, struct mlx5_flow_counter *cnt)
+{
+	struct mlx5_age_param *age_param;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
+	age_param = flow_dv_counter_idx_get_age(dev, counter);
+	if (rte_atomic16_cmpset((volatile uint16_t *)
+			&age_param->state,
+			AGE_CANDIDATE, AGE_FREE)
+			!= AGE_CANDIDATE) {
+		/**
+		 * We need the lock even if it is age timeout,
+		 * since the counter may still be in process.
+		 */
+		rte_spinlock_lock(&priv->sh->cmng.aged_sl);
+		TAILQ_REMOVE(&priv->sh->cmng.aged_counters,
+			cnt, next);
+		rte_spinlock_unlock(&priv->sh->cmng.aged_sl);
+	}
+	rte_atomic16_set(&age_param->state, AGE_FREE);
+}
 /**
  * Release a flow counter.
  *
@@ -4410,10 +4542,12 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 		if (cnt_ext && --cnt_ext->ref_cnt)
 			return;
 	}
+	if (IS_AGE_POOL(pool))
+		flow_dv_counter_remove_from_age(dev, counter, cnt);
 	/* Put the counter in the end - the last updated one. */
 	TAILQ_INSERT_TAIL(&pool->counters, cnt, next);
 	/*
@@ -5153,6 +5287,15 @@ flow_dv_validate(struct rte_eth_dev *dev, const struct rte_flow_attr *attr,
 			action_flags |= MLX5_FLOW_ACTION_METER;
 			++actions_n;
 			break;
+		case RTE_FLOW_ACTION_TYPE_AGE:
+			ret = flow_dv_validate_action_age(action_flags,
+							  actions, dev,
+							  error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_FLOW_ACTION_AGE;
+			++actions_n;
+			break;
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP:
 			ret = flow_dv_validate_action_modify_ipv4_dscp
 							 (action_flags,
@@ -7164,6 +7307,41 @@ flow_dv_translate_action_port_id(struct rte_eth_dev *dev,
 	return 0;
 }
 
+static uint32_t
+flow_dv_translate_create_counter(struct rte_eth_dev *dev,
+				struct mlx5_flow *dev_flow,
+				const struct rte_flow_action_count *count,
+				const struct rte_flow_action_age *age)
+{
+	uint32_t counter;
+	struct mlx5_age_param *age_param;
+
+	counter = flow_dv_counter_alloc(dev,
+				count ? count->shared : 0,
+				count ? count->id : 0,
+				dev_flow->dv.group,
+				age ? 1 : 0);
+
+	if (!counter || age == NULL)
+		return counter;
+	age_param = flow_dv_counter_idx_get_age(dev, counter);
+	/*
+	 * Use the application-supplied context if there is one,
+	 * otherwise fall back to the flow itself.
+	 */
+	age_param->context = age->context ? age->context : dev_flow->flow;
+	/*
+	 * The counter age accuracy may have a bit of delay. Apply a 0.7
+	 * second bias to the timeout in order to let it age in time.
+	 */
+	age_param->timeout = age->timeout * 10 - 7;
+	age_param->port_id = dev->data->port_id;
+	/* Set expire time in unit of 0.1 sec. */
+	age_param->expire = age_param->timeout +
+			rte_rdtsc() / (rte_get_tsc_hz() / 10);
+	rte_atomic16_set(&age_param->state, AGE_CANDIDATE);
+	return counter;
+}
 /**
  * Add Tx queue matcher
  *
@@ -7328,6 +7506,8 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 			    (MLX5_MAX_MODIFY_NUM + 1)];
 	} mhdr_dummy;
 	struct mlx5_flow_dv_modify_hdr_resource *mhdr_res = &mhdr_dummy.res;
+	const struct rte_flow_action_count *count = NULL;
+	const struct rte_flow_action_age *age = NULL;
 	union flow_dv_attr flow_attr = { .attr = 0 };
 	uint32_t tag_be;
 	union mlx5_flow_tbl_key tbl_key;
@@ -7356,7 +7536,6 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 		const struct rte_flow_action_queue *queue;
 		const struct rte_flow_action_rss *rss;
 		const struct rte_flow_action *action = actions;
-		const struct rte_flow_action_count *count = action->conf;
 		const uint8_t *rss_key;
 		const struct rte_flow_action_jump *jump_data;
 		const struct rte_flow_action_meter *mtr;
@@ -7477,36 +7656,21 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 			 */
 			action_flags |= MLX5_FLOW_ACTION_RSS;
 			break;
+		case RTE_FLOW_ACTION_TYPE_AGE:
 		case RTE_FLOW_ACTION_TYPE_COUNT:
 			if (!dev_conf->devx) {
-				rte_errno = ENOTSUP;
-				goto cnt_err;
-			}
-			flow->counter = flow_dv_counter_alloc(dev,
-							count->shared,
-							count->id,
-							dev_flow->dv.group);
-			if (!flow->counter)
-				goto cnt_err;
-			dev_flow->dv.actions[actions_n++] =
-				  (flow_dv_counter_get_by_idx(dev,
-				  flow->counter, NULL))->action;
-			action_flags |= MLX5_FLOW_ACTION_COUNT;
-			break;
-cnt_err:
-			if (rte_errno == ENOTSUP)
 				return rte_flow_error_set
 					      (error, ENOTSUP,
 					       RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 					       NULL,
 					       "count action not supported");
+			}
+			/* Save information first, will apply later. */
+			if (actions->type == RTE_FLOW_ACTION_TYPE_COUNT)
+				count = action->conf;
 			else
-				return rte_flow_error_set
-						(error, rte_errno,
-						 RTE_FLOW_ERROR_TYPE_ACTION,
-						 action,
-						 "cannot create counter"
-						  " object.");
+				age = action->conf;
+			action_flags |= MLX5_FLOW_ACTION_COUNT;
 			break;
 		case RTE_FLOW_ACTION_TYPE_OF_POP_VLAN:
 			dev_flow->dv.actions[actions_n++] =
@@ -7766,6 +7930,22 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 				dev_flow->dv.actions[modify_action_position] =
 					handle->dvh.modify_hdr->verbs_action;
 			}
+			if (action_flags & MLX5_FLOW_ACTION_COUNT) {
+				flow->counter =
+					flow_dv_translate_create_counter(dev,
+						dev_flow, count, age);
+
+				if (!flow->counter)
+					return rte_flow_error_set
+						(error, rte_errno,
+						RTE_FLOW_ERROR_TYPE_ACTION,
+						NULL,
+						"cannot create counter"
+						" object.");
+				dev_flow->dv.actions[actions_n++] =
+					  (flow_dv_counter_get_by_idx(dev,
+					  flow->counter, NULL))->action;
+			}
 			break;
 		default:
 			break;
@@ -8947,6 +9127,58 @@ flow_dv_counter_query(struct rte_eth_dev *dev, uint32_t counter, bool clear,
 	return 0;
 }
 
+/**
+ * Get aged-out flows.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] context
+ *   The address of an array of pointers to the aged-out flows contexts.
+ * @param[in] nb_contexts
+ *   The length of context array pointers.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   The number of contexts retrieved on success, otherwise a negative errno
+ *   value. If nb_contexts is 0, return the number of all aged contexts.
+ *   If nb_contexts is not 0, return the number of aged flows reported
+ *   in the context array.
+ * @note: only stub for now
+ */
+static int
+flow_get_aged_flows(struct rte_eth_dev *dev,
+		    void **context,
+		    uint32_t nb_contexts,
+		    struct rte_flow_error *error)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_counters *aged_tq = &priv->sh->cmng.aged_counters;
+	struct mlx5_age_param *age_param;
+	struct mlx5_flow_counter *counter;
+	int nb_flows = 0;
+
+	if (nb_contexts && !context)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					  NULL,
+					  "Context array must be provided"
+					  " if nb_contexts != 0");
+	rte_spinlock_lock(&priv->sh->cmng.aged_sl);
+	TAILQ_FOREACH(counter, aged_tq, next) {
+		nb_flows++;
+		if (nb_contexts) {
+			age_param = MLX5_CNT_TO_AGE(counter);
+			context[nb_flows - 1] = age_param->context;
+			if (!(--nb_contexts))
+				break;
+		}
+	}
+	rte_spinlock_unlock(&priv->sh->cmng.aged_sl);
+	return nb_flows;
+}
+
 /*
  * Mutex-protected thunk to lock-free  __flow_dv_translate().
  */
@@ -9013,7 +9245,7 @@ flow_dv_counter_allocate(struct rte_eth_dev *dev)
 	uint32_t cnt;
 
 	flow_dv_shared_lock(dev);
-	cnt = flow_dv_counter_alloc(dev, 0, 0, 1);
+	cnt = flow_dv_counter_alloc(dev, 0, 0, 1, 0);
 	flow_dv_shared_unlock(dev);
 	return cnt;
 }
@@ -9044,6 +9276,7 @@ const struct mlx5_flow_driver_ops mlx5_flow_dv_drv_ops = {
 	.counter_alloc = flow_dv_counter_allocate,
 	.counter_free = flow_dv_counter_free,
 	.counter_query = flow_dv_counter_query,
+	.get_aged_flows = flow_get_aged_flows,
 };
 
 #endif /* HAVE_IBV_FLOW_DV_SUPPORT */
diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
index 1a5c880221..7cf38195bd 100644
--- a/drivers/net/mlx5/mlx5_flow_verbs.c
+++ b/drivers/net/mlx5/mlx5_flow_verbs.c
@@ -56,7 +56,8 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
 			      struct mlx5_flow_counter_pool **ppool)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
+	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
+									0);
 	struct mlx5_flow_counter_pool *pool;
 
 	idx--;
@@ -151,7 +152,8 @@ static uint32_t
 flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
+	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
+									0);
 	struct mlx5_flow_counter_pool *pool = NULL;
 	struct mlx5_flow_counter_ext *cnt_ext = NULL;
 	struct mlx5_flow_counter *cnt = NULL;
@@ -251,7 +253,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 
 	cnt = flow_verbs_counter_get_by_idx(dev, counter,
 					    &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	if (--cnt_ext->ref_cnt == 0) {
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
@@ -282,7 +284,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
 		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
 						(dev, flow->counter, &pool);
 		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
-						(cnt);
+						(pool, cnt);
 		struct rte_flow_query_count *qc = data;
 		uint64_t counters[2] = {0, 0};
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
@@ -1090,12 +1092,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
 	}
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	counter.counter_set_handle = cnt_ext->cs->handle;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	counter.counters = cnt_ext->cs;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #endif
-- 
2.21.0


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
                       ` (2 preceding siblings ...)
  2020-04-12  9:13     ` Ori Kam
@ 2020-04-14  8:32     ` Dong Zhou
  2020-04-14  8:49       ` Ori Kam
                         ` (2 more replies)
  3 siblings, 3 replies; 50+ messages in thread
From: Dong Zhou @ 2020-04-14  8:32 UTC (permalink / raw)
  To: orika, matan, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, thomas, ferruh.yigit, arybchenko
  Cc: dev

One reason to destroy a flow is that no packet has matched it for a
"timeout" period, for example when TCP/UDP sessions are suddenly closed.

Currently, DPDK has no mechanism for flow aging, and applications use
their own ways to detect and destroy aged-out flows.

The flow aging implementation needs to include:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
  contexts from the port.
- Support for the flow aging command line in testpmd.
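
A minimal sketch of the return contract documented for
rte_flow_get_aged_flows in this patch, written as a standalone model
rather than against DPDK itself. The names struct aged_list and
model_get_aged_flows are hypothetical and exist only for illustration;
a real port keeps its aged-out contexts inside the PMD.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a port's list of aged-out flow contexts. */
struct aged_list {
	void *contexts[8];
	uint32_t count;
};

/*
 * Model of the documented contract (an assumption-level sketch, not the
 * DPDK implementation):
 * - nb_contexts == 0: return the number of all aged contexts;
 * - nb_contexts != 0: copy at most nb_contexts entries into contexts
 *   and return how many were copied;
 * - nb_contexts != 0 with a NULL array: return a negative errno value.
 */
int
model_get_aged_flows(const struct aged_list *list, void **contexts,
		     uint32_t nb_contexts)
{
	uint32_t i, n;

	if (nb_contexts != 0 && contexts == NULL)
		return -EINVAL;
	if (nb_contexts == 0)
		return (int)list->count;
	n = nb_contexts < list->count ? nb_contexts : list->count;
	for (i = 0; i < n; i++)
		contexts[i] = list->contexts[i];
	return (int)n;
}
```

An application would typically call once with nb_contexts set to 0 to
learn how many contexts are pending, then call again with an array of
that size and destroy the flows the contexts refer to.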

Signed-off-by: Dong Zhou <dongz@mellanox.com>
---
 app/test-pmd/cmdline_flow.c              | 26 ++++++++++
 doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
 doc/guides/rel_notes/release_20_05.rst   | 11 +++++
 lib/librte_ethdev/rte_ethdev.h           |  1 +
 lib/librte_ethdev/rte_ethdev_version.map |  3 ++
 lib/librte_ethdev/rte_flow.c             | 18 +++++++
 lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_flow_driver.h      |  6 +++
 8 files changed, 149 insertions(+)

diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
index e6ab8ff2f7..45bcff3cf5 100644
--- a/app/test-pmd/cmdline_flow.c
+++ b/app/test-pmd/cmdline_flow.c
@@ -343,6 +343,8 @@ enum index {
 	ACTION_SET_IPV4_DSCP_VALUE,
 	ACTION_SET_IPV6_DSCP,
 	ACTION_SET_IPV6_DSCP_VALUE,
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
 };
 
 /** Maximum size for pattern in struct rte_flow_item_raw. */
@@ -1145,6 +1147,7 @@ static const enum index next_action[] = {
 	ACTION_SET_META,
 	ACTION_SET_IPV4_DSCP,
 	ACTION_SET_IPV6_DSCP,
+	ACTION_AGE,
 	ZERO,
 };
 
@@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[] = {
 	ZERO,
 };
 
+static const enum index action_age[] = {
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
+	ACTION_NEXT,
+	ZERO,
+};
+
 static int parse_set_raw_encap_decap(struct context *, const struct token *,
 				     const char *, unsigned int,
 				     void *, unsigned int);
@@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
 			     (struct rte_flow_action_set_dscp, dscp)),
 		.call = parse_vc_conf,
 	},
+	[ACTION_AGE] = {
+		.name = "age",
+		.help = "set flow aging timeout",
+		.next = NEXT(action_age),
+		.priv = PRIV_ACTION(AGE,
+			sizeof(struct rte_flow_action_age)),
+		.call = parse_vc,
+	},
+	[ACTION_AGE_TIMEOUT] = {
+		.name = "timeout",
+		.help = "flow age timeout value",
+		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
+					   timeout, 24)),
+		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
+		.call = parse_vc_conf,
+	},
 };
 
 /** Remove and return last entry from argument stack. */
diff --git a/doc/guides/prog_guide/rte_flow.rst b/doc/guides/prog_guide/rte_flow.rst
index 41c147913c..cf4368e1c4 100644
--- a/doc/guides/prog_guide/rte_flow.rst
+++ b/doc/guides/prog_guide/rte_flow.rst
@@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
    | ``dscp``  | DSCP in low 6 bits, rest ignore |
    +-----------+---------------------------------+
 
+Action: ``AGE``
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Set the aging timeout configuration for a flow.
+
+Event RTE_ETH_EVENT_FLOW_AGED will be reported if the
+timeout passes without any packet matching the flow.
+
+.. _table_rte_flow_action_age:
+
+.. table:: AGE
+
+   +--------------+---------------------------------+
+   | Field        | Value                           |
+   +==============+=================================+
+   | ``timeout``  | 24-bit timeout value in seconds |
+   +--------------+---------------------------------+
+   | ``reserved`` | 8 bits reserved, must be zero   |
+   +--------------+---------------------------------+
+   | ``context``  | user input flow context         |
+   +--------------+---------------------------------+
+
 Negative types
 ~~~~~~~~~~~~~~
 
diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index db885f3609..6b3cd8cda7 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -100,6 +100,17 @@ New Features
 
   * Added generic filter support.
 
+* **Added flow aging support.**
+
+  Added flow aging support to detect and report aged-out flows, including:
+
+  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and the
+    application flow context for each flow.
+  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to report that
+    there are new aged-out flows.
+  * Added new API: rte_flow_get_aged_flows to get the aged-out flows contexts
+    from the port.
+
 Removed Items
 -------------
 
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d1a593ad11..74c9d00f36 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows are detected */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 3f32fdecf7..fa4b5816be 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -230,4 +230,7 @@ EXPERIMENTAL {
 
 	# added in 20.02
 	rte_flow_dev_dump;
+
+	# added in 20.05
+	rte_flow_get_aged_flows;
 };
diff --git a/lib/librte_ethdev/rte_flow.c b/lib/librte_ethdev/rte_flow.c
index a5ac1c7fbd..3699edce49 100644
--- a/lib/librte_ethdev/rte_flow.c
+++ b/lib/librte_ethdev/rte_flow.c
@@ -172,6 +172,7 @@ static const struct rte_flow_desc_data rte_flow_desc_action[] = {
 	MK_FLOW_ACTION(SET_META, sizeof(struct rte_flow_action_set_meta)),
 	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct rte_flow_action_set_dscp)),
 	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct rte_flow_action_set_dscp)),
+	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
 };
 
 int
@@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file, struct rte_flow_error *error)
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 				  NULL, rte_strerror(ENOSYS));
 }
+
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+		    uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->get_aged_flows))
+		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
+				nb_contexts, error), error);
+	return rte_flow_error_set(error, ENOTSUP,
+				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+				  NULL, rte_strerror(ENOTSUP));
+}
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 7f3e08fad3..fab44f6c0b 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_dscp.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
+
+	/**
+	 * Report the flow as aged-out if the timeout passes without any
+	 * packet matching it.
+	 *
+	 * See struct rte_flow_action_age.
+	 * See function rte_flow_get_aged_flows.
+	 * See enum RTE_ETH_EVENT_FLOW_AGED.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
 	uint16_t index; /**< Queue index to use. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report the flow as aged-out if the timeout passes without any packet
+ * matching it. The RTE_ETH_EVENT_FLOW_AGED event is triggered when a
+ * port detects new aged-out flows.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint32_t timeout:24; /**< Time in seconds. */
+	uint32_t reserved:8; /**< Reserved, must be zero. */
+	void *context;
+		/**< The user flow context, NULL means the rte_flow pointer. */
+};
 
 /**
  * @warning
@@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * The RTE_ETH_EVENT_FLOW_AGED event is triggered when at least one new
+ * aged-out flow is detected after the last call to rte_flow_get_aged_flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously regardless of the event.
+ * It is not safe to call rte_flow_get_aged_flows together with other flow
+ * functions from multiple threads simultaneously.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in, out] contexts
+ *   The address of an array of pointers to the aged-out flows contexts.
+ * @param[in] nb_contexts
+ *   The length of context array pointers.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   If nb_contexts is 0, the number of all aged contexts.
+ *   If nb_contexts is not 0, the number of aged flows reported in the
+ *   context array. On failure, a negative errno value.
+ *
+ * @see rte_flow_action_age
+ * @see RTE_ETH_EVENT_FLOW_AGED
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_flow_driver.h b/lib/librte_ethdev/rte_flow_driver.h
index 51a9a57a0f..881cc469b7 100644
--- a/lib/librte_ethdev/rte_flow_driver.h
+++ b/lib/librte_ethdev/rte_flow_driver.h
@@ -101,6 +101,12 @@ struct rte_flow_ops {
 		(struct rte_eth_dev *dev,
 		 FILE *file,
 		 struct rte_flow_error *error);
+	/** See rte_flow_get_aged_flows() */
+	int (*get_aged_flows)
+		(struct rte_eth_dev *dev,
+		 void **context,
+		 uint32_t nb_contexts,
+		 struct rte_flow_error *err);
 };
 
 /**
-- 
2.21.0


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-14  8:32     ` [dpdk-dev] [PATCH v2] " Dong Zhou
@ 2020-04-14  8:49       ` Ori Kam
  2020-04-14  9:23         ` Bill Zhou
  2020-04-16 13:32         ` Ferruh Yigit
  2020-04-17 22:00       ` Ferruh Yigit
  2020-04-21  6:22       ` [dpdk-dev] [PATCH v3] " Bill Zhou
  2 siblings, 2 replies; 50+ messages in thread
From: Ori Kam @ 2020-04-14  8:49 UTC (permalink / raw)
  To: Bill Zhou, Matan Azrad, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, ferruh.yigit, arybchenko
  Cc: dev



> -----Original Message-----
> From: Dong Zhou <dongz@mellanox.com>
> Sent: Tuesday, April 14, 2020 11:33 AM
> To: Ori Kam <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>;
> wenzhuo.lu@intel.com; jingjing.wu@intel.com; bernard.iremonger@intel.com;
> john.mcnamara@intel.com; marko.kovacevic@intel.com; Thomas Monjalon
> <thomas@monjalon.net>; ferruh.yigit@intel.com; arybchenko@solarflare.com
> Cc: dev@dpdk.org
> Subject: [PATCH v2] ethdev: support flow aging
> 
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is not any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
> 
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> - Support input flow aging command line in Testpmd.
> 
> Signed-off-by: Dong Zhou <dongz@mellanox.com>
> ---
As I said before, nice patch, and I hope to see more patches from you.
Just a small nit: please add a change log next time.

Acked-by: Ori Kam <orika@mellanox.com>
Thanks,
Ori


>  app/test-pmd/cmdline_flow.c              | 26 ++++++++++
>  doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
>  doc/guides/rel_notes/release_20_05.rst   | 11 +++++
>  lib/librte_ethdev/rte_ethdev.h           |  1 +
>  lib/librte_ethdev/rte_ethdev_version.map |  3 ++
>  lib/librte_ethdev/rte_flow.c             | 18 +++++++
>  lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_flow_driver.h      |  6 +++
>  8 files changed, 149 insertions(+)
> 
> diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
> index e6ab8ff2f7..45bcff3cf5 100644
> --- a/app/test-pmd/cmdline_flow.c
> +++ b/app/test-pmd/cmdline_flow.c
> @@ -343,6 +343,8 @@ enum index {
>  	ACTION_SET_IPV4_DSCP_VALUE,
>  	ACTION_SET_IPV6_DSCP,
>  	ACTION_SET_IPV6_DSCP_VALUE,
> +	ACTION_AGE,
> +	ACTION_AGE_TIMEOUT,
>  };
> 
>  /** Maximum size for pattern in struct rte_flow_item_raw. */
> @@ -1145,6 +1147,7 @@ static const enum index next_action[] = {
>  	ACTION_SET_META,
>  	ACTION_SET_IPV4_DSCP,
>  	ACTION_SET_IPV6_DSCP,
> +	ACTION_AGE,
>  	ZERO,
>  };
> 
> @@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[] = {
>  	ZERO,
>  };
> 
> +static const enum index action_age[] = {
> +	ACTION_AGE,
> +	ACTION_AGE_TIMEOUT,
> +	ACTION_NEXT,
> +	ZERO,
> +};
> +
>  static int parse_set_raw_encap_decap(struct context *, const struct token *,
>  				     const char *, unsigned int,
>  				     void *, unsigned int);
> @@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
>  			     (struct rte_flow_action_set_dscp, dscp)),
>  		.call = parse_vc_conf,
>  	},
> +	[ACTION_AGE] = {
> +		.name = "age",
> +		.help = "set a specific metadata header",
> +		.next = NEXT(action_age),
> +		.priv = PRIV_ACTION(AGE,
> +			sizeof(struct rte_flow_action_age)),
> +		.call = parse_vc,
> +	},
> +	[ACTION_AGE_TIMEOUT] = {
> +		.name = "timeout",
> +		.help = "flow age timeout value",
> +		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
> +					   timeout, 24)),
> +		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
> +		.call = parse_vc_conf,
> +	},
>  };
> 
>  /** Remove and return last entry from argument stack. */
> diff --git a/doc/guides/prog_guide/rte_flow.rst
> b/doc/guides/prog_guide/rte_flow.rst
> index 41c147913c..cf4368e1c4 100644
> --- a/doc/guides/prog_guide/rte_flow.rst
> +++ b/doc/guides/prog_guide/rte_flow.rst
> @@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error
> will be returned.
>     | ``dscp``  | DSCP in low 6 bits, rest ignore |
>     +-----------+---------------------------------+
> 
> +Action: ``AGE``
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Set ageing timeout configuration to a flow.
> +
> +Event RTE_ETH_EVENT_FLOW_AGED will be reported if
> +timeout passed without any matching on the flow.
> +
> +.. _table_rte_flow_action_age:
> +
> +.. table:: AGE
> +
> +   +--------------+---------------------------------+
> +   | Field        | Value                           |
> +   +==============+=================================+
> +   | ``timeout``  | 24 bits timeout value           |
> +   +--------------+---------------------------------+
> +   | ``reserved`` | 8 bits reserved, must be zero   |
> +   +--------------+---------------------------------+
> +   | ``context``  | user input flow context         |
> +   +--------------+---------------------------------+
> +
>  Negative types
>  ~~~~~~~~~~~~~~
> 
> diff --git a/doc/guides/rel_notes/release_20_05.rst
> b/doc/guides/rel_notes/release_20_05.rst
> index db885f3609..6b3cd8cda7 100644
> --- a/doc/guides/rel_notes/release_20_05.rst
> +++ b/doc/guides/rel_notes/release_20_05.rst
> @@ -100,6 +100,17 @@ New Features
> 
>    * Added generic filter support.
> 
> +* **Added flow Aging Support.**
> +
> +  Added flow Aging support to detect and report aged-out flows, including:
> +
> +  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
> the
> +    application flow context for each flow.
> +  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
> that
> +    there are new aged-out flows.
> +  * Added new API: rte_flow_get_aged_flows to get the aged-out flows
> contexts
> +    from the port.
> +
>  Removed Items
>  -------------
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index d1a593ad11..74c9d00f36 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows are detected */
>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };
> 
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
> b/lib/librte_ethdev/rte_ethdev_version.map
> index 3f32fdecf7..fa4b5816be 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -230,4 +230,7 @@ EXPERIMENTAL {
> 
>  	# added in 20.02
>  	rte_flow_dev_dump;
> +
> +	# added in 20.05
> +	rte_flow_get_aged_flows;
>  };
> diff --git a/lib/librte_ethdev/rte_flow.c b/lib/librte_ethdev/rte_flow.c
> index a5ac1c7fbd..3699edce49 100644
> --- a/lib/librte_ethdev/rte_flow.c
> +++ b/lib/librte_ethdev/rte_flow.c
> @@ -172,6 +172,7 @@ static const struct rte_flow_desc_data
> rte_flow_desc_action[] = {
>  	MK_FLOW_ACTION(SET_META, sizeof(struct
> rte_flow_action_set_meta)),
>  	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct
> rte_flow_action_set_dscp)),
>  	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct
> rte_flow_action_set_dscp)),
> +	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
>  };
> 
>  int
> @@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file,
> struct rte_flow_error *error)
>  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
>  				  NULL, rte_strerror(ENOSYS));
>  }
> +
> +int
> +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> +		    uint32_t nb_contexts, struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +
> +	if (unlikely(!ops))
> +		return -rte_errno;
> +	if (likely(!!ops->get_aged_flows))
> +		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
> +				nb_contexts, error), error);
> +	return rte_flow_error_set(error, ENOTSUP,
> +				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +				  NULL, rte_strerror(ENOTSUP));
> +}
> diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
> index 7f3e08fad3..fab44f6c0b 100644
> --- a/lib/librte_ethdev/rte_flow.h
> +++ b/lib/librte_ethdev/rte_flow.h
> @@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
>  	 * See struct rte_flow_action_set_dscp.
>  	 */
>  	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
> +
> +	/**
> +	 * Report as aged flow if timeout passed without any matching on the
> +	 * flow.
> +	 *
> +	 * See struct rte_flow_action_age.
> +	 * See function rte_flow_get_aged_flows
> +	 * see enum RTE_ETH_EVENT_FLOW_AGED
> +	 */
> +	RTE_FLOW_ACTION_TYPE_AGE,
>  };
> 
>  /**
> @@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
>  	uint16_t index; /**< Queue index to use. */
>  };
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this structure may change without prior notice
> + *
> + * RTE_FLOW_ACTION_TYPE_AGE
> + *
> + * Report flow as aged-out if timeout passed without any matching
> + * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
> + * port detects new aged-out flows.
> + *
> + * The flow context and the flow handle will be reported by the
> + * rte_flow_get_aged_flows API.
> + */
> +struct rte_flow_action_age {
> +	uint32_t timeout:24; /**< Time in seconds. */
> +	uint32_t reserved:8; /**< Reserved, must be zero. */
> +	void *context;
> +		/**< The user flow context, NULL means the rte_flow pointer.
> */
> +};
> 
>  /**
>   * @warning
> @@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
>  	      const void *src,
>  	      struct rte_flow_error *error);
> 
> +/**
> + * Get aged-out flows of a given port.
> + *
> + * The RTE_ETH_EVENT_FLOW_AGED event is triggered when at least one new
> + * aged-out flow is detected after the last call to rte_flow_get_aged_flows.
> + * This function can be called to get the aged flows asynchronously from the
> + * event callback or synchronously regardless of the event.
> + * It is not safe to call rte_flow_get_aged_flows together with other flow
> + * functions from multiple threads simultaneously.
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[in, out] contexts
> + *   The address of an array of pointers to the aged-out flows contexts.
> + * @param[in] nb_contexts
> + *   The length of context array pointers.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL. Initialized in case of
> + *   error only.
> + *
> + * @return
> + *   if nb_contexts is 0, return the amount of all aged contexts.
> + *   if nb_contexts is not 0 , return the amount of aged flows reported
> + *   in the context array, otherwise negative errno value.
> + *
> + * @see rte_flow_action_age
> + * @see RTE_ETH_EVENT_FLOW_AGED
> + */
> +__rte_experimental
> +int
> +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> +			uint32_t nb_contexts, struct rte_flow_error *error);
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_ethdev/rte_flow_driver.h
> b/lib/librte_ethdev/rte_flow_driver.h
> index 51a9a57a0f..881cc469b7 100644
> --- a/lib/librte_ethdev/rte_flow_driver.h
> +++ b/lib/librte_ethdev/rte_flow_driver.h
> @@ -101,6 +101,12 @@ struct rte_flow_ops {
>  		(struct rte_eth_dev *dev,
>  		 FILE *file,
>  		 struct rte_flow_error *error);
> +	/** See rte_flow_get_aged_flows() */
> +	int (*get_aged_flows)
> +		(struct rte_eth_dev *dev,
> +		 void **context,
> +		 uint32_t nb_contexts,
> +		 struct rte_flow_error *err);
>  };
> 
>  /**
> --
> 2.21.0


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-14  8:49       ` Ori Kam
@ 2020-04-14  9:23         ` Bill Zhou
  2020-04-16 13:32         ` Ferruh Yigit
  1 sibling, 0 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-14  9:23 UTC (permalink / raw)
  To: Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, Thomas Monjalon, ferruh.yigit,
	arybchenko
  Cc: dev



> -----Original Message-----
> From: Ori Kam <orika@mellanox.com>
> Sent: Tuesday, April 14, 2020 4:50 PM
> To: Bill Zhou <dongz@mellanox.com>; Matan Azrad
> <matan@mellanox.com>; wenzhuo.lu@intel.com; jingjing.wu@intel.com;
> bernard.iremonger@intel.com; john.mcnamara@intel.com;
> marko.kovacevic@intel.com; Thomas Monjalon <thomas@monjalon.net>;
> ferruh.yigit@intel.com; arybchenko@solarflare.com
> Cc: dev@dpdk.org
> Subject: RE: [PATCH v2] ethdev: support flow aging
> 
> 
> 
> > -----Original Message-----
> > From: Dong Zhou <dongz@mellanox.com>
> > Sent: Tuesday, April 14, 2020 11:33 AM
> > To: Ori Kam <orika@mellanox.com>; Matan Azrad
> <matan@mellanox.com>;
> > wenzhuo.lu@intel.com; jingjing.wu@intel.com;
> > bernard.iremonger@intel.com; john.mcnamara@intel.com;
> > marko.kovacevic@intel.com; Thomas Monjalon <thomas@monjalon.net>;
> > ferruh.yigit@intel.com; arybchenko@solarflare.com
> > Cc: dev@dpdk.org
> > Subject: [PATCH v2] ethdev: support flow aging
> >
> > One of the reasons to destroy a flow is the fact that no packet
> > matches the flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is not any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> >
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> timeout and
> >   the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> >   that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >   contexts from the port.
> > - Support input flow aging command line in Testpmd.
> >
> > Signed-off-by: Dong Zhou <dongz@mellanox.com>
> > ---
> Like said before nice patch and hope to see more patches from you.
> Just a small nit please next time add change log.
> 

Sorry about that.
---
v2: Removed the line "* Added support for flow Aging mechanism base on
counter." from doc/guides/rel_notes/release_20_05.rst, because this patch
does not include that support.
---

> Acked-by: Ori Kam <orika@mellanox.com>
> Thanks,
> Ori
> 
> 
> >  app/test-pmd/cmdline_flow.c              | 26 ++++++++++
> >  doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
> >  doc/guides/rel_notes/release_20_05.rst   | 11 +++++
> >  lib/librte_ethdev/rte_ethdev.h           |  1 +
> >  lib/librte_ethdev/rte_ethdev_version.map |  3 ++
> >  lib/librte_ethdev/rte_flow.c             | 18 +++++++
> >  lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
> >  lib/librte_ethdev/rte_flow_driver.h      |  6 +++
> >  8 files changed, 149 insertions(+)
> >
> > diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
> > index e6ab8ff2f7..45bcff3cf5 100644
> > --- a/app/test-pmd/cmdline_flow.c
> > +++ b/app/test-pmd/cmdline_flow.c
> > @@ -343,6 +343,8 @@ enum index {
> >  	ACTION_SET_IPV4_DSCP_VALUE,
> >  	ACTION_SET_IPV6_DSCP,
> >  	ACTION_SET_IPV6_DSCP_VALUE,
> > +	ACTION_AGE,
> > +	ACTION_AGE_TIMEOUT,
> >  };
> >
> >  /** Maximum size for pattern in struct rte_flow_item_raw. */
> > @@ -1145,6 +1147,7 @@ static const enum index next_action[] = {
> >  	ACTION_SET_META,
> >  	ACTION_SET_IPV4_DSCP,
> >  	ACTION_SET_IPV6_DSCP,
> > +	ACTION_AGE,
> >  	ZERO,
> >  };
> >
> > @@ -1370,6 +1373,13 @@ static const enum index
> action_set_ipv6_dscp[] = {
> >  	ZERO,
> >  };
> >
> > +static const enum index action_age[] = {
> > +	ACTION_AGE,
> > +	ACTION_AGE_TIMEOUT,
> > +	ACTION_NEXT,
> > +	ZERO,
> > +};
> > +
> >  static int parse_set_raw_encap_decap(struct context *, const struct
> token *,
> >  				     const char *, unsigned int,
> >  				     void *, unsigned int);
> > @@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
> >  			     (struct rte_flow_action_set_dscp, dscp)),
> >  		.call = parse_vc_conf,
> >  	},
> > +	[ACTION_AGE] = {
> > +		.name = "age",
> > +		.help = "set a specific metadata header",
> > +		.next = NEXT(action_age),
> > +		.priv = PRIV_ACTION(AGE,
> > +			sizeof(struct rte_flow_action_age)),
> > +		.call = parse_vc,
> > +	},
> > +	[ACTION_AGE_TIMEOUT] = {
> > +		.name = "timeout",
> > +		.help = "flow age timeout value",
> > +		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
> > +					   timeout, 24)),
> > +		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
> > +		.call = parse_vc_conf,
> > +	},
> >  };
> >
> >  /** Remove and return last entry from argument stack. */
> > diff --git a/doc/guides/prog_guide/rte_flow.rst
> > b/doc/guides/prog_guide/rte_flow.rst
> > index 41c147913c..cf4368e1c4 100644
> > --- a/doc/guides/prog_guide/rte_flow.rst
> > +++ b/doc/guides/prog_guide/rte_flow.rst
> > @@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION
> error
> > will be returned.
> >     | ``dscp``  | DSCP in low 6 bits, rest ignore |
> >     +-----------+---------------------------------+
> >
> > +Action: ``AGE``
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Set ageing timeout configuration to a flow.
> > +
> > +Event RTE_ETH_EVENT_FLOW_AGED will be reported if timeout passed
> > +without any matching on the flow.
> > +
> > +.. _table_rte_flow_action_age:
> > +
> > +.. table:: AGE
> > +
> > +   +--------------+---------------------------------+
> > +   | Field        | Value                           |
> > +   +==============+=================================+
> > +   | ``timeout``  | 24 bits timeout value           |
> > +   +--------------+---------------------------------+
> > +   | ``reserved`` | 8 bits reserved, must be zero   |
> > +   +--------------+---------------------------------+
> > +   | ``context``  | user input flow context         |
> > +   +--------------+---------------------------------+
> > +
> >  Negative types
> >  ~~~~~~~~~~~~~~
> >
> > diff --git a/doc/guides/rel_notes/release_20_05.rst
> > b/doc/guides/rel_notes/release_20_05.rst
> > index db885f3609..6b3cd8cda7 100644
> > --- a/doc/guides/rel_notes/release_20_05.rst
> > +++ b/doc/guides/rel_notes/release_20_05.rst
> > @@ -100,6 +100,17 @@ New Features
> >
> >    * Added generic filter support.
> >
> > +* **Added flow Aging Support.**
> > +
> > +  Added flow Aging support to detect and report aged-out flows,
> including:
> > +
> > +  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
> and
> > the
> > +    application flow context for each flow.
> > +  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> > that
> > +    there are new aged-out flows.
> > +  * Added new API: rte_flow_get_aged_flows to get the aged-out flows
> > contexts
> > +    from the port.
> > +
> >  Removed Items
> >  -------------
> >
> > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > b/lib/librte_ethdev/rte_ethdev.h index d1a593ad11..74c9d00f36 100644
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
> */
> >  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >  };
> >
> > diff --git a/lib/librte_ethdev/rte_ethdev_version.map
> > b/lib/librte_ethdev/rte_ethdev_version.map
> > index 3f32fdecf7..fa4b5816be 100644
> > --- a/lib/librte_ethdev/rte_ethdev_version.map
> > +++ b/lib/librte_ethdev/rte_ethdev_version.map
> > @@ -230,4 +230,7 @@ EXPERIMENTAL {
> >
> >  	# added in 20.02
> >  	rte_flow_dev_dump;
> > +
> > +	# added in 20.05
> > +	rte_flow_get_aged_flows;
> >  };
> > diff --git a/lib/librte_ethdev/rte_flow.c
> > b/lib/librte_ethdev/rte_flow.c index a5ac1c7fbd..3699edce49 100644
> > --- a/lib/librte_ethdev/rte_flow.c
> > +++ b/lib/librte_ethdev/rte_flow.c
> > @@ -172,6 +172,7 @@ static const struct rte_flow_desc_data
> > rte_flow_desc_action[] = {
> >  	MK_FLOW_ACTION(SET_META, sizeof(struct
> rte_flow_action_set_meta)),
> >  	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct
> > rte_flow_action_set_dscp)),
> >  	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct
> > rte_flow_action_set_dscp)),
> > +	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
> >  };
> >
> >  int
> > @@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE
> *file,
> > struct rte_flow_error *error)
> >  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> >  				  NULL, rte_strerror(ENOSYS));
> >  }
> > +
> > +int
> > +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> > +		    uint32_t nb_contexts, struct rte_flow_error *error) {
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> > +
> > +	if (unlikely(!ops))
> > +		return -rte_errno;
> > +	if (likely(!!ops->get_aged_flows))
> > +		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
> > +				nb_contexts, error), error);
> > +	return rte_flow_error_set(error, ENOTSUP,
> > +				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > +				  NULL, rte_strerror(ENOTSUP));
> > +}
> > diff --git a/lib/librte_ethdev/rte_flow.h
> > b/lib/librte_ethdev/rte_flow.h index 7f3e08fad3..fab44f6c0b 100644
> > --- a/lib/librte_ethdev/rte_flow.h
> > +++ b/lib/librte_ethdev/rte_flow.h
> > @@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
> >  	 * See struct rte_flow_action_set_dscp.
> >  	 */
> >  	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
> > +
> > +	/**
> > +	 * Report as aged flow if timeout passed without any matching on
> the
> > +	 * flow.
> > +	 *
> > +	 * See struct rte_flow_action_age.
> > +	 * See function rte_flow_get_aged_flows
> > +	 * see enum RTE_ETH_EVENT_FLOW_AGED
> > +	 */
> > +	RTE_FLOW_ACTION_TYPE_AGE,
> >  };
> >
> >  /**
> > @@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
> >  	uint16_t index; /**< Queue index to use. */  };
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this structure may change without prior notice
> > + *
> > + * RTE_FLOW_ACTION_TYPE_AGE
> > + *
> > + * Report flow as aged-out if timeout passed without any matching
> > + * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
> > + * port detects new aged-out flows.
> > + *
> > + * The flow context and the flow handle will be reported by the
> > + * rte_flow_get_aged_flows API.
> > + */
> > +struct rte_flow_action_age {
> > +	uint32_t timeout:24; /**< Time in seconds. */
> > +	uint32_t reserved:8; /**< Reserved, must be zero. */
> > +	void *context;
> > +		/**< The user flow context, NULL means the rte_flow
> pointer.
> > */
> > +};
> >
> >  /**
> >   * @warning
> > @@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
> >  	      const void *src,
> >  	      struct rte_flow_error *error);
> >
> > +/**
> > + * Get aged-out flows of a given port.
> > + *
> > + * RTE_ETH_EVENT_FLOW_AGED event will be triggered when at least
> one
> > new aged
> > + * out flow was detected after the last call to rte_flow_get_aged_flows.
> > + * This function can be called to get the aged flows asynchronously from the
> > + * event callback or synchronously regardless of the event.
> > + * It is not safe to call rte_flow_get_aged_flows together with other flow
> > + * functions from multiple threads simultaneously.
> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param[in, out] contexts
> > + *   The address of an array of pointers to the aged-out flows contexts.
> > + * @param[in] nb_contexts
> > + *   The length of context array pointers.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL. Initialized in case of
> > + *   error only.
> > + *
> > + * @return
> > + *   if nb_contexts is 0, return the amount of all aged contexts.
> > + *   if nb_contexts is not 0 , return the amount of aged flows reported
> > + *   in the context array, otherwise negative errno value.
> > + *
> > + * @see rte_flow_action_age
> > + * @see RTE_ETH_EVENT_FLOW_AGED
> > + */
> > +__rte_experimental
> > +int
> > +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> > +			uint32_t nb_contexts, struct rte_flow_error *error);
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_ethdev/rte_flow_driver.h
> > b/lib/librte_ethdev/rte_flow_driver.h
> > index 51a9a57a0f..881cc469b7 100644
> > --- a/lib/librte_ethdev/rte_flow_driver.h
> > +++ b/lib/librte_ethdev/rte_flow_driver.h
> > @@ -101,6 +101,12 @@ struct rte_flow_ops {
> >  		(struct rte_eth_dev *dev,
> >  		 FILE *file,
> >  		 struct rte_flow_error *error);
> > +	/** See rte_flow_get_aged_flows() */
> > +	int (*get_aged_flows)
> > +		(struct rte_eth_dev *dev,
> > +		 void **context,
> > +		 uint32_t nb_contexts,
> > +		 struct rte_flow_error *err);
> >  };
> >
> >  /**
> > --
> > 2.21.0



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-14  8:49       ` Ori Kam
  2020-04-14  9:23         ` Bill Zhou
@ 2020-04-16 13:32         ` Ferruh Yigit
  1 sibling, 0 replies; 50+ messages in thread
From: Ferruh Yigit @ 2020-04-16 13:32 UTC (permalink / raw)
  To: Ori Kam, Bill Zhou, Matan Azrad, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, arybchenko
  Cc: dev

On 4/14/2020 9:49 AM, Ori Kam wrote:
> 
> 
>> -----Original Message-----
>> From: Dong Zhou <dongz@mellanox.com>
>> Sent: Tuesday, April 14, 2020 11:33 AM
>> To: Ori Kam <orika@mellanox.com>; Matan Azrad <matan@mellanox.com>;
>> wenzhuo.lu@intel.com; jingjing.wu@intel.com; bernard.iremonger@intel.com;
>> john.mcnamara@intel.com; marko.kovacevic@intel.com; Thomas Monjalon
>> <thomas@monjalon.net>; ferruh.yigit@intel.com; arybchenko@solarflare.com
>> Cc: dev@dpdk.org
>> Subject: [PATCH v2] ethdev: support flow aging
>>
>> One of the reasons to destroy a flow is the fact that no packet matches the
>> flow for "timeout" time.
>> For example, when TCP\UDP sessions are suddenly closed.
>>
>> Currently, there is not any DPDK mechanism for flow aging and the
>> applications use their own ways to detect and destroy aged-out flows.
>>
>> The flow aging implementation need include:
>> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>>   the application flow context for each flow.
>> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>>   that there are new aged-out flows.
>> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>>   contexts from the port.
>> - Support input flow aging command line in Testpmd.
>>
>> Signed-off-by: Dong Zhou <dongz@mellanox.com>
>> ---
> As said before, nice patch; I hope to see more patches from you.
> Just a small nit: next time, please add a change log.
> 
> Acked-by: Ori Kam <orika@mellanox.com>

Moved other acks from v1:
    Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
    Acked-by: Jerin Jacob <jerinj@marvell.com>
    Acked-by: Matan Azrad <matan@mellanox.com>

Applied to dpdk-next-net/master, thanks.



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-14  8:32     ` [dpdk-dev] [PATCH v2] " Dong Zhou
  2020-04-14  8:49       ` Ori Kam
@ 2020-04-17 22:00       ` Ferruh Yigit
  2020-04-17 22:07         ` Stephen Hemminger
  2020-04-18  5:04         ` Bill Zhou
  2020-04-21  6:22       ` [dpdk-dev] [PATCH v3] " Bill Zhou
  2 siblings, 2 replies; 50+ messages in thread
From: Ferruh Yigit @ 2020-04-17 22:00 UTC (permalink / raw)
  To: Dong Zhou, orika, matan, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, thomas,
	arybchenko
  Cc: dev

On 4/14/2020 9:32 AM, Dong Zhou wrote:
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is not any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
> 
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> - Support input flow aging command line in Testpmd.
> 
> Signed-off-by: Dong Zhou <dongz@mellanox.com>

<...>

> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };


I just noticed that this fails the ABI check [1]. Last time, a QAT patch was
dropped because of a similar enum warning, so does this need to wait for
20.11 too?


[1]
  [C]'function int _rte_eth_dev_callback_process(rte_eth_dev*,
rte_eth_event_type, void*)' at rte_ethdev.c:4063:1 has some indirect sub-type
changes:
    parameter 2 of type 'enum rte_eth_event_type' has sub-type changes:
      type size hasn't changed
      1 enumerator insertion:
        'rte_eth_event_type::RTE_ETH_EVENT_FLOW_AGED' value '10'
      1 enumerator change:
        'rte_eth_event_type::RTE_ETH_EVENT_MAX' from value '10' to '11' at
rte_ethdev.h:3008:1



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-17 22:00       ` Ferruh Yigit
@ 2020-04-17 22:07         ` Stephen Hemminger
  2020-04-18  5:04         ` Bill Zhou
  1 sibling, 0 replies; 50+ messages in thread
From: Stephen Hemminger @ 2020-04-17 22:07 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Dong Zhou, orika, matan, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, thomas,
	arybchenko, dev

On Fri, 17 Apr 2020 23:00:57 +0100
Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> On 4/14/2020 9:32 AM, Dong Zhou wrote:
> > One of the reasons to destroy a flow is the fact that no packet matches the
> > flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> > 
> > Currently, there is not any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> > 
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
> >   the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
> >   that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >   contexts from the port.
> > - Support input flow aging command line in Testpmd.
> > 
> > Signed-off-by: Dong Zhou <dongz@mellanox.com>  
> 
> <...>
> 
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> >  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >  };  
> 
> 
> Just recognized that this is failing in ABI check [1], as far as last time for a
> similar enum warning a QAT patch has been dropped, should this need to wait for
> 20.11 too?
> 
> 
> [1]
>   [C]'function int _rte_eth_dev_callback_process(rte_eth_dev*,
> rte_eth_event_type, void*)' at rte_ethdev.c:4063:1 has some indirect sub-type
> changes:
>     parameter 2 of type 'enum rte_eth_event_type' has sub-type changes:
>       type size hasn't changed
>       1 enumerator insertion:
>         'rte_eth_event_type::RTE_ETH_EVENT_FLOW_AGED' value '10'
>       1 enumerator change:
>         'rte_eth_event_type::RTE_ETH_EVENT_MAX' from value '10' to '11' at
> rte_ethdev.h:3008:1
> 

For 20.11, those _MAX values need to be removed from the enums.


* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-17 22:00       ` Ferruh Yigit
  2020-04-17 22:07         ` Stephen Hemminger
@ 2020-04-18  5:04         ` Bill Zhou
  2020-04-18  9:44           ` Thomas Monjalon
  1 sibling, 1 reply; 50+ messages in thread
From: Bill Zhou @ 2020-04-18  5:04 UTC (permalink / raw)
  To: Ferruh Yigit, Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, arybchenko
  Cc: dev



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Saturday, April 18, 2020 6:01 AM
> To: Bill Zhou <dongz@mellanox.com>; Ori Kam <orika@mellanox.com>;
> Matan Azrad <matan@mellanox.com>; wenzhuo.lu@intel.com;
> jingjing.wu@intel.com; bernard.iremonger@intel.com;
> john.mcnamara@intel.com; marko.kovacevic@intel.com; Thomas Monjalon
> <thomas@monjalon.net>; arybchenko@solarflare.com
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
> 
> On 4/14/2020 9:32 AM, Dong Zhou wrote:
> > One of the reasons to destroy a flow is the fact that no packet
> > matches the flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is not any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> >
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the
> timeout and
> >   the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> >   that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >   contexts from the port.
> > - Support input flow aging command line in Testpmd.
> >
> > Signed-off-by: Dong Zhou <dongz@mellanox.com>
> 
> <...>
> 
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
> */
> >  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >  };
> 
> 
> Just recognized that this is failing in ABI check [1], as far as last time for a
> similar enum warning a QAT patch has been dropped, should this need to
> wait for
> 20.11 too?

This patch provides the common API for flow aging; two other patches, sent
in reply to this one, implement flow aging in the mlx5 driver.
Our schedule has this feature merged in 20.05 for some customers. Can it
be fixed?

> 
> 
> [1]
>   [C]'function int _rte_eth_dev_callback_process(rte_eth_dev*,
> rte_eth_event_type, void*)' at rte_ethdev.c:4063:1 has some indirect sub-
> type
> changes:
>     parameter 2 of type 'enum rte_eth_event_type' has sub-type changes:
>       type size hasn't changed
>       1 enumerator insertion:
>         'rte_eth_event_type::RTE_ETH_EVENT_FLOW_AGED' value '10'
>       1 enumerator change:
>         'rte_eth_event_type::RTE_ETH_EVENT_MAX' from value '10' to '11' at
> rte_ethdev.h:3008:1



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-18  5:04         ` Bill Zhou
@ 2020-04-18  9:44           ` Thomas Monjalon
  2020-04-20 14:06             ` Ferruh Yigit
  0 siblings, 1 reply; 50+ messages in thread
From: Thomas Monjalon @ 2020-04-18  9:44 UTC (permalink / raw)
  To: Ferruh Yigit, Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, arybchenko,
	Bill Zhou
  Cc: dev

18/04/2020 07:04, Bill Zhou:
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> > On 4/14/2020 9:32 AM, Dong Zhou wrote:
> > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> > >  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> > >  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > >  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
> > */
> > >  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> > >  };
> > 
> > 
> > Just recognized that this is failing in ABI check [1], as far as last time for a
> > similar enum warning a QAT patch has been dropped, should this need to
> > wait for
> > 20.11 too?
> 
> This patch is commonly used for flow aging, there are 2 other patches have 
> implement flow aging in mlx5 driver reply to this patch.
> In our schedule, this feature is merged in 20.05 for some customers. Can it
> be fixed?

These MAX values in enums are a pain.
We can try to think about what can be done while waiting for 20.11.
I am not sure there is a solution, except hijacking an existing value
not used in the PMD until the definitive value can be added in 20.11...




* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-18  9:44           ` Thomas Monjalon
@ 2020-04-20 14:06             ` Ferruh Yigit
  2020-04-20 16:10               ` Thomas Monjalon
  0 siblings, 1 reply; 50+ messages in thread
From: Ferruh Yigit @ 2020-04-20 14:06 UTC (permalink / raw)
  To: Thomas Monjalon, Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, arybchenko,
	Bill Zhou
  Cc: dev

On 4/18/2020 10:44 AM, Thomas Monjalon wrote:
> 18/04/2020 07:04, Bill Zhou:
>> From: Ferruh Yigit <ferruh.yigit@intel.com>
>>> On 4/14/2020 9:32 AM, Dong Zhou wrote:
>>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>>> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>>>>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>>>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>>> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
>>> */
>>>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>>>  };
>>>
>>>
>>> Just recognized that this is failing in ABI check [1], as far as last time for a
>>> similar enum warning a QAT patch has been dropped, should this need to
>>> wait for
>>> 20.11 too?
>>
>> This patch is commonly used for flow aging, there are 2 other patches have 
>> implement flow aging in mlx5 driver reply to this patch.
>> In our schedule, this feature is merged in 20.05 for some customers. Can it
>> be fixed?
> 
> These MAX values in enums are a pain.
> We can try to think what can be done, waiting 20.11.
> Not sure there is a solution, except hijacking an existing value
> not used in the PMD, waiting the definitive value in 20.11...
> 

Dropping it from the tree for now, to avoid causing more merge conflicts; we
can add it back later when the issue is resolved.


* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-20 14:06             ` Ferruh Yigit
@ 2020-04-20 16:10               ` Thomas Monjalon
  2020-04-21 10:04                 ` Ferruh Yigit
  0 siblings, 1 reply; 50+ messages in thread
From: Thomas Monjalon @ 2020-04-20 16:10 UTC (permalink / raw)
  To: Bill Zhou, Ferruh Yigit
  Cc: Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, arybchenko, dev

20/04/2020 16:06, Ferruh Yigit:
> On 4/18/2020 10:44 AM, Thomas Monjalon wrote:
> > 18/04/2020 07:04, Bill Zhou:
> >> From: Ferruh Yigit <ferruh.yigit@intel.com>
> >>> On 4/14/2020 9:32 AM, Dong Zhou wrote:
> >>>> --- a/lib/librte_ethdev/rte_ethdev.h
> >>>> +++ b/lib/librte_ethdev/rte_ethdev.h
> >>>> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> >>>>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >>>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >>>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> >>>> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
> >>> */
> >>>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >>>>  };
> >>>
> >>>
> >>> Just recognized that this is failing in ABI check [1], as far as last time for a
> >>> similar enum warning a QAT patch has been dropped, should this need to
> >>> wait for
> >>> 20.11 too?
> >>
> >> This patch is commonly used for flow aging, there are 2 other patches have 
> >> implement flow aging in mlx5 driver reply to this patch.
[...]
> > These MAX values in enums are a pain.
> > We can try to think what can be done, waiting 20.11.
> > Not sure there is a solution, except hijacking an existing value
> > not used in the PMD, waiting the definitive value in 20.11...
> 
> Dropping from the tree as of now, to not cause more merge conflicts, we can add
> it later when issue is resolved.

Thanks for dropping, that's the right thing to do
when a patch is breaking ABI check.

After some thought, I think it is acceptable to make a v3
which ignores this specific enum change. I explain my reasoning below:

An enum can accept a new value under 2 conditions:
	- added as last value (not changing old values)
	- new value not used by existing API

The value RTE_ETH_EVENT_FLOW_AGED meets the above 2 conditions:
	- only RTE_ETH_EVENT_MAX is changed, which is consistent
	- new value sent to the app only if the app registered for it

So, unless I am missing something, I suggest we add this exception:
Allow a new value in rte_eth_event_type if added just before RTE_ETH_EVENT_MAX.
In other words, allow changing the value of RTE_ETH_EVENT_MAX.
The file to add such an exception to is devtools/libabigail.abignore.
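
To make the two conditions above concrete, here is a minimal sketch. It
assumes shortened, hypothetical copies of the enum before and after the patch
(the OLD_/NEW_ names and both helper functions are made up for illustration and
are not DPDK code). It only checks that every pre-existing value keeps its
number and that the new value takes the slot the old MAX had:

```c
/* Hypothetical, shortened copy of the enum BEFORE the patch. */
enum old_event_type {
	OLD_EVENT_NEW,
	OLD_EVENT_DESTROY,
	OLD_EVENT_IPSEC,
	OLD_EVENT_MAX
};

/* Hypothetical, shortened copy of the enum AFTER the patch. */
enum new_event_type {
	NEW_EVENT_NEW,
	NEW_EVENT_DESTROY,
	NEW_EVENT_IPSEC,
	NEW_EVENT_FLOW_AGED, /* inserted just before MAX */
	NEW_EVENT_MAX
};

/* Return 1 if every value that existed before kept its number. */
static int old_values_unchanged(void)
{
	return (int)OLD_EVENT_NEW == (int)NEW_EVENT_NEW &&
	       (int)OLD_EVENT_DESTROY == (int)NEW_EVENT_DESTROY &&
	       (int)OLD_EVENT_IPSEC == (int)NEW_EVENT_IPSEC;
}

/* Return 1 if the new value reuses the old MAX and MAX moved up by one. */
static int only_max_changed(void)
{
	return (int)NEW_EVENT_FLOW_AGED == (int)OLD_EVENT_MAX &&
	       (int)NEW_EVENT_MAX == (int)OLD_EVENT_MAX + 1;
}
```

The sketch says nothing about applications that size arrays with the old MAX
value; that case is only safe because of the second condition, i.e. the new
value is delivered only to applications that registered for it.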





* [dpdk-dev] [PATCH v3] ethdev: support flow aging
  2020-04-14  8:32     ` [dpdk-dev] [PATCH v2] " Dong Zhou
  2020-04-14  8:49       ` Ori Kam
  2020-04-17 22:00       ` Ferruh Yigit
@ 2020-04-21  6:22       ` Bill Zhou
  2020-04-21 10:11         ` [dpdk-dev] [PATCH v4] " Bill Zhou
  2 siblings, 1 reply; 50+ messages in thread
From: Bill Zhou @ 2020-04-21  6:22 UTC (permalink / raw)
  To: orika, matan, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, thomas, ferruh.yigit, arybchenko
  Cc: dev, Dong Zhou

From: Dong Zhou <dongz@mellanox.com>

One of the reasons to destroy a flow is that no packet has matched the
flow for "timeout" time.
For example, when TCP/UDP sessions are suddenly closed.

Currently, there is no DPDK mechanism for flow aging, and
applications use their own ways to detect and destroy aged-out flows.

The flow aging implementation needs to include:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
  contexts from the port.
- Support for the flow aging command line in testpmd.

Signed-off-by: Dong Zhou <dongz@mellanox.com>
---
v2: Removed the line "* Added support for flow Aging mechanism base on counter."
from doc/guides/rel_notes/release_20_05.rst, since this patch does not
include that support.
---
v3: Updated devtools/libabigail.abignore to add a new suppressed enumeration
type for RTE_ETH_EVENT_MAX.
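
Note for reviewers: the return convention documented for
rte_flow_get_aged_flows (nb_contexts == 0 returns the total count, otherwise
the number of contexts written) suggests a two-step drain. The sketch below is
only an illustration under stated assumptions: mock_get_aged_flows and
drain_aged_flows are hypothetical names, the mock stands in for a PMD and keeps
a fixed list of three contexts, and no DPDK header is used.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a port's aged-out flow contexts. */
static int flow_a, flow_b, flow_c;
static void *mock_aged[] = { &flow_a, &flow_b, &flow_c };
static const uint32_t mock_nb_aged = 3;

/*
 * Mock that follows the documented return convention:
 * nb_contexts == 0 -> total number of aged contexts,
 * otherwise        -> number of contexts copied into the array.
 */
static int mock_get_aged_flows(void **contexts, uint32_t nb_contexts)
{
	uint32_t i;

	if (nb_contexts == 0)
		return (int)mock_nb_aged;
	if (contexts == NULL)
		return -EINVAL;
	for (i = 0; i < nb_contexts && i < mock_nb_aged; i++)
		contexts[i] = mock_aged[i];
	return (int)i;
}

/*
 * Two-step drain: ask for the count, allocate, then fetch.
 * On success *out holds the contexts (caller frees) and the
 * number fetched is returned; otherwise *out is NULL.
 */
static int drain_aged_flows(void ***out)
{
	int total = mock_get_aged_flows(NULL, 0);
	void **ctx;
	int got;

	*out = NULL;
	if (total <= 0)
		return total;
	ctx = calloc((size_t)total, sizeof(*ctx));
	if (ctx == NULL)
		return -ENOMEM;
	got = mock_get_aged_flows(ctx, (uint32_t)total);
	if (got < 0) {
		free(ctx);
		return got;
	}
	*out = ctx;
	return got;
}
```

In a real application the same pattern would presumably run from the
RTE_ETH_EVENT_FLOW_AGED callback or from a polling loop, with the caveat from
the API comment that it must not race with other flow functions.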
---
 app/test-pmd/cmdline_flow.c              | 26 ++++++++++
 devtools/libabigail.abignore             |  4 ++
 doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
 doc/guides/rel_notes/release_20_05.rst   | 11 +++++
 lib/librte_ethdev/rte_ethdev.h           |  1 +
 lib/librte_ethdev/rte_ethdev_version.map |  3 ++
 lib/librte_ethdev/rte_flow.c             | 18 +++++++
 lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_flow_driver.h      |  6 +++
 9 files changed, 153 insertions(+)

diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
index e6ab8ff2f7..45bcff3cf5 100644
--- a/app/test-pmd/cmdline_flow.c
+++ b/app/test-pmd/cmdline_flow.c
@@ -343,6 +343,8 @@ enum index {
 	ACTION_SET_IPV4_DSCP_VALUE,
 	ACTION_SET_IPV6_DSCP,
 	ACTION_SET_IPV6_DSCP_VALUE,
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
 };
 
 /** Maximum size for pattern in struct rte_flow_item_raw. */
@@ -1145,6 +1147,7 @@ static const enum index next_action[] = {
 	ACTION_SET_META,
 	ACTION_SET_IPV4_DSCP,
 	ACTION_SET_IPV6_DSCP,
+	ACTION_AGE,
 	ZERO,
 };
 
@@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[] = {
 	ZERO,
 };
 
+static const enum index action_age[] = {
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
+	ACTION_NEXT,
+	ZERO,
+};
+
 static int parse_set_raw_encap_decap(struct context *, const struct token *,
 				     const char *, unsigned int,
 				     void *, unsigned int);
@@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
 			     (struct rte_flow_action_set_dscp, dscp)),
 		.call = parse_vc_conf,
 	},
+	[ACTION_AGE] = {
+		.name = "age",
+		.help = "report a flow as aged-out after a timeout",
+		.next = NEXT(action_age),
+		.priv = PRIV_ACTION(AGE,
+			sizeof(struct rte_flow_action_age)),
+		.call = parse_vc,
+	},
+	[ACTION_AGE_TIMEOUT] = {
+		.name = "timeout",
+		.help = "flow age timeout value",
+		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
+					   timeout, 24)),
+		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
+		.call = parse_vc_conf,
+	},
 };
 
 /** Remove and return last entry from argument stack. */
diff --git a/devtools/libabigail.abignore b/devtools/libabigail.abignore
index a59df8f135..949e40fbd1 100644
--- a/devtools/libabigail.abignore
+++ b/devtools/libabigail.abignore
@@ -11,3 +11,7 @@
         type_kind = enum
         name = rte_crypto_asym_xform_type
         changed_enumerators = RTE_CRYPTO_ASYM_XFORM_TYPE_LIST_END
+[suppress_type]
+        type_kind = enum
+        name = rte_eth_event_type
+        changed_enumerators = RTE_ETH_EVENT_MAX
diff --git a/doc/guides/prog_guide/rte_flow.rst b/doc/guides/prog_guide/rte_flow.rst
index 41c147913c..cf4368e1c4 100644
--- a/doc/guides/prog_guide/rte_flow.rst
+++ b/doc/guides/prog_guide/rte_flow.rst
@@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
    | ``dscp``  | DSCP in low 6 bits, rest ignore |
    +-----------+---------------------------------+
 
+Action: ``AGE``
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Set the ageing timeout configuration of a flow.
+
+Event RTE_ETH_EVENT_FLOW_AGED will be reported if the
+timeout passes without any packet matching the flow.
+
+.. _table_rte_flow_action_age:
+
+.. table:: AGE
+
+   +--------------+---------------------------------+
+   | Field        | Value                           |
+   +==============+=================================+
+   | ``timeout``  | 24 bits timeout value           |
+   +--------------+---------------------------------+
+   | ``reserved`` | 8 bits reserved, must be zero   |
+   +--------------+---------------------------------+
+   | ``context``  | user input flow context         |
+   +--------------+---------------------------------+
+
 Negative types
 ~~~~~~~~~~~~~~
 
diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index bacd4c65a2..ff0cf9f1d6 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -135,6 +135,17 @@ New Features
   by making use of the event device capabilities. The event mode currently supports
   only inline IPsec protocol offload.
 
+* **Added flow Aging Support.**
+
+  Added flow Aging support to detect and report aged-out flows, including:
+
+  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and the
+    application flow context for each flow.
+  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to report that
+    there are new aged-out flows.
+  * Added new API: rte_flow_get_aged_flows to get the aged-out flows contexts
+    from the port.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 8d69b88f9e..00cc7b4052 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3018,6 +3018,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows are detected */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 3f32fdecf7..fa4b5816be 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -230,4 +230,7 @@ EXPERIMENTAL {
 
 	# added in 20.02
 	rte_flow_dev_dump;
+
+	# added in 20.05
+	rte_flow_get_aged_flows;
 };
diff --git a/lib/librte_ethdev/rte_flow.c b/lib/librte_ethdev/rte_flow.c
index a5ac1c7fbd..3699edce49 100644
--- a/lib/librte_ethdev/rte_flow.c
+++ b/lib/librte_ethdev/rte_flow.c
@@ -172,6 +172,7 @@ static const struct rte_flow_desc_data rte_flow_desc_action[] = {
 	MK_FLOW_ACTION(SET_META, sizeof(struct rte_flow_action_set_meta)),
 	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct rte_flow_action_set_dscp)),
 	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct rte_flow_action_set_dscp)),
+	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
 };
 
 int
@@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file, struct rte_flow_error *error)
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 				  NULL, rte_strerror(ENOSYS));
 }
+
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+		    uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->get_aged_flows))
+		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
+				nb_contexts, error), error);
+	return rte_flow_error_set(error, ENOTSUP,
+				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+				  NULL, rte_strerror(ENOTSUP));
+}
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 7f3e08fad3..fab44f6c0b 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_dscp.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
+
+	/**
+	 * Report as aged flow if timeout passed without any matching on the
+	 * flow.
+	 *
+	 * See struct rte_flow_action_age.
+	 * See function rte_flow_get_aged_flows.
+	 * See enum RTE_ETH_EVENT_FLOW_AGED.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
 	uint16_t index; /**< Queue index to use. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report flow as aged-out if timeout passed without any matching
+ * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
+ * port detects new aged-out flows.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint32_t timeout:24; /**< Time in seconds. */
+	uint32_t reserved:8; /**< Reserved, must be zero. */
+	void *context;
+		/**< The user flow context, NULL means the rte_flow pointer. */
+};
 
 /**
  * @warning
@@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * RTE_ETH_EVENT_FLOW_AGED event will be triggered when at least one new
+ * aged-out flow was detected after the last call to rte_flow_get_aged_flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously regardless of the event.
+ * It is not safe to call rte_flow_get_aged_flows concurrently with other flow
+ * functions from multiple threads.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in, out] contexts
+ *   The address of an array of pointers to the aged-out flow contexts.
+ * @param[in] nb_contexts
+ *   The length of the contexts array.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   If nb_contexts is 0, the total number of aged contexts.
+ *   If nb_contexts is not 0, the number of aged flows reported
+ *   in the contexts array. A negative errno value on error.
+ *
+ * @see rte_flow_action_age
+ * @see RTE_ETH_EVENT_FLOW_AGED
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_flow_driver.h b/lib/librte_ethdev/rte_flow_driver.h
index 51a9a57a0f..881cc469b7 100644
--- a/lib/librte_ethdev/rte_flow_driver.h
+++ b/lib/librte_ethdev/rte_flow_driver.h
@@ -101,6 +101,12 @@ struct rte_flow_ops {
 		(struct rte_eth_dev *dev,
 		 FILE *file,
 		 struct rte_flow_error *error);
+	/** See rte_flow_get_aged_flows() */
+	int (*get_aged_flows)
+		(struct rte_eth_dev *dev,
+		 void **context,
+		 uint32_t nb_contexts,
+		 struct rte_flow_error *err);
 };
 
 /**
-- 
2.21.0



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-20 16:10               ` Thomas Monjalon
@ 2020-04-21 10:04                 ` Ferruh Yigit
  2020-04-21 10:09                   ` Thomas Monjalon
  2020-04-21 15:59                   ` Andrew Rybchenko
  0 siblings, 2 replies; 50+ messages in thread
From: Ferruh Yigit @ 2020-04-21 10:04 UTC (permalink / raw)
  To: Thomas Monjalon, Bill Zhou
  Cc: Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, arybchenko, dev

On 4/20/2020 5:10 PM, Thomas Monjalon wrote:
> 20/04/2020 16:06, Ferruh Yigit:
>> On 4/18/2020 10:44 AM, Thomas Monjalon wrote:
>>> 18/04/2020 07:04, Bill Zhou:
>>>> From: Ferruh Yigit <ferruh.yigit@intel.com>
>>>>> On 4/14/2020 9:32 AM, Dong Zhou wrote:
>>>>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>>>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>>>>> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>>>>>>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>>>>>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>>>>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>>>>> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
>>>>> */
>>>>>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>>>>>  };
>>>>>
>>>>>
>>>>> Just recognized that this is failing in ABI check [1], as far as last time for a
>>>>> similar enum warning a QAT patch has been dropped, should this need to
>>>>> wait for
>>>>> 20.11 too?
>>>>
>>>> This patch is commonly used for flow aging, there are 2 other patches have 
>>>> implement flow aging in mlx5 driver reply to this patch.
> [...]
>>> These MAX values in enums are a pain.
>>> We can try to think what can be done, waiting 20.11.
>>> Not sure there is a solution, except hijacking an existing value
>>> not used in the PMD, waiting the definitive value in 20.11...
>>
>> Dropping from the tree as of now, to not cause more merge conflicts, we can add
>> it later when issue is resolved.
> 
> Thanks for dropping, that's the right thing to do
> when a patch is breaking ABI check.
> 
> After some thoughts, I think it is acceptable to make a v3
> which ignore this specific enum change. I explain my thought below:
> 
> An enum can accept a new value at 2 conditions:
> 	- added as last value (not changing old values)
> 	- new value not used by existing API
> 
> The value RTE_ETH_EVENT_FLOW_AGED meet the above 2 conditions:
> 	- only RTE_ETH_EVENT_MAX is changed, which is consistent
> 	- new value sent to the app only if the app registered for it
> 

Same here, as far as I can see it is safe to get this change.

If any DPDK API returns this enum, either as the return value of the API or as
an output parameter, this can still be a problem, because the application may
use that returned value; this was the concern in the QAT sample.

But here the application registers for an event and the DPDK library processes
the callback for it, so application callbacks won't be called for anything that
the application doesn't already know about; in that respect this should be safe
for old applications.

Not sure if we can generalize above two conditions for all enum changes, but we
can investigate them case by case as we get the warnings.

> So, except if I miss something, I suggest we add this exception:
> Allow new value in rte_eth_event_type if added just before RTE_ETH_EVENT_MAX.
> In other words, allow changing the value of RTE_ETH_EVENT_MAX.
> The file to add such exception is devtools/libabigail.abignore.
> 

OK to exception.
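
The argument above can be sketched in plain C without DPDK. This is only an
illustrative model, not the ethdev implementation: the demo_* enums, the
single global callback table and the helper names are all hypothetical (real
ethdev callbacks are kept per port). It shows that adding a value just before
MAX leaves the old values unchanged, and that an application built against the
old enum is never called for the new event because it never registered for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "before" and "after" versions of an event enum. */
enum demo_event_v1 {
	DEMO_V1_NEW,
	DEMO_V1_DESTROY,
	DEMO_V1_IPSEC,
	DEMO_V1_MAX
};

enum demo_event_v2 {
	DEMO_V2_NEW,
	DEMO_V2_DESTROY,
	DEMO_V2_IPSEC,
	DEMO_V2_FLOW_AGED, /* new value, added just before MAX */
	DEMO_V2_MAX
};

typedef void (*demo_cb_t)(int event, void *arg);

/* Library-side table, built against the new enum. */
static demo_cb_t demo_cbs[DEMO_V2_MAX];
static void *demo_cb_args[DEMO_V2_MAX];

static int
demo_register(int event, demo_cb_t cb, void *arg)
{
	if (event < 0 || event >= DEMO_V2_MAX || cb == NULL)
		return -1;
	demo_cbs[event] = cb;
	demo_cb_args[event] = arg;
	return 0;
}

/* Returns 1 if a callback ran, 0 if nobody registered for the event. */
static int
demo_dispatch(int event)
{
	assert(event >= 0 && event < DEMO_V2_MAX);
	if (demo_cbs[event] == NULL)
		return 0;
	demo_cbs[event](event, demo_cb_args[event]);
	return 1;
}

/* Callback of an "old" application: counts the events it receives. */
static void
demo_count_cb(int event, void *arg)
{
	(void)event;
	++*(int *)arg;
}
```

An application compiled against the v1 enum can only register values below
DEMO_V1_MAX, so dispatching DEMO_V2_FLOW_AGED finds no callback for it.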



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-21 10:04                 ` Ferruh Yigit
@ 2020-04-21 10:09                   ` Thomas Monjalon
  2020-04-21 15:59                   ` Andrew Rybchenko
  1 sibling, 0 replies; 50+ messages in thread
From: Thomas Monjalon @ 2020-04-21 10:09 UTC (permalink / raw)
  To: Bill Zhou, Ferruh Yigit
  Cc: Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, arybchenko, dev

21/04/2020 12:04, Ferruh Yigit:
> On 4/20/2020 5:10 PM, Thomas Monjalon wrote:
> > 20/04/2020 16:06, Ferruh Yigit:
> >> On 4/18/2020 10:44 AM, Thomas Monjalon wrote:
> >>> 18/04/2020 07:04, Bill Zhou:
> >>>> From: Ferruh Yigit <ferruh.yigit@intel.com>
> >>>>> On 4/14/2020 9:32 AM, Dong Zhou wrote:
> >>>>>> --- a/lib/librte_ethdev/rte_ethdev.h
> >>>>>> +++ b/lib/librte_ethdev/rte_ethdev.h
> >>>>>> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> >>>>>>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >>>>>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >>>>>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> >>>>>> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
> >>>>> */
> >>>>>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >>>>>>  };
> >>>>>
> >>>>>
> >>>>> Just recognized that this is failing in ABI check [1], as far as last time for a
> >>>>> similar enum warning a QAT patch has been dropped, should this need to
> >>>>> wait for
> >>>>> 20.11 too?
> >>>>
> >>>> This patch is commonly used for flow aging, there are 2 other patches have 
> >>>> implement flow aging in mlx5 driver reply to this patch.
> > [...]
> >>> These MAX values in enums are a pain.
> >>> We can try to think what can be done, waiting 20.11.
> >>> Not sure there is a solution, except hijacking an existing value
> >>> not used in the PMD, waiting the definitive value in 20.11...
> >>
> >> Dropping from the tree as of now, to not cause more merge conflicts, we can add
> >> it later when issue is resolved.
> > 
> > Thanks for dropping, that's the right thing to do
> > when a patch is breaking ABI check.
> > 
> > After some thoughts, I think it is acceptable to make a v3
> > which ignore this specific enum change. I explain my thought below:
> > 
> > An enum can accept a new value at 2 conditions:
> > 	- added as last value (not changing old values)
> > 	- new value not used by existing API
> > 
> > The value RTE_ETH_EVENT_FLOW_AGED meet the above 2 conditions:
> > 	- only RTE_ETH_EVENT_MAX is changed, which is consistent
> > 	- new value sent to the app only if the app registered for it
> > 
> 
> Same here, as far as I can see it is safe to get this change.
> 
> If any DPDK API returns this enum, either as return of the API or as output
> parameter, this still can be problem, because application may use that returned
> value, this was the concern in the QAT sample.
> 
> But here application registers an event and DPDK library process callback for
> it, so application callbacks won't be called for anything that application
> doesn't already know about, in that respect this should be safe for old
> applications.
> 
> Not sure if we can generalize above two conditions for all enum changes, but we
> can investigate them case by case as we get the warnings.
> 
> > So, except if I miss something, I suggest we add this exception:
> > Allow new value in rte_eth_event_type if added just before RTE_ETH_EVENT_MAX.
> > In other words, allow changing the value of RTE_ETH_EVENT_MAX.
> > The file to add such exception is devtools/libabigail.abignore.
> > 
> 
> OK to exception.

v3 was sent.
I hope we'll get a v4 with justification for the exception.





* [dpdk-dev] [PATCH v4] ethdev: support flow aging
  2020-04-21  6:22       ` [dpdk-dev] [PATCH v3] " Bill Zhou
@ 2020-04-21 10:11         ` Bill Zhou
  2020-04-21 17:13           ` Ferruh Yigit
  2020-04-29 14:50           ` Tom Barbette
  0 siblings, 2 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-21 10:11 UTC (permalink / raw)
  To: orika, matan, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, thomas, ferruh.yigit, arybchenko
  Cc: dev, Dong Zhou

From: Dong Zhou <dongz@mellanox.com>

One of the reasons to destroy a flow is that no packet has matched the
flow for a "timeout" period, for example when TCP/UDP sessions are
suddenly closed.

Currently, DPDK has no mechanism for flow aging, so applications use
their own ways to detect and destroy aged-out flows.

The flow aging implementation needs to include:
- A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
  the application flow context for each flow.
- A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
  that there are new aged-out flows.
- A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flow
  contexts from the port.
- Support for the flow aging action in the testpmd command line.

The addition of the new event type to the enum is flagged as an ABI
breakage, so an ignore rule is added for these reasons:
- It does not change the value of existing types (except MAX).
- The new value is not used by the existing API if the event is not
  registered.
In general, it is safe to add new ethdev event types at the end of the
enum, because of the event callback registration mechanism.

Signed-off-by: Dong Zhou <dongz@mellanox.com>
---
v2: Remove the line "* Added support for flow Aging mechanism base on counter."
from doc/guides/rel_notes/release_20_05.rst, because this patch does not
include this support.

v3: Update file libabigail.abignore, add one new suppressed enumeration
type for RTE_ETH_EVENT_MAX.

v4: Add justification in devtools/libabigail.abignore and in the commit
log about the modification of v3.
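
For illustration only, the return contract documented for
rte_flow_get_aged_flows can be modelled in plain C without DPDK. The sketch
below is a stand-in under stated assumptions: struct demo_port,
demo_get_aged_flows and demo_collect_aged are hypothetical names, the error
for a NULL array is an assumed choice, and whether a real PMD drops contexts
from its list once they are reported is not specified by this patch, so the
model leaves the list untouched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_AGED 8

/* Hypothetical stand-in for the per-port list of aged-out flow contexts. */
struct demo_port {
	void *aged[DEMO_MAX_AGED];
	uint32_t nb_aged;
};

/*
 * Model of the documented contract (not the DPDK implementation):
 * - nb_contexts == 0: return the number of all aged contexts;
 * - nb_contexts != 0: copy at most nb_contexts contexts and return
 *   how many were copied;
 * - negative errno value on error (assumed: -EINVAL for a NULL array).
 */
static int
demo_get_aged_flows(const struct demo_port *port, void **contexts,
		    uint32_t nb_contexts)
{
	uint32_t i, n;

	assert(port != NULL);
	if (nb_contexts == 0)
		return (int)port->nb_aged;
	if (contexts == NULL)
		return -EINVAL;
	n = nb_contexts < port->nb_aged ? nb_contexts : port->nb_aged;
	for (i = 0; i < n; i++)
		contexts[i] = port->aged[i];
	return (int)n;
}

/* Two-step use: ask for the count first, then fetch into a sized array. */
static int
demo_collect_aged(const struct demo_port *port, void **out, uint32_t out_len)
{
	int total = demo_get_aged_flows(port, NULL, 0);

	if (total <= 0)
		return total;
	if ((uint32_t)total > out_len)
		total = (int)out_len;
	return demo_get_aged_flows(port, out, (uint32_t)total);
}
```

A caller would typically run the two-step helper from the event callback or
from a periodic poll, then destroy the flows whose contexts were returned.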
---
 app/test-pmd/cmdline_flow.c              | 26 ++++++++++
 devtools/libabigail.abignore             |  6 +++
 doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
 doc/guides/rel_notes/release_20_05.rst   | 11 +++++
 lib/librte_ethdev/rte_ethdev.h           |  1 +
 lib/librte_ethdev/rte_ethdev_version.map |  3 ++
 lib/librte_ethdev/rte_flow.c             | 18 +++++++
 lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_flow_driver.h      |  6 +++
 9 files changed, 155 insertions(+)

diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
index e6ab8ff2f7..45bcff3cf5 100644
--- a/app/test-pmd/cmdline_flow.c
+++ b/app/test-pmd/cmdline_flow.c
@@ -343,6 +343,8 @@ enum index {
 	ACTION_SET_IPV4_DSCP_VALUE,
 	ACTION_SET_IPV6_DSCP,
 	ACTION_SET_IPV6_DSCP_VALUE,
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
 };
 
 /** Maximum size for pattern in struct rte_flow_item_raw. */
@@ -1145,6 +1147,7 @@ static const enum index next_action[] = {
 	ACTION_SET_META,
 	ACTION_SET_IPV4_DSCP,
 	ACTION_SET_IPV6_DSCP,
+	ACTION_AGE,
 	ZERO,
 };
 
@@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[] = {
 	ZERO,
 };
 
+static const enum index action_age[] = {
+	ACTION_AGE,
+	ACTION_AGE_TIMEOUT,
+	ACTION_NEXT,
+	ZERO,
+};
+
 static int parse_set_raw_encap_decap(struct context *, const struct token *,
 				     const char *, unsigned int,
 				     void *, unsigned int);
@@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
 			     (struct rte_flow_action_set_dscp, dscp)),
 		.call = parse_vc_conf,
 	},
+	[ACTION_AGE] = {
+		.name = "age",
+		.help = "set the flow aging action",
+		.next = NEXT(action_age),
+		.priv = PRIV_ACTION(AGE,
+			sizeof(struct rte_flow_action_age)),
+		.call = parse_vc,
+	},
+	[ACTION_AGE_TIMEOUT] = {
+		.name = "timeout",
+		.help = "flow age timeout value",
+		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
+					   timeout, 24)),
+		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
+		.call = parse_vc_conf,
+	},
 };
 
 /** Remove and return last entry from argument stack. */
diff --git a/devtools/libabigail.abignore b/devtools/libabigail.abignore
index a59df8f135..c047adbd79 100644
--- a/devtools/libabigail.abignore
+++ b/devtools/libabigail.abignore
@@ -11,3 +11,9 @@
         type_kind = enum
         name = rte_crypto_asym_xform_type
         changed_enumerators = RTE_CRYPTO_ASYM_XFORM_TYPE_LIST_END
+; Ignore ethdev event enum update because new event cannot be
+; received if not registered
+[suppress_type]
+        type_kind = enum
+        name = rte_eth_event_type
+        changed_enumerators = RTE_ETH_EVENT_MAX
diff --git a/doc/guides/prog_guide/rte_flow.rst b/doc/guides/prog_guide/rte_flow.rst
index 41c147913c..cf4368e1c4 100644
--- a/doc/guides/prog_guide/rte_flow.rst
+++ b/doc/guides/prog_guide/rte_flow.rst
@@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
    | ``dscp``  | DSCP in low 6 bits, rest ignore |
    +-----------+---------------------------------+
 
+Action: ``AGE``
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Set the aging timeout configuration of a flow.
+
+Event RTE_ETH_EVENT_FLOW_AGED will be reported if the
+timeout passes without any packet matching the flow.
+
+.. _table_rte_flow_action_age:
+
+.. table:: AGE
+
+   +--------------+---------------------------------+
+   | Field        | Value                           |
+   +==============+=================================+
+   | ``timeout``  | 24-bit timeout value in seconds |
+   +--------------+---------------------------------+
+   | ``reserved`` | 8 bits reserved, must be zero   |
+   +--------------+---------------------------------+
+   | ``context``  | user input flow context         |
+   +--------------+---------------------------------+
+
 Negative types
 ~~~~~~~~~~~~~~
 
diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index bacd4c65a2..ff0cf9f1d6 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -135,6 +135,17 @@ New Features
   by making use of the event device capabilities. The event mode currently supports
   only inline IPsec protocol offload.
 
+* **Added flow aging support.**
+
+  Added flow aging support to detect and report aged-out flows, including:
+
+  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and the
+    application flow context for each flow.
+  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to report that
+    there are new aged-out flows.
+  * Added new API: rte_flow_get_aged_flows to get the aged-out flow contexts
+    from the port.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 8d69b88f9e..00cc7b4052 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3018,6 +3018,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows are detected */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 3f32fdecf7..fa4b5816be 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -230,4 +230,7 @@ EXPERIMENTAL {
 
 	# added in 20.02
 	rte_flow_dev_dump;
+
+	# added in 20.05
+	rte_flow_get_aged_flows;
 };
diff --git a/lib/librte_ethdev/rte_flow.c b/lib/librte_ethdev/rte_flow.c
index a5ac1c7fbd..3699edce49 100644
--- a/lib/librte_ethdev/rte_flow.c
+++ b/lib/librte_ethdev/rte_flow.c
@@ -172,6 +172,7 @@ static const struct rte_flow_desc_data rte_flow_desc_action[] = {
 	MK_FLOW_ACTION(SET_META, sizeof(struct rte_flow_action_set_meta)),
 	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct rte_flow_action_set_dscp)),
 	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct rte_flow_action_set_dscp)),
+	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
 };
 
 int
@@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file, struct rte_flow_error *error)
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 				  NULL, rte_strerror(ENOSYS));
 }
+
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+		    uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->get_aged_flows))
+		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
+				nb_contexts, error), error);
+	return rte_flow_error_set(error, ENOTSUP,
+				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+				  NULL, rte_strerror(ENOTSUP));
+}
diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
index 7f3e08fad3..fab44f6c0b 100644
--- a/lib/librte_ethdev/rte_flow.h
+++ b/lib/librte_ethdev/rte_flow.h
@@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
 	 * See struct rte_flow_action_set_dscp.
 	 */
 	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
+
+	/**
+	 * Report as aged flow if timeout passed without any matching on the
+	 * flow.
+	 *
+	 * See struct rte_flow_action_age.
+	 * See function rte_flow_get_aged_flows.
+	 * See enum RTE_ETH_EVENT_FLOW_AGED.
+	 */
+	RTE_FLOW_ACTION_TYPE_AGE,
 };
 
 /**
@@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
 	uint16_t index; /**< Queue index to use. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice
+ *
+ * RTE_FLOW_ACTION_TYPE_AGE
+ *
+ * Report flow as aged-out if timeout passed without any matching
+ * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
+ * port detects new aged-out flows.
+ *
+ * The flow context and the flow handle will be reported by the
+ * rte_flow_get_aged_flows API.
+ */
+struct rte_flow_action_age {
+	uint32_t timeout:24; /**< Time in seconds. */
+	uint32_t reserved:8; /**< Reserved, must be zero. */
+	void *context;
+		/**< The user flow context, NULL means the rte_flow pointer. */
+};
 
 /**
  * @warning
@@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
 	      const void *src,
 	      struct rte_flow_error *error);
 
+/**
+ * Get aged-out flows of a given port.
+ *
+ * RTE_ETH_EVENT_FLOW_AGED event will be triggered when at least one new
+ * aged-out flow was detected after the last call to rte_flow_get_aged_flows.
+ * This function can be called to get the aged flows asynchronously from the
+ * event callback or synchronously regardless of the event.
+ * It is not safe to call rte_flow_get_aged_flows concurrently with other flow
+ * functions from multiple threads.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in, out] contexts
+ *   The address of an array of pointers to the aged-out flow contexts.
+ * @param[in] nb_contexts
+ *   The length of the contexts array.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   If nb_contexts is 0, the total number of aged contexts.
+ *   If nb_contexts is not 0, the number of aged flows reported
+ *   in the contexts array. A negative errno value on error.
+ *
+ * @see rte_flow_action_age
+ * @see RTE_ETH_EVENT_FLOW_AGED
+ */
+__rte_experimental
+int
+rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_flow_driver.h b/lib/librte_ethdev/rte_flow_driver.h
index 51a9a57a0f..881cc469b7 100644
--- a/lib/librte_ethdev/rte_flow_driver.h
+++ b/lib/librte_ethdev/rte_flow_driver.h
@@ -101,6 +101,12 @@ struct rte_flow_ops {
 		(struct rte_eth_dev *dev,
 		 FILE *file,
 		 struct rte_flow_error *error);
+	/** See rte_flow_get_aged_flows() */
+	int (*get_aged_flows)
+		(struct rte_eth_dev *dev,
+		 void **context,
+		 uint32_t nb_contexts,
+		 struct rte_flow_error *err);
 };
 
 /**
-- 
2.21.0



* Re: [dpdk-dev] [PATCH v2] ethdev: support flow aging
  2020-04-21 10:04                 ` Ferruh Yigit
  2020-04-21 10:09                   ` Thomas Monjalon
@ 2020-04-21 15:59                   ` Andrew Rybchenko
  1 sibling, 0 replies; 50+ messages in thread
From: Andrew Rybchenko @ 2020-04-21 15:59 UTC (permalink / raw)
  To: Ferruh Yigit, Thomas Monjalon, Bill Zhou
  Cc: Ori Kam, Matan Azrad, wenzhuo.lu, jingjing.wu, bernard.iremonger,
	john.mcnamara, marko.kovacevic, dev

On 4/21/20 1:04 PM, Ferruh Yigit wrote:
> On 4/20/2020 5:10 PM, Thomas Monjalon wrote:
>> 20/04/2020 16:06, Ferruh Yigit:
>>> On 4/18/2020 10:44 AM, Thomas Monjalon wrote:
>>>> 18/04/2020 07:04, Bill Zhou:
>>>>> From: Ferruh Yigit <ferruh.yigit@intel.com>
>>>>>> On 4/14/2020 9:32 AM, Dong Zhou wrote:
>>>>>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>>>>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>>>>>> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>>>>>>>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>>>>>>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>>>>>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>>>>>> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
>>>>>> */
>>>>>>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>>>>>>  };
>>>>>>
>>>>>>
>>>>>> Just recognized that this is failing in ABI check [1], as far as last time for a
>>>>>> similar enum warning a QAT patch has been dropped, should this need to
>>>>>> wait for
>>>>>> 20.11 too?
>>>>>
>>>>> This patch is commonly used for flow aging, there are 2 other patches have 
>>>>> implement flow aging in mlx5 driver reply to this patch.
>> [...]
>>>> These MAX values in enums are a pain.
>>>> We can try to think what can be done, waiting 20.11.
>>>> Not sure there is a solution, except hijacking an existing value
>>>> not used in the PMD, waiting the definitive value in 20.11...
>>>
>>> Dropping from the tree as of now, to not cause more merge conflicts, we can add
>>> it later when issue is resolved.
>>
>> Thanks for dropping, that's the right thing to do
>> when a patch is breaking ABI check.
>>
>> After some thoughts, I think it is acceptable to make a v3
>> which ignore this specific enum change. I explain my thought below:
>>
>> An enum can accept a new value at 2 conditions:
>> 	- added as last value (not changing old values)
>> 	- new value not used by existing API
>>
>> The value RTE_ETH_EVENT_FLOW_AGED meet the above 2 conditions:
>> 	- only RTE_ETH_EVENT_MAX is changed, which is consistent
>> 	- new value sent to the app only if the app registered for it
>>
> 
> Same here, as far as I can see it is safe to get this change.
> 
> If any DPDK API returns this enum, either as return of the API or as output
> parameter, this still can be problem, because application may use that returned
> value, this was the concern in the QAT sample.
> 
> But here application registers an event and DPDK library process callback for
> it, so application callbacks won't be called for anything that application
> doesn't already know about, in that respect this should be safe for old
> applications.
> 
> Not sure if we can generalize above two conditions for all enum changes, but we
> can investigate them case by case as we get the warnings.
> 
>> So, except if I miss something, I suggest we add this exception:
>> Allow new value in rte_eth_event_type if added just before RTE_ETH_EVENT_MAX.
>> In other words, allow changing the value of RTE_ETH_EVENT_MAX.
>> The file to add such exception is devtools/libabigail.abignore.
>>
> 
> OK to exception.

Me too


* Re: [dpdk-dev] [PATCH v4] ethdev: support flow aging
  2020-04-21 10:11         ` [dpdk-dev] [PATCH v4] " Bill Zhou
@ 2020-04-21 17:13           ` Ferruh Yigit
  2020-04-29 14:50           ` Tom Barbette
  1 sibling, 0 replies; 50+ messages in thread
From: Ferruh Yigit @ 2020-04-21 17:13 UTC (permalink / raw)
  To: Bill Zhou, orika, matan, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, thomas,
	arybchenko
  Cc: dev

On 4/21/2020 11:11 AM, Bill Zhou wrote:
> From: Dong Zhou <dongz@mellanox.com>
> 
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is not any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
> 
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>   the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>   that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>   contexts from the port.
> - Support input flow aging command line in Testpmd.
> 
> The new event type addition in the enum is flagged as an ABI breakage, so
> an ignore rule is added for these reasons:
> - It is not changing value of existing types (except MAX)
> - The new value is not used by existing API if the event is not registered
> In general, it is safe adding new ethdev event types at the end of the
> enum, because of event callback registration mechanism.
> 
> Signed-off-by: Dong Zhou <dongz@mellanox.com>

Carrying ack from prev versions:
    Acked-by: Ori Kam <orika@mellanox.com>
    Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
    Acked-by: Jerin Jacob <jerinj@marvell.com>
    Acked-by: Matan Azrad <matan@mellanox.com>

Applied to dpdk-next-net/master, thanks.


* [dpdk-dev] [PATCH v2 0/2] net/mlx5: support flow aging
  2020-04-13 14:53   ` [dpdk-dev] [PATCH 0/2] " Dong Zhou
  2020-04-13 14:53     ` [dpdk-dev] [PATCH 1/2] net/mlx5: modify ext-counter memory allocation Dong Zhou
  2020-04-13 14:53     ` [dpdk-dev] [PATCH 2/2] net/mlx5: support flow aging Dong Zhou
@ 2020-04-24 10:45     ` Bill Zhou
  2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
                         ` (2 more replies)
  2 siblings, 3 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-24 10:45 UTC (permalink / raw)
  To: matan, shahafs, viacheslavo, marko.kovacevic, john.mcnamara, orika; +Cc: dev

These patches implement flow aging for the mlx5 driver. The first patch modifies
the current allocation of additional counter memory, so that the additional
memory of every counter is easy to locate by offsetting. The second patch
implements the aging check and the age-out event callback mechanism for the
mlx5 driver.


Bill Zhou (2):
  net/mlx5: modify ext-counter memory allocation
  net/mlx5: support flow aging

 doc/guides/rel_notes/release_20_05.rst |   1 +
 drivers/net/mlx5/mlx5.c                |  86 +++---
 drivers/net/mlx5/mlx5.h                |  63 ++++-
 drivers/net/mlx5/mlx5_flow.c           | 201 ++++++++++++--
 drivers/net/mlx5/mlx5_flow.h           |  16 +-
 drivers/net/mlx5/mlx5_flow_dv.c        | 370 +++++++++++++++++++++----
 drivers/net/mlx5/mlx5_flow_verbs.c     |  16 +-
 7 files changed, 626 insertions(+), 127 deletions(-)

-- 
2.21.0



* [dpdk-dev] [PATCH v2 1/2] net/mlx5: modify ext-counter memory allocation
  2020-04-24 10:45     ` [dpdk-dev] [PATCH v2 0/2] " Bill Zhou
@ 2020-04-24 10:45       ` Bill Zhou
  2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: support flow aging Bill Zhou
  2020-04-29  2:25       ` [dpdk-dev] [PATCH v3 0/2] " Bill Zhou
  2 siblings, 0 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-24 10:45 UTC (permalink / raw)
  To: matan, shahafs, viacheslavo, marko.kovacevic, john.mcnamara, orika; +Cc: dev

Currently, a counter pool for non-batch counters needs memory for 512
ext-counters. That memory is allocated as one separate block, placed
behind the memory of the 512 basic counters. With this layout it is not
easy to get the ext-counter pointer from the corresponding basic-counter
pointer, and it is also not easy to add other potential types of
per-counter memory.

So, allocate each ext-counter together with its basic counter as a
single piece of memory. Any further additional type of counter memory
will be handled the same way. One piece of memory then contains all the
memory types for one counter, and each type is easy to reach by offset.
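
The layout described above can be sketched as follows. This is only an
illustration under stated assumptions: the demo_* types and helpers are
hypothetical stand-ins, not the mlx5 structures or macros, and the pool
header is reduced to two fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures (not the mlx5 types). */
struct demo_cnt { uint64_t hits; uint64_t bytes; };
struct demo_cnt_ext { uint32_t ref_cnt; uint32_t id; };
struct demo_pool { uint32_t has_ext; uint32_t n; };

/* Stride of one counter slot: basic part, plus the ext part if present. */
static size_t
demo_slot_len(const struct demo_pool *pool)
{
	return sizeof(struct demo_cnt) +
	       (pool->has_ext ? sizeof(struct demo_cnt_ext) : 0);
}

/* The slots start right behind the pool header. */
static struct demo_cnt *
demo_get_cnt(struct demo_pool *pool, uint32_t idx)
{
	assert(idx < pool->n);
	return (struct demo_cnt *)
		((char *)(pool + 1) + idx * demo_slot_len(pool));
}

/* The ext part sits directly behind its basic counter. */
static struct demo_cnt_ext *
demo_cnt_to_ext(struct demo_cnt *cnt)
{
	return (struct demo_cnt_ext *)(cnt + 1);
}

/* Recover the slot index from a counter pointer. */
static uint32_t
demo_cnt_idx(struct demo_pool *pool, struct demo_cnt *cnt)
{
	return (uint32_t)(((char *)cnt - (char *)(pool + 1)) /
			  demo_slot_len(pool));
}

/* Allocate the header and all slots as one zeroed piece of memory. */
static struct demo_pool *
demo_pool_create(uint32_t n, int has_ext)
{
	struct demo_pool tmp = { .has_ext = (uint32_t)has_ext, .n = n };
	struct demo_pool *pool;

	pool = calloc(1, sizeof(*pool) + n * demo_slot_len(&tmp));
	if (pool)
		*pool = tmp;
	return pool;
}
```

Because the ext part follows its basic counter directly, the conversion
needs no pool pointer and no index arithmetic, and a further per-counter
type only changes the slot stride.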

Signed-off-by: Bill Zhou <dongz@mellanox.com>
---
v2: Update some comments for the newly added fields.
---
 drivers/net/mlx5/mlx5.c            |  4 ++--
 drivers/net/mlx5/mlx5.h            | 22 ++++++++++++++++------
 drivers/net/mlx5/mlx5_flow_dv.c    | 27 +++++++++++++++------------
 drivers/net/mlx5/mlx5_flow_verbs.c | 16 ++++++++--------
 4 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index cc13e447d6..57d76cb741 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -505,10 +505,10 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 					(mlx5_devx_cmd_destroy(pool->min_dcs));
 			}
 			for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
-				if (pool->counters_raw[j].action)
+				if (MLX5_POOL_GET_CNT(pool, j)->action)
 					claim_zero
 					(mlx5_glue->destroy_flow_action
-					       (pool->counters_raw[j].action));
+					 (MLX5_POOL_GET_CNT(pool, j)->action));
 				if (!batch && MLX5_GET_POOL_CNT_EXT
 				    (pool, j)->dcs)
 					claim_zero(mlx5_devx_cmd_destroy
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 50349abf34..51c3f33e6b 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -222,6 +222,18 @@ struct mlx5_drop {
 #define MLX5_COUNTERS_PER_POOL 512
 #define MLX5_MAX_PENDING_QUERIES 4
 #define MLX5_CNT_CONTAINER_RESIZE 64
+#define CNT_SIZE (sizeof(struct mlx5_flow_counter))
+#define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
+
+#define CNT_POOL_TYPE_EXT	(1 << 0)
+#define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
+#define MLX5_CNT_LEN(pool) \
+	(CNT_SIZE + (IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
+#define MLX5_POOL_GET_CNT(pool, index) \
+	((struct mlx5_flow_counter *) \
+	((char *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
+#define MLX5_CNT_ARRAY_IDX(pool, cnt) \
+	((int)(((char *)(cnt) - (char *)((pool) + 1)) / MLX5_CNT_LEN(pool)))
 /*
  * The pool index and offset of counter in the pool array makes up the
  * counter index. In case the counter is from pool 0 and offset 0, it
@@ -230,11 +242,10 @@ struct mlx5_drop {
  */
 #define MLX5_MAKE_CNT_IDX(pi, offset) \
 	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
-#define MLX5_CNT_TO_CNT_EXT(pool, cnt) (&((struct mlx5_flow_counter_ext *) \
-			    ((pool) + 1))[((cnt) - (pool)->counters_raw)])
+#define MLX5_CNT_TO_CNT_EXT(cnt) \
+	((struct mlx5_flow_counter_ext *)((cnt) + 1))
 #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
-			      (&((struct mlx5_flow_counter_ext *) \
-			      ((pool) + 1))[offset])
+	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
 
 struct mlx5_flow_counter_pool;
 
@@ -287,11 +298,10 @@ struct mlx5_flow_counter_pool {
 	rte_atomic64_t start_query_gen; /* Query start round. */
 	rte_atomic64_t end_query_gen; /* Query end round. */
 	uint32_t index; /* Pool index in container. */
+	uint32_t type: 2; /* Memory type behind the counter array. */
 	rte_spinlock_t sl; /* The pool lock. */
 	struct mlx5_counter_stats_raw *raw;
 	struct mlx5_counter_stats_raw *raw_hw; /* The raw on HW working. */
-	struct mlx5_flow_counter counters_raw[MLX5_COUNTERS_PER_POOL];
-	/* The pool counters memory. */
 };
 
 struct mlx5_counter_stats_raw;
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 6263ecc731..784a62c521 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -3909,7 +3909,7 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
 	MLX5_ASSERT(pool);
 	if (ppool)
 		*ppool = pool;
-	return &pool->counters_raw[idx % MLX5_COUNTERS_PER_POOL];
+	return MLX5_POOL_GET_CNT(pool, idx % MLX5_COUNTERS_PER_POOL);
 }
 
 /**
@@ -4117,7 +4117,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 		if (priv->counter_fallback)
 			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
 					0, pkts, bytes, 0, NULL, NULL, 0);
@@ -4133,7 +4133,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 		*pkts = 0;
 		*bytes = 0;
 	} else {
-		offset = cnt - &pool->counters_raw[0];
+		offset = MLX5_CNT_ARRAY_IDX(pool, cnt);
 		*pkts = rte_be_to_cpu_64(pool->raw->data[offset].hits);
 		*bytes = rte_be_to_cpu_64(pool->raw->data[offset].bytes);
 	}
@@ -4173,9 +4173,9 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 			return NULL;
 	}
 	size = sizeof(*pool);
+	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
 	if (!batch)
-		size += MLX5_COUNTERS_PER_POOL *
-			sizeof(struct mlx5_flow_counter_ext);
+		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
 	pool = rte_calloc(__func__, 1, size, 0);
 	if (!pool) {
 		rte_errno = ENOMEM;
@@ -4186,6 +4186,9 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 		pool->raw = cont->init_mem_mng->raws + n_valid %
 						     MLX5_CNT_CONTAINER_RESIZE;
 	pool->raw_hw = NULL;
+	pool->type = 0;
+	if (!batch)
+		pool->type |= CNT_POOL_TYPE_EXT;
 	rte_spinlock_init(&pool->sl);
 	/*
 	 * The generation of the new allocated counters in this pool is 0, 2 in
@@ -4257,7 +4260,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 					 (int64_t)(uintptr_t)dcs);
 		}
 		i = dcs->id % MLX5_COUNTERS_PER_POOL;
-		cnt = &pool->counters_raw[i];
+		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 		MLX5_GET_POOL_CNT_EXT(pool, i)->dcs = dcs;
 		*cnt_free = cnt;
@@ -4277,10 +4280,10 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 	}
 	pool = TAILQ_FIRST(&cont->pool_list);
 	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
-		cnt = &pool->counters_raw[i];
+		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 	}
-	*cnt_free = &pool->counters_raw[0];
+	*cnt_free = MLX5_POOL_GET_CNT(pool, 0);
 	return cont;
 }
 
@@ -4398,14 +4401,14 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		pool = TAILQ_FIRST(&cont->pool_list);
 	}
 	if (!batch)
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
 	/* Create a DV counter action only in the first time usage. */
 	if (!cnt_free->action) {
 		uint16_t offset;
 		struct mlx5_devx_obj *dcs;
 
 		if (batch) {
-			offset = cnt_free - &pool->counters_raw[0];
+			offset = MLX5_CNT_ARRAY_IDX(pool, cnt_free);
 			dcs = pool->min_dcs;
 		} else {
 			offset = 0;
@@ -4419,7 +4422,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		}
 	}
 	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
-				    (cnt_free - pool->counters_raw));
+				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
 	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
 	/* Update the counter reset values. */
 	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
@@ -4462,7 +4465,7 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 		if (cnt_ext && --cnt_ext->ref_cnt)
 			return;
 	}
diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
index d20098ce45..236d665852 100644
--- a/drivers/net/mlx5/mlx5_flow_verbs.c
+++ b/drivers/net/mlx5/mlx5_flow_verbs.c
@@ -64,7 +64,7 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
 	MLX5_ASSERT(pool);
 	if (ppool)
 		*ppool = pool;
-	return &pool->counters_raw[idx % MLX5_COUNTERS_PER_POOL];
+	return MLX5_POOL_GET_CNT(pool, idx % MLX5_COUNTERS_PER_POOL);
 }
 
 /**
@@ -207,16 +207,16 @@ flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
 		if (!pool)
 			return 0;
 		for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
-			cnt = &pool->counters_raw[i];
+			cnt = MLX5_POOL_GET_CNT(pool, i);
 			TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 		}
-		cnt = &pool->counters_raw[0];
+		cnt = MLX5_POOL_GET_CNT(pool, 0);
 		cont->pools[n_valid] = pool;
 		pool_idx = n_valid;
 		rte_atomic16_add(&cont->n_valid, 1);
 		TAILQ_INSERT_HEAD(&cont->pool_list, pool, next);
 	}
-	i = cnt - pool->counters_raw;
+	i = MLX5_CNT_ARRAY_IDX(pool, cnt);
 	cnt_ext = MLX5_GET_POOL_CNT_EXT(pool, i);
 	cnt_ext->id = id;
 	cnt_ext->shared = shared;
@@ -251,7 +251,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 
 	cnt = flow_verbs_counter_get_by_idx(dev, counter,
 					    &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	if (--cnt_ext->ref_cnt == 0) {
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
@@ -282,7 +282,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
 		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
 						(dev, flow->counter, &pool);
 		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
-							(pool, cnt);
+						(cnt);
 		struct rte_flow_query_count *qc = data;
 		uint64_t counters[2] = {0, 0};
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
@@ -1083,12 +1083,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
 	}
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	counter.counter_set_handle = cnt_ext->cs->handle;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	counter.counters = cnt_ext->cs;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #endif
-- 
2.21.0


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [dpdk-dev] [PATCH v2 2/2] net/mlx5: support flow aging
  2020-04-24 10:45     ` [dpdk-dev] [PATCH v2 0/2] " Bill Zhou
  2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
@ 2020-04-24 10:45       ` Bill Zhou
  2020-04-26  7:07         ` Suanming Mou
  2020-04-29  2:25       ` [dpdk-dev] [PATCH v3 0/2] " Bill Zhou
  2 siblings, 1 reply; 50+ messages in thread
From: Bill Zhou @ 2020-04-24 10:45 UTC (permalink / raw)
  To: matan, shahafs, viacheslavo, marko.kovacevic, john.mcnamara, orika; +Cc: dev

Currently, the mlx5 driver has no flow aging check and no age-out event
callback mechanism. This patch implements them. It includes:
- Splitting the current counter containers into aged and non-aged
  containers to reduce memory consumption. An aged container allocates
  extra memory to save the aging parameters from the user configuration.
- An aging check and age-out event callback mechanism based on the
  current counters. When a flow is found to be aged out, the
  RTE_ETH_EVENT_FLOW_AGED event is triggered to the application.
- An implementation of the new API rte_flow_get_aged_flows, which
  applications can use to get the aged flows.
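
The counter-based aging check in the second item can be sketched as
below. This is a simplified illustration, not the driver code: the
demo_* names are hypothetical, the hit count is cached in the age record
instead of being compared between two raw statistics buffers, and no
locking or atomics are shown. Time is assumed to be a free-running
16-bit counter in units of 0.1 sec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_age_state { DEMO_AGE_FREE, DEMO_AGE_CANDIDATE, DEMO_AGE_TMOUT };

struct demo_age {
	enum demo_age_state state;
	uint16_t timeout;   /* Age timeout in ticks of 0.1 sec. */
	uint16_t expire;    /* Tick at which the flow ages out. */
	uint64_t last_hits; /* Hit count seen by the previous check. */
};

/*
 * Run one aging pass for one counter. `now` is the current tick and
 * `hits` is the latest hit count read from the hardware counter.
 * Returns true only when the counter has just aged out.
 */
static bool
demo_age_check(struct demo_age *age, uint16_t now, uint64_t hits)
{
	assert(age != NULL);
	if (age->state != DEMO_AGE_CANDIDATE)
		return false;
	if (hits != age->last_hits) {
		/* Traffic was seen: push the deadline forward. */
		age->last_hits = hits;
		age->expire = (uint16_t)(now + age->timeout);
		return false;
	}
	/*
	 * Wraparound-safe test for "now is still before expire": the
	 * 16-bit difference then lands in the upper half of the range.
	 */
	if ((uint16_t)(now - age->expire) >= UINT16_MAX / 2)
		return false;
	age->state = DEMO_AGE_TMOUT;
	return true;
}
```

The modular comparison keeps working when the tick counter wraps, as
long as the timeout stays below half of the 16-bit range.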

Signed-off-by: Bill Zhou <dongz@mellanox.com>
---
v2: Move the aging list from struct mlx5_ibv_shared to struct mlx5_priv,
so that each port has its own aging list. Update the event so that it is
triggered only once after the last call of rte_flow_get_aged_flows.
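
The "once after the last call" behaviour can be pictured as a one-shot
flag per port. The sketch below is an assumption-laden illustration: the
demo_* names are hypothetical, C11 atomics replace the rte_atomic16
helpers, and it assumes that fetching the aged flows is what re-arms the
event.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_port {
	atomic_bool trigger_event;  /* true: next age-out raises the event. */
	unsigned int events_raised; /* Stand-in for the event callback. */
};

static void
demo_port_init(struct demo_port *port)
{
	assert(port != NULL);
	atomic_init(&port->trigger_event, true);
	port->events_raised = 0;
}

/* Driver side: called when new aged flows were found on the port. */
static bool
demo_report_aged(struct demo_port *port)
{
	/* The exchange returns the old value, so only one caller wins. */
	if (!atomic_exchange(&port->trigger_event, false))
		return false;
	port->events_raised++;
	return true;
}

/* Application side: fetching the aged flows re-arms the event. */
static void
demo_get_aged_flows(struct demo_port *port)
{
	atomic_store(&port->trigger_event, true);
}
```

With this scheme, further age-outs that happen before the application
fetches the list do not raise duplicate events.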
---
 doc/guides/rel_notes/release_20_05.rst |   1 +
 drivers/net/mlx5/mlx5.c                |  86 +++---
 drivers/net/mlx5/mlx5.h                |  49 +++-
 drivers/net/mlx5/mlx5_flow.c           | 201 ++++++++++++--
 drivers/net/mlx5/mlx5_flow.h           |  16 +-
 drivers/net/mlx5/mlx5_flow_dv.c        | 361 +++++++++++++++++++++----
 drivers/net/mlx5/mlx5_flow_verbs.c     |  14 +-
 7 files changed, 607 insertions(+), 121 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index b124c3f287..a5ba8a4792 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -141,6 +141,7 @@ New Features
   * Added support for creating Relaxed Ordering Memory Regions.
   * Added support for jumbo frame size (9K MTU) in Multi-Packet RQ mode.
   * Optimized the memory consumption of flow.
+  * Added support for flow aging based on hardware counter.
 
 * **Updated the AESNI MB crypto PMD.**
 
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 57d76cb741..674d0ea9d3 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -437,6 +437,20 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
 	return 0;
 }
 
+/**
+ * Initialize the private aging list information.
+ *
+ * @param[in] priv
+ *   Pointer to the private device data structure.
+ */
+static void
+mlx5_flow_aging_list_init(struct mlx5_priv *priv)
+{
+	TAILQ_INIT(&priv->aged_counters);
+	rte_spinlock_init(&priv->aged_sl);
+	rte_atomic16_set(&priv->trigger_event, 1);
+}
+
 /**
  * Initialize the counters management structure.
  *
@@ -446,11 +460,14 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
 static void
 mlx5_flow_counters_mng_init(struct mlx5_ibv_shared *sh)
 {
-	uint8_t i;
+	uint8_t i, age;
 
+	sh->cmng.age = 0;
 	TAILQ_INIT(&sh->cmng.flow_counters);
-	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
-		TAILQ_INIT(&sh->cmng.ccont[i].pool_list);
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
+			TAILQ_INIT(&sh->cmng.ccont[i][age].pool_list);
+	}
 }
 
 /**
@@ -480,7 +497,7 @@ static void
 mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 {
 	struct mlx5_counter_stats_mem_mng *mng;
-	uint8_t i;
+	uint8_t i, age = 0;
 	int j;
 	int retries = 1024;
 
@@ -491,36 +508,42 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 			break;
 		rte_pause();
 	}
-	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
-		struct mlx5_flow_counter_pool *pool;
-		uint32_t batch = !!(i % 2);
 
-		if (!sh->cmng.ccont[i].pools)
-			continue;
-		pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
-		while (pool) {
-			if (batch) {
-				if (pool->min_dcs)
-					claim_zero
-					(mlx5_devx_cmd_destroy(pool->min_dcs));
-			}
-			for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
-				if (MLX5_POOL_GET_CNT(pool, j)->action)
-					claim_zero
-					(mlx5_glue->destroy_flow_action
-					 (MLX5_POOL_GET_CNT(pool, j)->action));
-				if (!batch && MLX5_GET_POOL_CNT_EXT
-				    (pool, j)->dcs)
-					claim_zero(mlx5_devx_cmd_destroy
-						  (MLX5_GET_POOL_CNT_EXT
-						  (pool, j)->dcs));
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
+			struct mlx5_flow_counter_pool *pool;
+			uint32_t batch = !!(i % 2);
+
+			if (!sh->cmng.ccont[i][age].pools)
+				continue;
+			pool = TAILQ_FIRST(&sh->cmng.ccont[i][age].pool_list);
+			while (pool) {
+				if (batch) {
+					if (pool->min_dcs)
+						claim_zero
+						(mlx5_devx_cmd_destroy
+						(pool->min_dcs));
+				}
+				for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
+					if (MLX5_POOL_GET_CNT(pool, j)->action)
+						claim_zero
+						(mlx5_glue->destroy_flow_action
+						 (MLX5_POOL_GET_CNT
+						  (pool, j)->action));
+					if (!batch && MLX5_GET_POOL_CNT_EXT
+					    (pool, j)->dcs)
+						claim_zero(mlx5_devx_cmd_destroy
+							  (MLX5_GET_POOL_CNT_EXT
+							  (pool, j)->dcs));
+				}
+				TAILQ_REMOVE(&sh->cmng.ccont[i][age].pool_list,
+					pool, next);
+				rte_free(pool);
+				pool = TAILQ_FIRST
+					(&sh->cmng.ccont[i][age].pool_list);
 			}
-			TAILQ_REMOVE(&sh->cmng.ccont[i].pool_list, pool,
-				     next);
-			rte_free(pool);
-			pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
+			rte_free(sh->cmng.ccont[i][age].pools);
 		}
-		rte_free(sh->cmng.ccont[i].pools);
 	}
 	mng = LIST_FIRST(&sh->cmng.mem_mngs);
 	while (mng) {
@@ -3003,6 +3026,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 			goto error;
 		}
 	}
+	mlx5_flow_aging_list_init(priv);
 	return eth_dev;
 error:
 	if (priv) {
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 51c3f33e6b..d1b358e929 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -222,13 +222,21 @@ struct mlx5_drop {
 #define MLX5_COUNTERS_PER_POOL 512
 #define MLX5_MAX_PENDING_QUERIES 4
 #define MLX5_CNT_CONTAINER_RESIZE 64
+#define MLX5_CNT_AGE_OFFSET 0x80000000
 #define CNT_SIZE (sizeof(struct mlx5_flow_counter))
 #define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
+#define AGE_SIZE (sizeof(struct mlx5_age_param))
 
 #define CNT_POOL_TYPE_EXT	(1 << 0)
+#define CNT_POOL_TYPE_AGE	(1 << 1)
 #define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
+#define IS_AGE_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_AGE)
+#define MLX_CNT_IS_AGE(counter) ((counter) & MLX5_CNT_AGE_OFFSET ? 1 : 0)
+
 #define MLX5_CNT_LEN(pool) \
-	(CNT_SIZE + (IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
+	(CNT_SIZE + \
+	(IS_AGE_POOL(pool) ? AGE_SIZE : 0) + \
+	(IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
 #define MLX5_POOL_GET_CNT(pool, index) \
 	((struct mlx5_flow_counter *) \
 	((char *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
@@ -242,13 +250,33 @@ struct mlx5_drop {
  */
 #define MLX5_MAKE_CNT_IDX(pi, offset) \
 	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
-#define MLX5_CNT_TO_CNT_EXT(cnt) \
-	((struct mlx5_flow_counter_ext *)((cnt) + 1))
+#define MLX5_CNT_TO_CNT_EXT(pool, cnt) \
+	((struct mlx5_flow_counter_ext *)\
+	((char *)((cnt) + 1) + \
+	(IS_AGE_POOL(pool) ? AGE_SIZE : 0)))
 #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
-	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
+	MLX5_CNT_TO_CNT_EXT(pool, MLX5_POOL_GET_CNT((pool), (offset)))
+#define MLX5_CNT_TO_AGE(cnt) \
+	((struct mlx5_age_param *)((cnt) + 1))
 
 struct mlx5_flow_counter_pool;
 
+/* Age status. */
+enum {
+	AGE_FREE,
+	AGE_CANDIDATE, /* Counter assigned to flows. */
+	AGE_TMOUT, /* Timeout, wait for aged flows query and destroy. */
+};
+
+/* Counter age parameter. */
+struct mlx5_age_param {
+	rte_atomic16_t state; /**< Age state. */
+	uint16_t port_id; /**< Port id of the counter. */
+	uint32_t timeout:15; /**< Age timeout in unit of 0.1sec. */
+	uint32_t expire:16; /**< Expire time(0.1sec) in the future. */
+	void *context; /**< Flow counter age context. */
+};
+
 struct flow_counter_stats {
 	uint64_t hits;
 	uint64_t bytes;
@@ -336,13 +364,14 @@ struct mlx5_pools_container {
 
 /* Counter global management structure. */
 struct mlx5_flow_counter_mng {
-	uint8_t mhi[2]; /* master \ host container index. */
-	struct mlx5_pools_container ccont[2 * 2];
-	/* 2 containers for single and for batch for double-buffer. */
+	uint8_t mhi[2][2]; /* master \ host and age \ no age container index. */
+	struct mlx5_pools_container ccont[2 * 2][2];
+	/* master \ host and age \ no age pools container. */
 	struct mlx5_counters flow_counters; /* Legacy flow counter list. */
 	uint8_t pending_queries;
 	uint8_t batch;
 	uint16_t pool_index;
+	uint8_t age;
 	uint8_t query_thread_on;
 	LIST_HEAD(mem_mngs, mlx5_counter_stats_mem_mng) mem_mngs;
 	LIST_HEAD(stat_raws, mlx5_counter_stats_raw) free_stat_raws;
@@ -566,6 +595,10 @@ struct mlx5_priv {
 	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
 	struct mlx5_mp_id mp_id; /* ID of a multi-process process */
 	LIST_HEAD(fdir, mlx5_fdir_flow) fdir_flows; /* fdir flows. */
+	struct mlx5_counters aged_counters; /* Aged flow counter list. */
+	rte_spinlock_t aged_sl; /* Aged flow counter list lock. */
+	rte_atomic16_t trigger_event;
+	/* Event triggers once after last rte_flow_get_aged_flows call. */
 };
 
 #define PORT_ID(priv) ((priv)->dev_data->port_id)
@@ -764,6 +797,8 @@ int mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
 void mlx5_flow_rxq_dynf_metadata_set(struct rte_eth_dev *dev);
+int mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
 
 /* mlx5_mp.c */
 int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index cba1f23e81..c691b43722 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -24,6 +24,7 @@
 #include <rte_ether.h>
 #include <rte_ethdev_driver.h>
 #include <rte_flow.h>
+#include <rte_cycles.h>
 #include <rte_flow_driver.h>
 #include <rte_malloc.h>
 #include <rte_ip.h>
@@ -242,6 +243,7 @@ static const struct rte_flow_ops mlx5_flow_ops = {
 	.isolate = mlx5_flow_isolate,
 	.query = mlx5_flow_query,
 	.dev_dump = mlx5_flow_dev_dump,
+	.get_aged_flows = mlx5_flow_get_aged_flows,
 };
 
 /* Convert FDIR request to Generic flow. */
@@ -2531,6 +2533,8 @@ flow_drv_validate(struct rte_eth_dev *dev,
  *   Pointer to the list of items.
  * @param[in] actions
  *   Pointer to the list of actions.
+ * @param[in] flow_idx
+ *   Memory pool index of the flow.
  * @param[out] error
  *   Pointer to the error structure.
  *
@@ -2543,14 +2547,19 @@ flow_drv_prepare(struct rte_eth_dev *dev,
 		 const struct rte_flow_attr *attr,
 		 const struct rte_flow_item items[],
 		 const struct rte_flow_action actions[],
+		 uint32_t flow_idx,
 		 struct rte_flow_error *error)
 {
 	const struct mlx5_flow_driver_ops *fops;
 	enum mlx5_flow_drv_type type = flow->drv_type;
+	struct mlx5_flow *mlx5_flow = NULL;
 
 	MLX5_ASSERT(type > MLX5_FLOW_TYPE_MIN && type < MLX5_FLOW_TYPE_MAX);
 	fops = flow_get_drv_ops(type);
-	return fops->prepare(dev, attr, items, actions, error);
+	mlx5_flow = fops->prepare(dev, attr, items, actions, error);
+	if (mlx5_flow)
+		mlx5_flow->flow_idx = flow_idx;
+	return mlx5_flow;
 }
 
 /**
@@ -3498,6 +3507,8 @@ flow_hairpin_split(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   Memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -3511,11 +3522,13 @@ flow_create_split_inner(struct rte_eth_dev *dev,
 			const struct rte_flow_attr *attr,
 			const struct rte_flow_item items[],
 			const struct rte_flow_action actions[],
-			bool external, struct rte_flow_error *error)
+			bool external, uint32_t flow_idx,
+			struct rte_flow_error *error)
 {
 	struct mlx5_flow *dev_flow;
 
-	dev_flow = flow_drv_prepare(dev, flow, attr, items, actions, error);
+	dev_flow = flow_drv_prepare(dev, flow, attr, items, actions,
+		flow_idx, error);
 	if (!dev_flow)
 		return -rte_errno;
 	dev_flow->flow = flow;
@@ -3876,6 +3889,8 @@ flow_mreg_tx_copy_prep(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   Memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -3888,7 +3903,8 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 			   const struct rte_flow_attr *attr,
 			   const struct rte_flow_item items[],
 			   const struct rte_flow_action actions[],
-			   bool external, struct rte_flow_error *error)
+			   bool external, uint32_t flow_idx,
+			   struct rte_flow_error *error)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_dev_config *config = &priv->config;
@@ -3908,7 +3924,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 	    !mlx5_flow_ext_mreg_supported(dev))
 		return flow_create_split_inner(dev, flow, NULL, prefix_layers,
 					       attr, items, actions, external,
-					       error);
+					       flow_idx, error);
 	actions_n = flow_parse_metadata_split_actions_info(actions, &qrss,
 							   &encap_idx);
 	if (qrss) {
@@ -3992,7 +4008,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 	/* Add the unmodified original or prefix subflow. */
 	ret = flow_create_split_inner(dev, flow, &dev_flow, prefix_layers, attr,
 				      items, ext_actions ? ext_actions :
-				      actions, external, error);
+				      actions, external, flow_idx, error);
 	if (ret < 0)
 		goto exit;
 	MLX5_ASSERT(dev_flow);
@@ -4055,7 +4071,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 		ret = flow_create_split_inner(dev, flow, &dev_flow, layers,
 					      &q_attr, mtr_sfx ? items :
 					      q_items, q_actions,
-					      external, error);
+					      external, flow_idx, error);
 		if (ret < 0)
 			goto exit;
 		/* qrss ID should be freed if failed. */
@@ -4096,6 +4112,8 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   Memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -4107,7 +4125,8 @@ flow_create_split_meter(struct rte_eth_dev *dev,
 			   const struct rte_flow_attr *attr,
 			   const struct rte_flow_item items[],
 			   const struct rte_flow_action actions[],
-			   bool external, struct rte_flow_error *error)
+			   bool external, uint32_t flow_idx,
+			   struct rte_flow_error *error)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct rte_flow_action *sfx_actions = NULL;
@@ -4151,7 +4170,7 @@ flow_create_split_meter(struct rte_eth_dev *dev,
 		/* Add the prefix subflow. */
 		ret = flow_create_split_inner(dev, flow, &dev_flow, 0, attr,
 					      items, pre_actions, external,
-					      error);
+					      flow_idx, error);
 		if (ret) {
 			ret = -rte_errno;
 			goto exit;
@@ -4168,7 +4187,7 @@ flow_create_split_meter(struct rte_eth_dev *dev,
 					 0, &sfx_attr,
 					 sfx_items ? sfx_items : items,
 					 sfx_actions ? sfx_actions : actions,
-					 external, error);
+					 external, flow_idx, error);
 exit:
 	if (sfx_actions)
 		rte_free(sfx_actions);
@@ -4205,6 +4224,8 @@ flow_create_split_meter(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   Memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -4216,12 +4237,13 @@ flow_create_split_outer(struct rte_eth_dev *dev,
 			const struct rte_flow_attr *attr,
 			const struct rte_flow_item items[],
 			const struct rte_flow_action actions[],
-			bool external, struct rte_flow_error *error)
+			bool external, uint32_t flow_idx,
+			struct rte_flow_error *error)
 {
 	int ret;
 
 	ret = flow_create_split_meter(dev, flow, attr, items,
-					 actions, external, error);
+					 actions, external, flow_idx, error);
 	MLX5_ASSERT(ret <= 0);
 	return ret;
 }
@@ -4356,7 +4378,7 @@ flow_list_create(struct rte_eth_dev *dev, uint32_t *list,
 		 */
 		ret = flow_create_split_outer(dev, flow, attr,
 					      buf->entry[i].pattern,
-					      p_actions_rx, external,
+					      p_actions_rx, external, idx,
 					      error);
 		if (ret < 0)
 			goto error;
@@ -4367,7 +4389,8 @@ flow_list_create(struct rte_eth_dev *dev, uint32_t *list,
 		attr_tx.ingress = 0;
 		attr_tx.egress = 1;
 		dev_flow = flow_drv_prepare(dev, flow, &attr_tx, items_tx.items,
-					    actions_hairpin_tx.actions, error);
+					 actions_hairpin_tx.actions,
+					 idx, error);
 		if (!dev_flow)
 			goto error;
 		dev_flow->flow = flow;
@@ -5741,6 +5764,31 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 
 #define MLX5_POOL_QUERY_FREQ_US 1000000
 
+/**
+ * Get the number of all valid pools.
+ *
+ * @param[in] sh
+ *   Pointer to mlx5_ibv_shared object.
+ *
+ * @return
+ *   The number of all valid pools.
+ */
+static uint32_t
+mlx5_get_all_valid_pool_count(struct mlx5_ibv_shared *sh)
+{
+	uint8_t age, i;
+	uint32_t pools_n = 0;
+	struct mlx5_pools_container *cont;
+
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < 2 ; ++i) {
+			cont = MLX5_CNT_CONTAINER(sh, i, 0, age);
+			pools_n += rte_atomic16_read(&cont->n_valid);
+		}
+	}
+	return pools_n;
+}
+
 /**
  * Set the periodic procedure for triggering asynchronous batch queries for all
  * the counter pools.
@@ -5751,12 +5799,9 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 void
 mlx5_set_query_alarm(struct mlx5_ibv_shared *sh)
 {
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(sh, 0, 0);
-	uint32_t pools_n = rte_atomic16_read(&cont->n_valid);
-	uint32_t us;
+	uint32_t pools_n, us;
 
-	cont = MLX5_CNT_CONTAINER(sh, 1, 0);
-	pools_n += rte_atomic16_read(&cont->n_valid);
+	pools_n = mlx5_get_all_valid_pool_count(sh);
 	us = MLX5_POOL_QUERY_FREQ_US / pools_n;
 	DRV_LOG(DEBUG, "Set alarm for %u pools each %u us", pools_n, us);
 	if (rte_eal_alarm_set(us, mlx5_flow_query_alarm, sh)) {
@@ -5782,6 +5827,7 @@ mlx5_flow_query_alarm(void *arg)
 	uint16_t offset;
 	int ret;
 	uint8_t batch = sh->cmng.batch;
+	uint8_t age = sh->cmng.age;
 	uint16_t pool_index = sh->cmng.pool_index;
 	struct mlx5_pools_container *cont;
 	struct mlx5_pools_container *mcont;
@@ -5790,8 +5836,8 @@ mlx5_flow_query_alarm(void *arg)
 	if (sh->cmng.pending_queries >= MLX5_MAX_PENDING_QUERIES)
 		goto set_alarm;
 next_container:
-	cont = MLX5_CNT_CONTAINER(sh, batch, 1);
-	mcont = MLX5_CNT_CONTAINER(sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(sh, batch, 1, age);
+	mcont = MLX5_CNT_CONTAINER(sh, batch, 0, age);
 	/* Check if resize was done and need to flip a container. */
 	if (cont != mcont) {
 		if (cont->pools) {
@@ -5801,15 +5847,22 @@ mlx5_flow_query_alarm(void *arg)
 		}
 		rte_cio_wmb();
 		 /* Flip the host container. */
-		sh->cmng.mhi[batch] ^= (uint8_t)2;
+		sh->cmng.mhi[batch][age] ^= (uint8_t)2;
 		cont = mcont;
 	}
 	if (!cont->pools) {
 		/* 2 empty containers case is unexpected. */
-		if (unlikely(batch != sh->cmng.batch))
+		if (unlikely(batch != sh->cmng.batch) &&
+			unlikely(age != sh->cmng.age)) {
 			goto set_alarm;
+		}
 		batch ^= 0x1;
 		pool_index = 0;
+		if (batch == 0 && pool_index == 0) {
+			age ^= 0x1;
+			sh->cmng.batch = batch;
+			sh->cmng.age = age;
+		}
 		goto next_container;
 	}
 	pool = cont->pools[pool_index];
@@ -5852,13 +5905,76 @@ mlx5_flow_query_alarm(void *arg)
 	if (pool_index >= rte_atomic16_read(&cont->n_valid)) {
 		batch ^= 0x1;
 		pool_index = 0;
+		if (batch == 0 && pool_index == 0)
+			age ^= 0x1;
 	}
 set_alarm:
 	sh->cmng.batch = batch;
 	sh->cmng.pool_index = pool_index;
+	sh->cmng.age = age;
 	mlx5_set_query_alarm(sh);
 }
 
+/**
+ * Check the counter pool for newly aged flows and raise the event callback.
+ *
+ * @param[in] pool
+ *   Pointer to the current counter pool.
+ */
+static void
+mlx5_flow_aging_check(struct mlx5_flow_counter_pool *pool)
+{
+	struct mlx5_priv *priv;
+	struct mlx5_flow_counter *cnt;
+	struct mlx5_age_param *age_param;
+	struct mlx5_counter_stats_raw *cur = pool->raw_hw;
+	struct mlx5_counter_stats_raw *prev = pool->raw;
+	uint16_t curr = rte_rdtsc() / (rte_get_tsc_hz() / 10);
+	uint64_t port_mask = 0;
+	uint32_t i;
+
+	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
+		cnt = MLX5_POOL_GET_CNT(pool, i);
+		age_param = MLX5_CNT_TO_AGE(cnt);
+		if (rte_atomic16_read(&age_param->state) != AGE_CANDIDATE)
+			continue;
+		if (cur->data[i].hits != prev->data[i].hits) {
+			age_param->expire = curr + age_param->timeout;
+			continue;
+		}
+		if ((uint16_t)(curr - age_param->expire) >= (UINT16_MAX / 2))
+			continue;
+		/**
+		 * Hold the lock first. Otherwise, if a release
+		 * happens between setting the AGE_TMOUT state
+		 * and the tailq insertion, the release
+		 * procedure may delete a non-existent tailq node.
+		 */
+		priv = rte_eth_devices[age_param->port_id].data->dev_private;
+		rte_spinlock_lock(&priv->aged_sl);
+		/* If the cmpset fails, a release has happened. */
+		if (rte_atomic16_cmpset((volatile uint16_t *)
+					&age_param->state,
+					AGE_CANDIDATE,
+					AGE_TMOUT) ==
+					AGE_CANDIDATE) {
+			TAILQ_INSERT_TAIL(&priv->aged_counters, cnt, next);
+			port_mask |= (1ull << age_param->port_id);
+		}
+		rte_spinlock_unlock(&priv->aged_sl);
+	}
+	for (i = 0; i < 64; i++) {
+		if (port_mask & (1ull << i)) {
+			priv = rte_eth_devices[i].data->dev_private;
+			if (!rte_atomic16_read(&priv->trigger_event))
+				continue;
+			_rte_eth_dev_callback_process(&rte_eth_devices[i],
+				RTE_ETH_EVENT_FLOW_AGED, NULL);
+			rte_atomic16_set(&priv->trigger_event, 0);
+		}
+	}
+}
+
 /**
  * Handler for the HW respond about ready values from an asynchronous batch
  * query. This function is probably called by the host thread.
@@ -5883,6 +5999,8 @@ mlx5_flow_async_pool_query_handle(struct mlx5_ibv_shared *sh,
 		raw_to_free = pool->raw_hw;
 	} else {
 		raw_to_free = pool->raw;
+		if (IS_AGE_POOL(pool))
+			mlx5_flow_aging_check(pool);
 		rte_spinlock_lock(&pool->sl);
 		pool->raw = pool->raw_hw;
 		rte_spinlock_unlock(&pool->sl);
@@ -6034,3 +6152,40 @@ mlx5_flow_dev_dump(struct rte_eth_dev *dev,
 	return mlx5_devx_cmd_flow_dump(sh->fdb_domain, sh->rx_domain,
 				       sh->tx_domain, file);
 }
+
+/**
+ * Get aged-out flows.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] contexts
+ *   The address of an array of pointers to the aged-out flows contexts.
+ * @param[in] nb_contexts
+ *   The length of the contexts array.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   The number of contexts retrieved on success, otherwise a negative
+ *   errno value. If nb_contexts is 0, return the total number of aged
+ *   contexts. If nb_contexts is not 0, return the number of aged flows
+ *   reported in the contexts array.
+ */
+int
+mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	const struct mlx5_flow_driver_ops *fops;
+	struct rte_flow_attr attr = { .transfer = 0 };
+
+	if (flow_get_drv_type(dev, &attr) == MLX5_FLOW_TYPE_DV) {
+		fops = flow_get_drv_ops(MLX5_FLOW_TYPE_DV);
+		return fops->get_aged_flows(dev, contexts, nb_contexts,
+						    error);
+	}
+	DRV_LOG(ERR,
+		"port %u get aged flows is not supported.",
+		 dev->data->port_id);
+	return -ENOTSUP;
+}
diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 2a1f59698c..bf1d5beb9b 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -199,6 +199,7 @@ enum mlx5_feature_name {
 #define MLX5_FLOW_ACTION_METER (1ull << 31)
 #define MLX5_FLOW_ACTION_SET_IPV4_DSCP (1ull << 32)
 #define MLX5_FLOW_ACTION_SET_IPV6_DSCP (1ull << 33)
+#define MLX5_FLOW_ACTION_AGE (1ull << 34)
 
 #define MLX5_FLOW_FATE_ACTIONS \
 	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | \
@@ -650,6 +651,7 @@ struct mlx5_flow_verbs_workspace {
 /** Device flow structure. */
 struct mlx5_flow {
 	struct rte_flow *flow; /**< Pointer to the main flow. */
+	uint32_t flow_idx; /**< The memory pool index to the main flow. */
 	uint64_t hash_fields; /**< Verbs hash Rx queue hash fields. */
 	uint64_t act_flags;
 	/**< Bit-fields of detected actions, see MLX5_FLOW_ACTION_*. */
@@ -873,6 +875,11 @@ typedef int (*mlx5_flow_counter_query_t)(struct rte_eth_dev *dev,
 					 uint32_t cnt,
 					 bool clear, uint64_t *pkts,
 					 uint64_t *bytes);
+typedef int (*mlx5_flow_get_aged_flows_t)
+					(struct rte_eth_dev *dev,
+					 void **context,
+					 uint32_t nb_contexts,
+					 struct rte_flow_error *error);
 struct mlx5_flow_driver_ops {
 	mlx5_flow_validate_t validate;
 	mlx5_flow_prepare_t prepare;
@@ -888,13 +895,14 @@ struct mlx5_flow_driver_ops {
 	mlx5_flow_counter_alloc_t counter_alloc;
 	mlx5_flow_counter_free_t counter_free;
 	mlx5_flow_counter_query_t counter_query;
+	mlx5_flow_get_aged_flows_t get_aged_flows;
 };
 
 
-#define MLX5_CNT_CONTAINER(sh, batch, thread) (&(sh)->cmng.ccont \
-	[(((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
-#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread) (&(sh)->cmng.ccont \
-	[(~((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
+#define MLX5_CNT_CONTAINER(sh, batch, thread, age) (&(sh)->cmng.ccont \
+	[(((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
+#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread, age) (&(sh)->cmng.ccont \
+	[(~((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
 
 /* mlx5_flow.c */
 
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 784a62c521..73a5f477f8 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -24,6 +24,7 @@
 #include <rte_flow.h>
 #include <rte_flow_driver.h>
 #include <rte_malloc.h>
+#include <rte_cycles.h>
 #include <rte_ip.h>
 #include <rte_gre.h>
 #include <rte_vxlan.h>
@@ -3719,6 +3720,50 @@ mlx5_flow_validate_action_meter(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/**
+ * Validate the age action.
+ *
+ * @param[in] action_flags
+ *   Holds the actions detected until now.
+ * @param[in] action
+ *   Pointer to the age action.
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[out] error
+ *   Pointer to error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_dv_validate_action_age(uint64_t action_flags,
+			    const struct rte_flow_action *action,
+			    struct rte_eth_dev *dev,
+			    struct rte_flow_error *error)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	const struct rte_flow_action_age *age = action->conf;
+
+	if (!priv->config.devx)
+		return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					  NULL,
+					  "age action not supported");
+	if (!(action->conf))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "configuration cannot be null");
+	if (age->timeout >= UINT16_MAX / 2 / 10)
+		return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "Max age time: 3275 seconds");
+	if (action_flags & MLX5_FLOW_ACTION_AGE)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, NULL,
+					  "Duplicate age actions set");
+	return 0;
+}
+
 /**
  * Validate the modify-header IPv4 DSCP actions.
  *
@@ -3896,14 +3941,16 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont;
 	struct mlx5_flow_counter_pool *pool;
-	uint32_t batch = 0;
+	uint32_t batch = 0, age = 0;
 
 	idx--;
+	age = MLX_CNT_IS_AGE(idx);
+	idx = age ? idx - MLX5_CNT_AGE_OFFSET : idx;
 	if (idx >= MLX5_CNT_BATCH_OFFSET) {
 		idx -= MLX5_CNT_BATCH_OFFSET;
 		batch = 1;
 	}
-	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	MLX5_ASSERT(idx / MLX5_COUNTERS_PER_POOL < cont->n);
 	pool = cont->pools[idx / MLX5_COUNTERS_PER_POOL];
 	MLX5_ASSERT(pool);
@@ -4023,18 +4070,21 @@ flow_dv_create_counter_stat_mem_mng(struct rte_eth_dev *dev, int raws_n)
  *   Pointer to the Ethernet device structure.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for aging counters.
  *
  * @return
  *   The new container pointer on success, otherwise NULL and rte_errno is set.
  */
 static struct mlx5_pools_container *
-flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
+flow_dv_container_resize(struct rte_eth_dev *dev,
+				uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont =
-			MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+			MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	struct mlx5_pools_container *new_cont =
-			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0);
+			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0, age);
 	struct mlx5_counter_stats_mem_mng *mem_mng = NULL;
 	uint32_t resize = cont->n + MLX5_CNT_CONTAINER_RESIZE;
 	uint32_t mem_size = sizeof(struct mlx5_flow_counter_pool *) * resize;
@@ -4042,7 +4092,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
 
 	/* Fallback mode has no background thread. Skip the check. */
 	if (!priv->counter_fallback &&
-	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1)) {
+	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1, age)) {
 		/* The last resize still hasn't detected by the host thread. */
 		rte_errno = EAGAIN;
 		return NULL;
@@ -4085,7 +4135,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
 	new_cont->init_mem_mng = mem_mng;
 	rte_cio_wmb();
 	 /* Flip the master container. */
-	priv->sh->cmng.mhi[batch] ^= (uint8_t)1;
+	priv->sh->cmng.mhi[batch][age] ^= (uint8_t)1;
 	return new_cont;
 }
 
@@ -4117,7 +4167,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 		if (priv->counter_fallback)
 			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
 					0, pkts, bytes, 0, NULL, NULL, 0);
@@ -4150,6 +4200,8 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
  *   The devX counter handle.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for counter that was allocated for aging.
  * @param[in/out] cont_cur
  *   Pointer to the container pointer, it will be update in pool resize.
  *
@@ -4158,24 +4210,23 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
  */
 static struct mlx5_pools_container *
 flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
-		    uint32_t batch)
+		    uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_flow_counter_pool *pool;
 	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
-							       0);
+							       0, age);
 	int16_t n_valid = rte_atomic16_read(&cont->n_valid);
-	uint32_t size;
+	uint32_t size = sizeof(*pool);
 
 	if (cont->n == n_valid) {
-		cont = flow_dv_container_resize(dev, batch);
+		cont = flow_dv_container_resize(dev, batch, age);
 		if (!cont)
 			return NULL;
 	}
-	size = sizeof(*pool);
 	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
-	if (!batch)
-		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
+	size += (batch ? 0 : MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE);
+	size += (!age ? 0 : MLX5_COUNTERS_PER_POOL * AGE_SIZE);
 	pool = rte_calloc(__func__, 1, size, 0);
 	if (!pool) {
 		rte_errno = ENOMEM;
@@ -4187,8 +4238,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 						     MLX5_CNT_CONTAINER_RESIZE;
 	pool->raw_hw = NULL;
 	pool->type = 0;
-	if (!batch)
-		pool->type |= CNT_POOL_TYPE_EXT;
+	pool->type |= (batch ? 0 :  CNT_POOL_TYPE_EXT);
+	pool->type |= (!age ? 0 :  CNT_POOL_TYPE_AGE);
 	rte_spinlock_init(&pool->sl);
 	/*
 	 * The generation of the new allocated counters in this pool is 0, 2 in
@@ -4215,6 +4266,39 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 	return cont;
 }
 
+/**
+ * Update the minimum dcs-id for an aged or non-aged counter pool.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] pool
+ *   Current counter pool.
+ * @param[in] batch
+ *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the counter is for aging.
+ */
+static void
+flow_dv_counter_update_min_dcs(struct rte_eth_dev *dev,
+			struct mlx5_flow_counter_pool *pool,
+			uint32_t batch, uint32_t age)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_flow_counter_pool *other;
+	struct mlx5_pools_container *cont;
+
+	cont = MLX5_CNT_CONTAINER(priv->sh,	batch, 0, (age ^ 0x1));
+	other = flow_dv_find_pool_by_id(cont, pool->min_dcs->id);
+	if (!other)
+		return;
+	if (pool->min_dcs->id < other->min_dcs->id) {
+		rte_atomic64_set(&other->a64_dcs,
+			rte_atomic64_read(&pool->a64_dcs));
+	} else {
+		rte_atomic64_set(&pool->a64_dcs,
+			rte_atomic64_read(&other->a64_dcs));
+	}
+}
 /**
  * Prepare a new counter and/or a new counter pool.
  *
@@ -4224,6 +4308,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
  *   Where to put the pointer of a new counter.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for counter that was allocated for aging.
  *
  * @return
  *   The counter container pointer and @p cnt_free is set on success,
@@ -4232,7 +4318,7 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 static struct mlx5_pools_container *
 flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			     struct mlx5_flow_counter **cnt_free,
-			     uint32_t batch)
+			     uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont;
@@ -4241,7 +4327,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 	struct mlx5_flow_counter *cnt;
 	uint32_t i;
 
-	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	if (!batch) {
 		/* bulk_bitmap must be 0 for single counter allocation. */
 		dcs = mlx5_devx_cmd_flow_counter_alloc(priv->sh->ctx, 0);
@@ -4249,7 +4335,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			return NULL;
 		pool = flow_dv_find_pool_by_id(cont, dcs->id);
 		if (!pool) {
-			cont = flow_dv_pool_create(dev, dcs, batch);
+			cont = flow_dv_pool_create(dev, dcs, batch, age);
 			if (!cont) {
 				mlx5_devx_cmd_destroy(dcs);
 				return NULL;
@@ -4259,6 +4345,8 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			rte_atomic64_set(&pool->a64_dcs,
 					 (int64_t)(uintptr_t)dcs);
 		}
+		flow_dv_counter_update_min_dcs(dev,
+						pool, batch, age);
 		i = dcs->id % MLX5_COUNTERS_PER_POOL;
 		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
@@ -4273,7 +4361,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 		rte_errno = ENODATA;
 		return NULL;
 	}
-	cont = flow_dv_pool_create(dev, dcs, batch);
+	cont = flow_dv_pool_create(dev, dcs, batch, age);
 	if (!cont) {
 		mlx5_devx_cmd_destroy(dcs);
 		return NULL;
@@ -4334,13 +4422,15 @@ flow_dv_counter_shared_search(struct mlx5_pools_container *cont, uint32_t id,
  *   Counter identifier.
  * @param[in] group
  *   Counter flow group.
+ * @param[in] age
+ *   Whether the counter was allocated for aging.
  *
  * @return
  *   Index to flow counter on success, 0 otherwise and rte_errno is set.
  */
 static uint32_t
 flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
-		      uint16_t group)
+		      uint16_t group, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_flow_counter_pool *pool = NULL;
@@ -4356,7 +4446,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	 */
 	uint32_t batch = (group && !shared && !priv->counter_fallback) ? 1 : 0;
 	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
-							       0);
+							       0, age);
 	uint32_t cnt_idx;
 
 	if (!priv->config.devx) {
@@ -4395,13 +4485,13 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		cnt_free = NULL;
 	}
 	if (!cnt_free) {
-		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch);
+		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch, age);
 		if (!cont)
 			return 0;
 		pool = TAILQ_FIRST(&cont->pool_list);
 	}
 	if (!batch)
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
 	/* Create a DV counter action only in the first time usage. */
 	if (!cnt_free->action) {
 		uint16_t offset;
@@ -4424,6 +4514,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
 				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
 	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
+	cnt_idx += age * MLX5_CNT_AGE_OFFSET;
 	/* Update the counter reset values. */
 	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
 				 &cnt_free->bytes))
@@ -4445,6 +4536,62 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	return cnt_idx;
 }
 
+/**
+ * Get age param from counter index.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] counter
+ *   Index to the counter handler.
+ *
+ * @return
+ *   The aging parameter specified for the counter index.
+ */
+static struct mlx5_age_param*
+flow_dv_counter_idx_get_age(struct rte_eth_dev *dev,
+				uint32_t counter)
+{
+	struct mlx5_flow_counter *cnt;
+	struct mlx5_flow_counter_pool *pool = NULL;
+
+	flow_dv_counter_get_by_idx(dev, counter, &pool);
+	counter = (counter - 1) % MLX5_COUNTERS_PER_POOL;
+	cnt = MLX5_POOL_GET_CNT(pool, counter);
+	return MLX5_CNT_TO_AGE(cnt);
+}
+
+/**
+ * Remove a flow counter from aged counter list.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] counter
+ *   Index to the counter handler.
+ * @param[in] cnt
+ *   Pointer to the counter handler.
+ */
+static void
+flow_dv_counter_remove_from_age(struct rte_eth_dev *dev,
+				uint32_t counter, struct mlx5_flow_counter *cnt)
+{
+	struct mlx5_age_param *age_param;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
+	age_param = flow_dv_counter_idx_get_age(dev, counter);
+	if (rte_atomic16_cmpset((volatile uint16_t *)
+			&age_param->state,
+			AGE_CANDIDATE, AGE_FREE)
+			!= AGE_CANDIDATE) {
+		/**
+		 * Take the lock even if the counter already timed out,
+		 * since the aging check may still be processing it.
+		 */
+		rte_spinlock_lock(&priv->aged_sl);
+		TAILQ_REMOVE(&priv->aged_counters, cnt, next);
+		rte_spinlock_unlock(&priv->aged_sl);
+	}
+	rte_atomic16_set(&age_param->state, AGE_FREE);
+}
 /**
  * Release a flow counter.
  *
@@ -4465,10 +4612,12 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 		if (cnt_ext && --cnt_ext->ref_cnt)
 			return;
 	}
+	if (IS_AGE_POOL(pool))
+		flow_dv_counter_remove_from_age(dev, counter, cnt);
 	/* Put the counter in the end - the last updated one. */
 	TAILQ_INSERT_TAIL(&pool->counters, cnt, next);
 	/*
@@ -5243,6 +5392,15 @@ flow_dv_validate(struct rte_eth_dev *dev, const struct rte_flow_attr *attr,
 			/* Meter action will add one more TAG action. */
 			rw_act_num += MLX5_ACT_NUM_SET_TAG;
 			break;
+		case RTE_FLOW_ACTION_TYPE_AGE:
+			ret = flow_dv_validate_action_age(action_flags,
+							  actions, dev,
+							  error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_FLOW_ACTION_AGE;
+			++actions_n;
+			break;
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP:
 			ret = flow_dv_validate_action_modify_ipv4_dscp
 							 (action_flags,
@@ -7281,6 +7439,54 @@ flow_dv_translate_action_port_id(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/**
+ * Create a counter with aging configuration.
+ *
+ * @param[in] dev
+ *   Pointer to rte_eth_dev structure.
+ * @param[in] count
+ *   Pointer to the counter action configuration.
+ * @param[in] age
+ *   Pointer to the aging action configuration.
+ *
+ * @return
+ *   Index to flow counter on success, 0 otherwise.
+ */
+static uint32_t
+flow_dv_translate_create_counter(struct rte_eth_dev *dev,
+				struct mlx5_flow *dev_flow,
+				const struct rte_flow_action_count *count,
+				const struct rte_flow_action_age *age)
+{
+	uint32_t counter;
+	struct mlx5_age_param *age_param;
+
+	counter = flow_dv_counter_alloc(dev,
+				count ? count->shared : 0,
+				count ? count->id : 0,
+				dev_flow->dv.group, !!age);
+
+	if (!counter || age == NULL)
+		return counter;
+	age_param  = flow_dv_counter_idx_get_age(dev, counter);
+	/*
+	 * Report the application context if one was given, otherwise
+	 * fall back to the flow index.
+	 */
+	age_param->context = age->context ? age->context :
+		(void *)(uintptr_t)(dev_flow->flow_idx);
+	/*
+	 * The counter aging check may be slightly delayed. Subtract a
+	 * 0.7 second bias from the timeout so the flow ages out in time.
+	 */
+	age_param->timeout = age->timeout * 10 - 7;
+	age_param->port_id = dev->data->port_id;
+	/* Set expire time in unit of 0.1 sec. */
+	age_param->expire = age_param->timeout +
+			rte_rdtsc() / (rte_get_tsc_hz() / 10);
+	rte_atomic16_set(&age_param->state, AGE_CANDIDATE);
+	return counter;
+}
 /**
  * Add Tx queue matcher
  *
@@ -7450,6 +7656,8 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 			    (MLX5_MAX_MODIFY_NUM + 1)];
 	} mhdr_dummy;
 	struct mlx5_flow_dv_modify_hdr_resource *mhdr_res = &mhdr_dummy.res;
+	const struct rte_flow_action_count *count = NULL;
+	const struct rte_flow_action_age *age = NULL;
 	union flow_dv_attr flow_attr = { .attr = 0 };
 	uint32_t tag_be;
 	union mlx5_flow_tbl_key tbl_key;
@@ -7478,7 +7686,6 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 		const struct rte_flow_action_queue *queue;
 		const struct rte_flow_action_rss *rss;
 		const struct rte_flow_action *action = actions;
-		const struct rte_flow_action_count *count = action->conf;
 		const uint8_t *rss_key;
 		const struct rte_flow_action_jump *jump_data;
 		const struct rte_flow_action_meter *mtr;
@@ -7607,36 +7814,21 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 			action_flags |= MLX5_FLOW_ACTION_RSS;
 			dev_flow->handle->fate_action = MLX5_FLOW_FATE_QUEUE;
 			break;
+		case RTE_FLOW_ACTION_TYPE_AGE:
 		case RTE_FLOW_ACTION_TYPE_COUNT:
 			if (!dev_conf->devx) {
-				rte_errno = ENOTSUP;
-				goto cnt_err;
-			}
-			flow->counter = flow_dv_counter_alloc(dev,
-							count->shared,
-							count->id,
-							dev_flow->dv.group);
-			if (!flow->counter)
-				goto cnt_err;
-			dev_flow->dv.actions[actions_n++] =
-				  (flow_dv_counter_get_by_idx(dev,
-				  flow->counter, NULL))->action;
-			action_flags |= MLX5_FLOW_ACTION_COUNT;
-			break;
-cnt_err:
-			if (rte_errno == ENOTSUP)
 				return rte_flow_error_set
 					      (error, ENOTSUP,
 					       RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 					       NULL,
 					       "count action not supported");
+			}
+			/* Save information first, will apply later. */
+			if (actions->type == RTE_FLOW_ACTION_TYPE_COUNT)
+				count = action->conf;
 			else
-				return rte_flow_error_set
-						(error, rte_errno,
-						 RTE_FLOW_ERROR_TYPE_ACTION,
-						 action,
-						 "cannot create counter"
-						  " object.");
+				age = action->conf;
+			action_flags |= MLX5_FLOW_ACTION_COUNT;
 			break;
 		case RTE_FLOW_ACTION_TYPE_OF_POP_VLAN:
 			dev_flow->dv.actions[actions_n++] =
@@ -7909,6 +8101,22 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 				dev_flow->dv.actions[modify_action_position] =
 					handle->dvh.modify_hdr->verbs_action;
 			}
+			if (action_flags & MLX5_FLOW_ACTION_COUNT) {
+				flow->counter =
+					flow_dv_translate_create_counter(dev,
+						dev_flow, count, age);
+
+				if (!flow->counter)
+					return rte_flow_error_set
+						(error, rte_errno,
+						RTE_FLOW_ERROR_TYPE_ACTION,
+						NULL,
+						"cannot create counter"
+						" object.");
+				dev_flow->dv.actions[actions_n++] =
+					  (flow_dv_counter_get_by_idx(dev,
+					  flow->counter, NULL))->action;
+			}
 			break;
 		default:
 			break;
@@ -9169,6 +9377,58 @@ flow_dv_counter_query(struct rte_eth_dev *dev, uint32_t counter, bool clear,
 	return 0;
 }
 
+/**
+ * Get aged-out flows.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] context
+ *   The address of an array of pointers to the aged-out flows contexts.
+ * @param[in] nb_contexts
+ *   The length of context array pointers.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   The number of contexts retrieved on success, otherwise a negative
+ *   errno value. If nb_contexts is 0, return the total number of aged
+ *   contexts. If nb_contexts is not 0, return the number of aged flows
+ *   reported in the context array.
+ * @note A call to this function re-arms the aged-out event trigger.
+ */
+static int
+flow_get_aged_flows(struct rte_eth_dev *dev,
+		    void **context,
+		    uint32_t nb_contexts,
+		    struct rte_flow_error *error)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_age_param *age_param;
+	struct mlx5_flow_counter *counter;
+	int nb_flows = 0;
+
+	if (nb_contexts && !context)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					  NULL,
+					  "Should assign at least one flow or"
+					  " context to get if nb_contexts != 0");
+	rte_spinlock_lock(&priv->aged_sl);
+	TAILQ_FOREACH(counter, &priv->aged_counters, next) {
+		nb_flows++;
+		if (nb_contexts) {
+			age_param = MLX5_CNT_TO_AGE(counter);
+			context[nb_flows - 1] = age_param->context;
+			if (!(--nb_contexts))
+				break;
+		}
+	}
+	rte_spinlock_unlock(&priv->aged_sl);
+	rte_atomic16_set(&priv->trigger_event, 1);
+	return nb_flows;
+}
+
 /*
  * Mutex-protected thunk to lock-free  __flow_dv_translate().
  */
@@ -9235,7 +9495,7 @@ flow_dv_counter_allocate(struct rte_eth_dev *dev)
 	uint32_t cnt;
 
 	flow_dv_shared_lock(dev);
-	cnt = flow_dv_counter_alloc(dev, 0, 0, 1);
+	cnt = flow_dv_counter_alloc(dev, 0, 0, 1, 0);
 	flow_dv_shared_unlock(dev);
 	return cnt;
 }
@@ -9266,6 +9526,7 @@ const struct mlx5_flow_driver_ops mlx5_flow_dv_drv_ops = {
 	.counter_alloc = flow_dv_counter_allocate,
 	.counter_free = flow_dv_counter_free,
 	.counter_query = flow_dv_counter_query,
+	.get_aged_flows = flow_get_aged_flows,
 };
 
 #endif /* HAVE_IBV_FLOW_DV_SUPPORT */
diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
index 236d665852..7efd97f547 100644
--- a/drivers/net/mlx5/mlx5_flow_verbs.c
+++ b/drivers/net/mlx5/mlx5_flow_verbs.c
@@ -56,7 +56,8 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
 			      struct mlx5_flow_counter_pool **ppool)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
+	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
+									0);
 	struct mlx5_flow_counter_pool *pool;
 
 	idx--;
@@ -151,7 +152,8 @@ static uint32_t
 flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
+	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
+									0);
 	struct mlx5_flow_counter_pool *pool = NULL;
 	struct mlx5_flow_counter_ext *cnt_ext = NULL;
 	struct mlx5_flow_counter *cnt = NULL;
@@ -251,7 +253,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 
 	cnt = flow_verbs_counter_get_by_idx(dev, counter,
 					    &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	if (--cnt_ext->ref_cnt == 0) {
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
@@ -282,7 +284,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
 		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
 						(dev, flow->counter, &pool);
 		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
-						(cnt);
+						(pool, cnt);
 		struct rte_flow_query_count *qc = data;
 		uint64_t counters[2] = {0, 0};
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
@@ -1083,12 +1085,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
 	}
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	counter.counter_set_handle = cnt_ext->cs->handle;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	counter.counters = cnt_ext->cs;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #endif
-- 
2.21.0
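
The expiry test in mlx5_flow_aging_check() compares 16-bit timestamps
counted in 0.1 second ticks, so it has to stay correct when the tick counter
wraps around. The following is a minimal stand-alone sketch of that
arithmetic only. The helper names age_ticks_from_seconds() and age_expired()
are hypothetical and do not exist in the driver, the tick source
(rte_rdtsc() in the patch) is replaced by plain integers, and the lower
bound of one second in the assert is an assumption of this sketch, not a
check made by the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a timeout in seconds to 0.1 s ticks,
 * applying the same 7-tick bias as the patch (timeout * 10 - 7).
 */
static uint16_t
age_ticks_from_seconds(uint16_t seconds)
{
	/*
	 * The patch rejects timeouts >= UINT16_MAX / 2 / 10 (3276 s).
	 * The lower bound of 1 s is an assumption made for this sketch.
	 */
	assert(seconds >= 1 && seconds < UINT16_MAX / 2 / 10);
	return (uint16_t)(seconds * 10 - 7);
}

/*
 * Hypothetical helper: return 1 when `now` has reached `expire`.
 * The subtraction wraps modulo 2^16, so a result in the upper half of
 * the range means `expire` is still in the future.
 */
static int
age_expired(uint16_t now, uint16_t expire)
{
	return (uint16_t)(now - expire) < (UINT16_MAX / 2);
}
```

For example, with now = 65530 and a 2 second timeout (13 ticks), the expiry
time wraps to tick 7. age_expired() keeps returning 0 while the counter
wraps through 0 and returns 1 once 13 ticks have elapsed.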

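The return convention documented for flow_get_aged_flows() (a count-only
call when nb_contexts is 0, otherwise a copy of at most nb_contexts
entries) can be illustrated without any driver types. This is a sketch
under stated assumptions: struct aged_list and get_aged() are hypothetical
stand-ins for the per-port tailq and the driver callback, the spinlock is
omitted, and a plain int replaces the atomic trigger_event flag.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-port aged list; not a driver type. */
struct aged_list {
	void **ctx;   /* contexts of the aged-out flows, oldest first */
	uint32_t n;   /* number of entries in ctx */
	int trigger;  /* 1 when the next aged-out flow may raise an event */
};

/*
 * Hypothetical mirror of the loop in flow_get_aged_flows(): count every
 * entry when nb_contexts is 0, otherwise copy at most nb_contexts entries.
 */
static int
get_aged(struct aged_list *list, void **contexts, uint32_t nb_contexts)
{
	uint32_t i;
	int nb_flows = 0;

	if (nb_contexts && !contexts)
		return -EINVAL;
	for (i = 0; i < list->n; i++) {
		nb_flows++;
		if (nb_contexts) {
			contexts[nb_flows - 1] = list->ctx[i];
			if (!(--nb_contexts))
				break;
		}
	}
	/* Re-arm the event, as the patch does after each call. */
	list->trigger = 1;
	return nb_flows;
}
```

An application would typically call such a function once with nb_contexts
set to 0 to learn how many contexts are pending, then call it again with an
array of that size. Entries are not removed by the query in this sketch, in
line with the patch, where a counter leaves the aged list only on release.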

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] net/mlx5: support flow aging
  2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: support flow aging Bill Zhou
@ 2020-04-26  7:07         ` Suanming Mou
  0 siblings, 0 replies; 50+ messages in thread
From: Suanming Mou @ 2020-04-26  7:07 UTC (permalink / raw)
  To: Bill Zhou, matan, shahafs, viacheslavo, marko.kovacevic,
	john.mcnamara, orika
  Cc: dev

On 4/24/2020 6:45 PM, Bill Zhou wrote:
> Currently, the mlx5 driver has no flow aging check or age-out event
> callback mechanism; this patch implements them. It includes:
> - Splitting the current counter container into aged and non-aged
>    containers to reduce memory consumption. The aged container allocates
>    extra memory to save the aging parameters from the user configuration.
> - An aging check and age-out event callback mechanism based on the
>    current counters. When a flow is found to be aged out, the
>    RTE_ETH_EVENT_FLOW_AGED event is triggered to applications.
> - Implementing the new API rte_flow_get_aged_flows, which applications
>    can use to get aged flows.
>
> Signed-off-by: Bill Zhou <dongz@mellanox.com>
Reviewed-by: Suanming Mou <suanmingm@mellanox.com>
> ---
> v2: Move the aging list from struct mlx5_ibv_shared to struct mlx5_priv,
> so each port has its own aging list. The event is now triggered only once
> after the last call of rte_flow_get_aged_flows.
> ---
>   doc/guides/rel_notes/release_20_05.rst |   1 +
>   drivers/net/mlx5/mlx5.c                |  86 +++---
>   drivers/net/mlx5/mlx5.h                |  49 +++-
>   drivers/net/mlx5/mlx5_flow.c           | 201 ++++++++++++--
>   drivers/net/mlx5/mlx5_flow.h           |  16 +-
>   drivers/net/mlx5/mlx5_flow_dv.c        | 361 +++++++++++++++++++++----
>   drivers/net/mlx5/mlx5_flow_verbs.c     |  14 +-
>   7 files changed, 607 insertions(+), 121 deletions(-)
>
> diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
> index b124c3f287..a5ba8a4792 100644
> --- a/doc/guides/rel_notes/release_20_05.rst
> +++ b/doc/guides/rel_notes/release_20_05.rst
> @@ -141,6 +141,7 @@ New Features
>     * Added support for creating Relaxed Ordering Memory Regions.
>     * Added support for jumbo frame size (9K MTU) in Multi-Packet RQ mode.
>     * Optimized the memory consumption of flow.
> +  * Added support for flow aging based on hardware counter.
>   
>   * **Updated the AESNI MB crypto PMD.**
>   
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index 57d76cb741..674d0ea9d3 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -437,6 +437,20 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
>   	return 0;
>   }
>   
> +/**
> + * Initialize the private aging list information.
> + *
> + * @param[in] priv
> + *   Pointer to the private device data structure.
> + */
> +static void
> +mlx5_flow_aging_list_init(struct mlx5_priv *priv)
> +{
> +	TAILQ_INIT(&priv->aged_counters);
> +	rte_spinlock_init(&priv->aged_sl);
> +	rte_atomic16_set(&priv->trigger_event, 1);
> +}
> +
>   /**
>    * Initialize the counters management structure.
>    *
> @@ -446,11 +460,14 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
>   static void
>   mlx5_flow_counters_mng_init(struct mlx5_ibv_shared *sh)
>   {
> -	uint8_t i;
> +	uint8_t i, age;
>   
> +	sh->cmng.age = 0;
>   	TAILQ_INIT(&sh->cmng.flow_counters);
> -	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
> -		TAILQ_INIT(&sh->cmng.ccont[i].pool_list);
> +	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
> +		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
> +			TAILQ_INIT(&sh->cmng.ccont[i][age].pool_list);
> +	}
>   }
>   
>   /**
> @@ -480,7 +497,7 @@ static void
>   mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
>   {
>   	struct mlx5_counter_stats_mem_mng *mng;
> -	uint8_t i;
> +	uint8_t i, age = 0;
>   	int j;
>   	int retries = 1024;
>   
> @@ -491,36 +508,42 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
>   			break;
>   		rte_pause();
>   	}
> -	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
> -		struct mlx5_flow_counter_pool *pool;
> -		uint32_t batch = !!(i % 2);
>   
> -		if (!sh->cmng.ccont[i].pools)
> -			continue;
> -		pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
> -		while (pool) {
> -			if (batch) {
> -				if (pool->min_dcs)
> -					claim_zero
> -					(mlx5_devx_cmd_destroy(pool->min_dcs));
> -			}
> -			for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
> -				if (MLX5_POOL_GET_CNT(pool, j)->action)
> -					claim_zero
> -					(mlx5_glue->destroy_flow_action
> -					 (MLX5_POOL_GET_CNT(pool, j)->action));
> -				if (!batch && MLX5_GET_POOL_CNT_EXT
> -				    (pool, j)->dcs)
> -					claim_zero(mlx5_devx_cmd_destroy
> -						  (MLX5_GET_POOL_CNT_EXT
> -						  (pool, j)->dcs));
> +	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
> +		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
> +			struct mlx5_flow_counter_pool *pool;
> +			uint32_t batch = !!(i % 2);
> +
> +			if (!sh->cmng.ccont[i][age].pools)
> +				continue;
> +			pool = TAILQ_FIRST(&sh->cmng.ccont[i][age].pool_list);
> +			while (pool) {
> +				if (batch) {
> +					if (pool->min_dcs)
> +						claim_zero
> +						(mlx5_devx_cmd_destroy
> +						(pool->min_dcs));
> +				}
> +				for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
> +					if (MLX5_POOL_GET_CNT(pool, j)->action)
> +						claim_zero
> +						(mlx5_glue->destroy_flow_action
> +						 (MLX5_POOL_GET_CNT
> +						  (pool, j)->action));
> +					if (!batch && MLX5_GET_POOL_CNT_EXT
> +					    (pool, j)->dcs)
> +						claim_zero(mlx5_devx_cmd_destroy
> +							  (MLX5_GET_POOL_CNT_EXT
> +							  (pool, j)->dcs));
> +				}
> +				TAILQ_REMOVE(&sh->cmng.ccont[i][age].pool_list,
> +					pool, next);
> +				rte_free(pool);
> +				pool = TAILQ_FIRST
> +					(&sh->cmng.ccont[i][age].pool_list);
>   			}
> -			TAILQ_REMOVE(&sh->cmng.ccont[i].pool_list, pool,
> -				     next);
> -			rte_free(pool);
> -			pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
> +			rte_free(sh->cmng.ccont[i][age].pools);
>   		}
> -		rte_free(sh->cmng.ccont[i].pools);
>   	}
>   	mng = LIST_FIRST(&sh->cmng.mem_mngs);
>   	while (mng) {
> @@ -3003,6 +3026,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
>   			goto error;
>   		}
>   	}
> +	mlx5_flow_aging_list_init(priv);
>   	return eth_dev;
>   error:
>   	if (priv) {
> diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
> index 51c3f33e6b..d1b358e929 100644
> --- a/drivers/net/mlx5/mlx5.h
> +++ b/drivers/net/mlx5/mlx5.h
> @@ -222,13 +222,21 @@ struct mlx5_drop {
>   #define MLX5_COUNTERS_PER_POOL 512
>   #define MLX5_MAX_PENDING_QUERIES 4
>   #define MLX5_CNT_CONTAINER_RESIZE 64
> +#define MLX5_CNT_AGE_OFFSET 0x80000000
>   #define CNT_SIZE (sizeof(struct mlx5_flow_counter))
>   #define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
> +#define AGE_SIZE (sizeof(struct mlx5_age_param))
>   
>   #define CNT_POOL_TYPE_EXT	(1 << 0)
> +#define CNT_POOL_TYPE_AGE	(1 << 1)
>   #define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
> +#define IS_AGE_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_AGE)
> +#define MLX_CNT_IS_AGE(counter) ((counter) & MLX5_CNT_AGE_OFFSET ? 1 : 0)
> +
>   #define MLX5_CNT_LEN(pool) \
> -	(CNT_SIZE + (IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
> +	(CNT_SIZE + \
> +	(IS_AGE_POOL(pool) ? AGE_SIZE : 0) + \
> +	(IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
>   #define MLX5_POOL_GET_CNT(pool, index) \
>   	((struct mlx5_flow_counter *) \
>   	((char *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
> @@ -242,13 +250,33 @@ struct mlx5_drop {
>    */
>   #define MLX5_MAKE_CNT_IDX(pi, offset) \
>   	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
> -#define MLX5_CNT_TO_CNT_EXT(cnt) \
> -	((struct mlx5_flow_counter_ext *)((cnt) + 1))
> +#define MLX5_CNT_TO_CNT_EXT(pool, cnt) \
> +	((struct mlx5_flow_counter_ext *)\
> +	((char *)((cnt) + 1) + \
> +	(IS_AGE_POOL(pool) ? AGE_SIZE : 0)))
>   #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
> -	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
> +	MLX5_CNT_TO_CNT_EXT(pool, MLX5_POOL_GET_CNT((pool), (offset)))
> +#define MLX5_CNT_TO_AGE(cnt) \
> +	((struct mlx5_age_param *)((cnt) + 1))
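
For readers following the layout macros above: each pool element appears to be a counter, followed by an optional age parameter, followed by an optional extension. Below is a minimal standalone sketch of that offset arithmetic. The `demo_*` types and names are hypothetical stand-ins that only mimic the layout; they are not the real mlx5 structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mlx5 structures; only their sizes matter. */
struct demo_cnt { uint64_t hits; uint64_t bytes; };
struct demo_age { uint16_t state; uint16_t port_id; uint32_t time; void *context; };
struct demo_ext { uint32_t ref_cnt; uint32_t id; void *dcs; };

#define DEMO_TYPE_EXT (1u << 0)
#define DEMO_TYPE_AGE (1u << 1)

/* Stride of one pool element, mirroring the idea of MLX5_CNT_LEN(). */
static size_t
demo_cnt_len(unsigned int type)
{
	assert((type & ~(DEMO_TYPE_EXT | DEMO_TYPE_AGE)) == 0);
	return sizeof(struct demo_cnt) +
	       ((type & DEMO_TYPE_AGE) ? sizeof(struct demo_age) : 0) +
	       ((type & DEMO_TYPE_EXT) ? sizeof(struct demo_ext) : 0);
}

/* The age parameter sits right after the counter (cf. MLX5_CNT_TO_AGE). */
static size_t
demo_age_offset(void)
{
	return sizeof(struct demo_cnt);
}

/* The extension is pushed back by the age parameter when one is present. */
static size_t
demo_ext_offset(unsigned int type)
{
	return sizeof(struct demo_cnt) +
	       ((type & DEMO_TYPE_AGE) ? sizeof(struct demo_age) : 0);
}
```

This is why MLX5_CNT_TO_CNT_EXT() now needs the pool argument: the extension offset depends on whether the pool carries age parameters.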
>   
>   struct mlx5_flow_counter_pool;
>   
> +/*age status*/
> +enum {
> +	AGE_FREE,
> +	AGE_CANDIDATE, /* Counter assigned to flows. */
> +	AGE_TMOUT, /* Timeout, wait for aged flows query and destroy. */
> +};
> +
> +/* Counter age parameter. */
> +struct mlx5_age_param {
> +	rte_atomic16_t state; /**< Age state. */
> +	uint16_t port_id; /**< Port id of the counter. */
> +	uint32_t timeout:15; /**< Age timeout in unit of 0.1sec. */
> +	uint32_t expire:16; /**< Expire time(0.1sec) in the future. */
> +	void *context; /**< Flow counter age context. */
> +};
> +
>   struct flow_counter_stats {
>   	uint64_t hits;
>   	uint64_t bytes;
> @@ -336,13 +364,14 @@ struct mlx5_pools_container {
>   
>   /* Counter global management structure. */
>   struct mlx5_flow_counter_mng {
> -	uint8_t mhi[2]; /* master \ host container index. */
> -	struct mlx5_pools_container ccont[2 * 2];
> -	/* 2 containers for single and for batch for double-buffer. */
> +	uint8_t mhi[2][2]; /* master \ host and age \ no age container index. */
> +	struct mlx5_pools_container ccont[2 * 2][2];
> +	/* master \ host and age \ no age pools container. */
>   	struct mlx5_counters flow_counters; /* Legacy flow counter list. */
>   	uint8_t pending_queries;
>   	uint8_t batch;
>   	uint16_t pool_index;
> +	uint8_t age;
>   	uint8_t query_thread_on;
>   	LIST_HEAD(mem_mngs, mlx5_counter_stats_mem_mng) mem_mngs;
>   	LIST_HEAD(stat_raws, mlx5_counter_stats_raw) free_stat_raws;
> @@ -566,6 +595,10 @@ struct mlx5_priv {
>   	uint8_t fdb_def_rule; /* Whether fdb jump to table 1 is configured. */
>   	struct mlx5_mp_id mp_id; /* ID of a multi-process process */
>   	LIST_HEAD(fdir, mlx5_fdir_flow) fdir_flows; /* fdir flows. */
> +	struct mlx5_counters aged_counters; /* Aged flow counter list. */
> +	rte_spinlock_t aged_sl; /* Aged flow counter list lock. */
> +	rte_atomic16_t trigger_event;
> +	/* Event is triggered once after last rte_flow_get_aged_flows() call */
>   };
>   
>   #define PORT_ID(priv) ((priv)->dev_data->port_id)
> @@ -764,6 +797,8 @@ int mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
>   int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
>   		       struct rte_flow_error *error);
>   void mlx5_flow_rxq_dynf_metadata_set(struct rte_eth_dev *dev);
> +int mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
> +			uint32_t nb_contexts, struct rte_flow_error *error);
>   
>   /* mlx5_mp.c */
>   int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
> diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
> index cba1f23e81..c691b43722 100644
> --- a/drivers/net/mlx5/mlx5_flow.c
> +++ b/drivers/net/mlx5/mlx5_flow.c
> @@ -24,6 +24,7 @@
>   #include <rte_ether.h>
>   #include <rte_ethdev_driver.h>
>   #include <rte_flow.h>
> +#include <rte_cycles.h>
>   #include <rte_flow_driver.h>
>   #include <rte_malloc.h>
>   #include <rte_ip.h>
> @@ -242,6 +243,7 @@ static const struct rte_flow_ops mlx5_flow_ops = {
>   	.isolate = mlx5_flow_isolate,
>   	.query = mlx5_flow_query,
>   	.dev_dump = mlx5_flow_dev_dump,
> +	.get_aged_flows = mlx5_flow_get_aged_flows,
>   };
>   
>   /* Convert FDIR request to Generic flow. */
> @@ -2531,6 +2533,8 @@ flow_drv_validate(struct rte_eth_dev *dev,
>    *   Pointer to the list of items.
>    * @param[in] actions
>    *   Pointer to the list of actions.
> + * @param[in] flow_idx
> + *   Memory pool index of the flow.
>    * @param[out] error
>    *   Pointer to the error structure.
>    *
> @@ -2543,14 +2547,19 @@ flow_drv_prepare(struct rte_eth_dev *dev,
>   		 const struct rte_flow_attr *attr,
>   		 const struct rte_flow_item items[],
>   		 const struct rte_flow_action actions[],
> +		 uint32_t flow_idx,
>   		 struct rte_flow_error *error)
>   {
>   	const struct mlx5_flow_driver_ops *fops;
>   	enum mlx5_flow_drv_type type = flow->drv_type;
> +	struct mlx5_flow *mlx5_flow = NULL;
>   
>   	MLX5_ASSERT(type > MLX5_FLOW_TYPE_MIN && type < MLX5_FLOW_TYPE_MAX);
>   	fops = flow_get_drv_ops(type);
> -	return fops->prepare(dev, attr, items, actions, error);
> +	mlx5_flow = fops->prepare(dev, attr, items, actions, error);
> +	if (mlx5_flow)
> +		mlx5_flow->flow_idx = flow_idx;
> +	return mlx5_flow;
>   }
>   
>   /**
> @@ -3498,6 +3507,8 @@ flow_hairpin_split(struct rte_eth_dev *dev,
>    *   Associated actions (list terminated by the END action).
>    * @param[in] external
>    *   This flow rule is created by request external to PMD.
> + * @param[in] flow_idx
> + *   Memory pool index of the flow.
>    * @param[out] error
>    *   Perform verbose error reporting if not NULL.
>    * @return
> @@ -3511,11 +3522,13 @@ flow_create_split_inner(struct rte_eth_dev *dev,
>   			const struct rte_flow_attr *attr,
>   			const struct rte_flow_item items[],
>   			const struct rte_flow_action actions[],
> -			bool external, struct rte_flow_error *error)
> +			bool external, uint32_t flow_idx,
> +			struct rte_flow_error *error)
>   {
>   	struct mlx5_flow *dev_flow;
>   
> -	dev_flow = flow_drv_prepare(dev, flow, attr, items, actions, error);
> +	dev_flow = flow_drv_prepare(dev, flow, attr, items, actions,
> +		flow_idx, error);
>   	if (!dev_flow)
>   		return -rte_errno;
>   	dev_flow->flow = flow;
> @@ -3876,6 +3889,8 @@ flow_mreg_tx_copy_prep(struct rte_eth_dev *dev,
>    *   Associated actions (list terminated by the END action).
>    * @param[in] external
>    *   This flow rule is created by request external to PMD.
> + * @param[in] flow_idx
> + *   Memory pool index of the flow.
>    * @param[out] error
>    *   Perform verbose error reporting if not NULL.
>    * @return
> @@ -3888,7 +3903,8 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
>   			   const struct rte_flow_attr *attr,
>   			   const struct rte_flow_item items[],
>   			   const struct rte_flow_action actions[],
> -			   bool external, struct rte_flow_error *error)
> +			   bool external, uint32_t flow_idx,
> +			   struct rte_flow_error *error)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct mlx5_dev_config *config = &priv->config;
> @@ -3908,7 +3924,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
>   	    !mlx5_flow_ext_mreg_supported(dev))
>   		return flow_create_split_inner(dev, flow, NULL, prefix_layers,
>   					       attr, items, actions, external,
> -					       error);
> +					       flow_idx, error);
>   	actions_n = flow_parse_metadata_split_actions_info(actions, &qrss,
>   							   &encap_idx);
>   	if (qrss) {
> @@ -3992,7 +4008,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
>   	/* Add the unmodified original or prefix subflow. */
>   	ret = flow_create_split_inner(dev, flow, &dev_flow, prefix_layers, attr,
>   				      items, ext_actions ? ext_actions :
> -				      actions, external, error);
> +				      actions, external, flow_idx, error);
>   	if (ret < 0)
>   		goto exit;
>   	MLX5_ASSERT(dev_flow);
> @@ -4055,7 +4071,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
>   		ret = flow_create_split_inner(dev, flow, &dev_flow, layers,
>   					      &q_attr, mtr_sfx ? items :
>   					      q_items, q_actions,
> -					      external, error);
> +					      external, flow_idx, error);
>   		if (ret < 0)
>   			goto exit;
>   		/* qrss ID should be freed if failed. */
> @@ -4096,6 +4112,8 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
>    *   Associated actions (list terminated by the END action).
>    * @param[in] external
>    *   This flow rule is created by request external to PMD.
> + * @param[in] flow_idx
> + *   Memory pool index of the flow.
>    * @param[out] error
>    *   Perform verbose error reporting if not NULL.
>    * @return
> @@ -4107,7 +4125,8 @@ flow_create_split_meter(struct rte_eth_dev *dev,
>   			   const struct rte_flow_attr *attr,
>   			   const struct rte_flow_item items[],
>   			   const struct rte_flow_action actions[],
> -			   bool external, struct rte_flow_error *error)
> +			   bool external, uint32_t flow_idx,
> +			   struct rte_flow_error *error)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct rte_flow_action *sfx_actions = NULL;
> @@ -4151,7 +4170,7 @@ flow_create_split_meter(struct rte_eth_dev *dev,
>   		/* Add the prefix subflow. */
>   		ret = flow_create_split_inner(dev, flow, &dev_flow, 0, attr,
>   					      items, pre_actions, external,
> -					      error);
> +					      flow_idx, error);
>   		if (ret) {
>   			ret = -rte_errno;
>   			goto exit;
> @@ -4168,7 +4187,7 @@ flow_create_split_meter(struct rte_eth_dev *dev,
>   					 0, &sfx_attr,
>   					 sfx_items ? sfx_items : items,
>   					 sfx_actions ? sfx_actions : actions,
> -					 external, error);
> +					 external, flow_idx, error);
>   exit:
>   	if (sfx_actions)
>   		rte_free(sfx_actions);
> @@ -4205,6 +4224,8 @@ flow_create_split_meter(struct rte_eth_dev *dev,
>    *   Associated actions (list terminated by the END action).
>    * @param[in] external
>    *   This flow rule is created by request external to PMD.
> + * @param[in] flow_idx
> + *   Memory pool index of the flow.
>    * @param[out] error
>    *   Perform verbose error reporting if not NULL.
>    * @return
> @@ -4216,12 +4237,13 @@ flow_create_split_outer(struct rte_eth_dev *dev,
>   			const struct rte_flow_attr *attr,
>   			const struct rte_flow_item items[],
>   			const struct rte_flow_action actions[],
> -			bool external, struct rte_flow_error *error)
> +			bool external, uint32_t flow_idx,
> +			struct rte_flow_error *error)
>   {
>   	int ret;
>   
>   	ret = flow_create_split_meter(dev, flow, attr, items,
> -					 actions, external, error);
> +					 actions, external, flow_idx, error);
>   	MLX5_ASSERT(ret <= 0);
>   	return ret;
>   }
> @@ -4356,7 +4378,7 @@ flow_list_create(struct rte_eth_dev *dev, uint32_t *list,
>   		 */
>   		ret = flow_create_split_outer(dev, flow, attr,
>   					      buf->entry[i].pattern,
> -					      p_actions_rx, external,
> +					      p_actions_rx, external, idx,
>   					      error);
>   		if (ret < 0)
>   			goto error;
> @@ -4367,7 +4389,8 @@ flow_list_create(struct rte_eth_dev *dev, uint32_t *list,
>   		attr_tx.ingress = 0;
>   		attr_tx.egress = 1;
>   		dev_flow = flow_drv_prepare(dev, flow, &attr_tx, items_tx.items,
> -					    actions_hairpin_tx.actions, error);
> +					 actions_hairpin_tx.actions,
> +					 idx, error);
>   		if (!dev_flow)
>   			goto error;
>   		dev_flow->flow = flow;
> @@ -5741,6 +5764,31 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
>   
>   #define MLX5_POOL_QUERY_FREQ_US 1000000
>   
> +/**
> + * Get the number of all valid pools.
> + *
> + * @param[in] sh
> + *   Pointer to mlx5_ibv_shared object.
> + *
> + * @return
> + *   The number of all valid pools.
> + */
> +static uint32_t
> +mlx5_get_all_valid_pool_count(struct mlx5_ibv_shared *sh)
> +{
> +	uint8_t age, i;
> +	uint32_t pools_n = 0;
> +	struct mlx5_pools_container *cont;
> +
> +	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
> +		for (i = 0; i < 2 ; ++i) {
> +			cont = MLX5_CNT_CONTAINER(sh, i, 0, age);
> +			pools_n += rte_atomic16_read(&cont->n_valid);
> +		}
> +	}
> +	return pools_n;
> +}
> +
>   /**
>    * Set the periodic procedure for triggering asynchronous batch queries for all
>    * the counter pools.
> @@ -5751,12 +5799,9 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
>   void
>   mlx5_set_query_alarm(struct mlx5_ibv_shared *sh)
>   {
> -	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(sh, 0, 0);
> -	uint32_t pools_n = rte_atomic16_read(&cont->n_valid);
> -	uint32_t us;
> +	uint32_t pools_n, us;
>   
> -	cont = MLX5_CNT_CONTAINER(sh, 1, 0);
> -	pools_n += rte_atomic16_read(&cont->n_valid);
> +	pools_n = mlx5_get_all_valid_pool_count(sh);
>   	us = MLX5_POOL_QUERY_FREQ_US / pools_n;
>   	DRV_LOG(DEBUG, "Set alarm for %u pools each %u us", pools_n, us);
>   	if (rte_eal_alarm_set(us, mlx5_flow_query_alarm, sh)) {
> @@ -5782,6 +5827,7 @@ mlx5_flow_query_alarm(void *arg)
>   	uint16_t offset;
>   	int ret;
>   	uint8_t batch = sh->cmng.batch;
> +	uint8_t age = sh->cmng.age;
>   	uint16_t pool_index = sh->cmng.pool_index;
>   	struct mlx5_pools_container *cont;
>   	struct mlx5_pools_container *mcont;
> @@ -5790,8 +5836,8 @@ mlx5_flow_query_alarm(void *arg)
>   	if (sh->cmng.pending_queries >= MLX5_MAX_PENDING_QUERIES)
>   		goto set_alarm;
>   next_container:
> -	cont = MLX5_CNT_CONTAINER(sh, batch, 1);
> -	mcont = MLX5_CNT_CONTAINER(sh, batch, 0);
> +	cont = MLX5_CNT_CONTAINER(sh, batch, 1, age);
> +	mcont = MLX5_CNT_CONTAINER(sh, batch, 0, age);
>   	/* Check if resize was done and need to flip a container. */
>   	if (cont != mcont) {
>   		if (cont->pools) {
> @@ -5801,15 +5847,22 @@ mlx5_flow_query_alarm(void *arg)
>   		}
>   		rte_cio_wmb();
>   		 /* Flip the host container. */
> -		sh->cmng.mhi[batch] ^= (uint8_t)2;
> +		sh->cmng.mhi[batch][age] ^= (uint8_t)2;
>   		cont = mcont;
>   	}
>   	if (!cont->pools) {
>   		/* 2 empty containers case is unexpected. */
> -		if (unlikely(batch != sh->cmng.batch))
> +		if (unlikely(batch != sh->cmng.batch) &&
> +			unlikely(age != sh->cmng.age)) {
>   			goto set_alarm;
> +		}
>   		batch ^= 0x1;
>   		pool_index = 0;
> +		if (batch == 0 && pool_index == 0) {
> +			age ^= 0x1;
> +			sh->cmng.batch = batch;
> +			sh->cmng.age = age;
> +		}
>   		goto next_container;
>   	}
>   	pool = cont->pools[pool_index];
> @@ -5852,13 +5905,76 @@ mlx5_flow_query_alarm(void *arg)
>   	if (pool_index >= rte_atomic16_read(&cont->n_valid)) {
>   		batch ^= 0x1;
>   		pool_index = 0;
> +		if (batch == 0 && pool_index == 0)
> +			age ^= 0x1;
>   	}
>   set_alarm:
>   	sh->cmng.batch = batch;
>   	sh->cmng.pool_index = pool_index;
> +	sh->cmng.age = age;
>   	mlx5_set_query_alarm(sh);
>   }
>   
> +/**
> + * Check and callback event for new aged flow in the counter pool
> + *
> + * @param[in] pool
> + *   Pointer to the current counter pool.
> + */
> +static void
> +mlx5_flow_aging_check(struct mlx5_flow_counter_pool *pool)
> +{
> +	struct mlx5_priv *priv;
> +	struct mlx5_flow_counter *cnt;
> +	struct mlx5_age_param *age_param;
> +	struct mlx5_counter_stats_raw *cur = pool->raw_hw;
> +	struct mlx5_counter_stats_raw *prev = pool->raw;
> +	uint16_t curr = rte_rdtsc() / (rte_get_tsc_hz() / 10);
> +	uint64_t port_mask = 0;
> +	uint32_t i;
> +
> +	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
> +		cnt = MLX5_POOL_GET_CNT(pool, i);
> +		age_param = MLX5_CNT_TO_AGE(cnt);
> +		if (rte_atomic16_read(&age_param->state) != AGE_CANDIDATE)
> +			continue;
> +		if (cur->data[i].hits != prev->data[i].hits) {
> +			age_param->expire = curr + age_param->timeout;
> +			continue;
> +		}
> +		if ((uint16_t)(curr - age_param->expire) >= (UINT16_MAX / 2))
> +			continue;
> +		/**
> +		 * Hold the lock first, or if between the
> +		 * state AGE_TMOUT and tailq operation the
> +		 * release happened, the release procedure
> +		 * may delete a non-existent tailq node.
> +		 */
> +		priv = rte_eth_devices[age_param->port_id].data->dev_private;
> +		rte_spinlock_lock(&priv->aged_sl);
> +		/* If the cmpset fails, a release has happened. */
> +		if (rte_atomic16_cmpset((volatile uint16_t *)
> +					&age_param->state,
> +					AGE_CANDIDATE,
> +					AGE_TMOUT) ==
> +					AGE_CANDIDATE) {
> +			TAILQ_INSERT_TAIL(&priv->aged_counters, cnt, next);
> +			port_mask |= (1ull << age_param->port_id);
> +		}
> +		rte_spinlock_unlock(&priv->aged_sl);
> +	}
> +	for (i = 0; i < 64; i++) {
> +		if (port_mask & (1ull << i)) {
> +			priv = rte_eth_devices[i].data->dev_private;
> +			if (!rte_atomic16_read(&priv->trigger_event))
> +				continue;
> +			_rte_eth_dev_callback_process(&rte_eth_devices[i],
> +				RTE_ETH_EVENT_FLOW_AGED, NULL);
> +			rte_atomic16_set(&priv->trigger_event, 0);
> +		}
> +	}
> +}
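
A note on the expiry test in mlx5_flow_aging_check() above: it relies on unsigned 16-bit wraparound, so it only works while the deadline is less than UINT16_MAX / 2 ticks away. Here is a minimal standalone sketch of that comparison, assuming 0.1 s ticks as in the patch. The `demo_*` names are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe "has the deadline passed?" test on 16-bit ticks.
 * Valid only while the deadline is less than UINT16_MAX / 2 ticks
 * ahead of the current time.
 */
static bool
demo_expired(uint16_t now, uint16_t expire)
{
	/* The cast makes the subtraction wrap modulo 2^16. */
	return (uint16_t)(now - expire) < (UINT16_MAX / 2);
}

/* Re-arm the deadline after a hit, as done when the counter value changed. */
static uint16_t
demo_rearm(uint16_t now, uint16_t timeout)
{
	assert(timeout < UINT16_MAX / 2);
	return (uint16_t)(now + timeout);
}
```

With now = 65530 and timeout = 20 the deadline wraps to 14. The flow is still reported as alive at 65530 and as aged once the tick counter itself wraps past 14.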
> +
>   /**
>    * Handler for the HW respond about ready values from an asynchronous batch
>    * query. This function is probably called by the host thread.
> @@ -5883,6 +5999,8 @@ mlx5_flow_async_pool_query_handle(struct mlx5_ibv_shared *sh,
>   		raw_to_free = pool->raw_hw;
>   	} else {
>   		raw_to_free = pool->raw;
> +		if (IS_AGE_POOL(pool))
> +			mlx5_flow_aging_check(pool);
>   		rte_spinlock_lock(&pool->sl);
>   		pool->raw = pool->raw_hw;
>   		rte_spinlock_unlock(&pool->sl);
> @@ -6034,3 +6152,40 @@ mlx5_flow_dev_dump(struct rte_eth_dev *dev,
>   	return mlx5_devx_cmd_flow_dump(sh->fdb_domain, sh->rx_domain,
>   				       sh->tx_domain, file);
>   }
> +
> +/**
> + * Get aged-out flows.
> + *
> + * @param[in] dev
> + *   Pointer to the Ethernet device structure.
> + * @param[in] contexts
> + *   The address of an array of pointers to the aged-out flow contexts.
> + * @param[in] nb_contexts
> + *   The length of the contexts array.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL. Initialized in case of
> + *   error only.
> + *
> + * @return
> + *   The number of contexts retrieved on success, otherwise a negative errno
> + *   value. If nb_contexts is 0, return the number of all aged contexts.
> + *   If nb_contexts is not 0, return the number of aged flows reported
> + *   in the contexts array.
> + */
> +int
> +mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
> +			uint32_t nb_contexts, struct rte_flow_error *error)
> +{
> +	const struct mlx5_flow_driver_ops *fops;
> +	struct rte_flow_attr attr = { .transfer = 0 };
> +
> +	if (flow_get_drv_type(dev, &attr) == MLX5_FLOW_TYPE_DV) {
> +		fops = flow_get_drv_ops(MLX5_FLOW_TYPE_DV);
> +		return fops->get_aged_flows(dev, contexts, nb_contexts,
> +						    error);
> +	}
> +	DRV_LOG(ERR,
> +		"port %u get aged flows is not supported.",
> +		 dev->data->port_id);
> +	return -ENOTSUP;
> +}
> diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
> index 2a1f59698c..bf1d5beb9b 100644
> --- a/drivers/net/mlx5/mlx5_flow.h
> +++ b/drivers/net/mlx5/mlx5_flow.h
> @@ -199,6 +199,7 @@ enum mlx5_feature_name {
>   #define MLX5_FLOW_ACTION_METER (1ull << 31)
>   #define MLX5_FLOW_ACTION_SET_IPV4_DSCP (1ull << 32)
>   #define MLX5_FLOW_ACTION_SET_IPV6_DSCP (1ull << 33)
> +#define MLX5_FLOW_ACTION_AGE (1ull << 34)
>   
>   #define MLX5_FLOW_FATE_ACTIONS \
>   	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | \
> @@ -650,6 +651,7 @@ struct mlx5_flow_verbs_workspace {
>   /** Device flow structure. */
>   struct mlx5_flow {
>   	struct rte_flow *flow; /**< Pointer to the main flow. */
> +	uint32_t flow_idx; /**< The memory pool index to the main flow. */
>   	uint64_t hash_fields; /**< Verbs hash Rx queue hash fields. */
>   	uint64_t act_flags;
>   	/**< Bit-fields of detected actions, see MLX5_FLOW_ACTION_*. */
> @@ -873,6 +875,11 @@ typedef int (*mlx5_flow_counter_query_t)(struct rte_eth_dev *dev,
>   					 uint32_t cnt,
>   					 bool clear, uint64_t *pkts,
>   					 uint64_t *bytes);
> +typedef int (*mlx5_flow_get_aged_flows_t)
> +					(struct rte_eth_dev *dev,
> +					 void **context,
> +					 uint32_t nb_contexts,
> +					 struct rte_flow_error *error);
>   struct mlx5_flow_driver_ops {
>   	mlx5_flow_validate_t validate;
>   	mlx5_flow_prepare_t prepare;
> @@ -888,13 +895,14 @@ struct mlx5_flow_driver_ops {
>   	mlx5_flow_counter_alloc_t counter_alloc;
>   	mlx5_flow_counter_free_t counter_free;
>   	mlx5_flow_counter_query_t counter_query;
> +	mlx5_flow_get_aged_flows_t get_aged_flows;
>   };
>   
>   
> -#define MLX5_CNT_CONTAINER(sh, batch, thread) (&(sh)->cmng.ccont \
> -	[(((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
> -#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread) (&(sh)->cmng.ccont \
> -	[(~((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
> +#define MLX5_CNT_CONTAINER(sh, batch, thread, age) (&(sh)->cmng.ccont \
> +	[(((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
> +#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread, age) (&(sh)->cmng.ccont \
> +	[(~((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
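
For anyone checking the container selection above: as far as I can tell, bit 0 of mhi is the master thread's view and bit 1 is the host thread's view, and the new [age] dimension is indexed independently. Below is a minimal standalone sketch of the first-dimension index arithmetic only, using hypothetical `demo_*` helpers rather than the driver macros.

```c
#include <assert.h>
#include <stdint.h>

/* thread == 0 reads bit 0 (master view), thread == 1 reads bit 1 (host view). */
static unsigned int
demo_container_idx(uint8_t mhi, unsigned int batch, unsigned int thread)
{
	assert(batch <= 1 && thread <= 1);
	return ((mhi >> thread) & 0x1u) * 2 + batch;
}

/* The other half of the double buffer for the same batch value. */
static unsigned int
demo_container_unused_idx(uint8_t mhi, unsigned int batch, unsigned int thread)
{
	assert(batch <= 1 && thread <= 1);
	return (~(unsigned int)(mhi >> thread) & 0x1u) * 2 + batch;
}
```

After a resize the master flips bit 0 (mhi ^= 1), so the master and host indexes differ until the host notices and flips bit 1 (mhi ^= 2).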
>   
>   /* mlx5_flow.c */
>   
> diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
> index 784a62c521..73a5f477f8 100644
> --- a/drivers/net/mlx5/mlx5_flow_dv.c
> +++ b/drivers/net/mlx5/mlx5_flow_dv.c
> @@ -24,6 +24,7 @@
>   #include <rte_flow.h>
>   #include <rte_flow_driver.h>
>   #include <rte_malloc.h>
> +#include <rte_cycles.h>
>   #include <rte_ip.h>
>   #include <rte_gre.h>
>   #include <rte_vxlan.h>
> @@ -3719,6 +3720,50 @@ mlx5_flow_validate_action_meter(struct rte_eth_dev *dev,
>   	return 0;
>   }
>   
> +/**
> + * Validate the age action.
> + *
> + * @param[in] action_flags
> + *   Holds the actions detected until now.
> + * @param[in] action
> + *   Pointer to the age action.
> + * @param[in] dev
> + *   Pointer to the Ethernet device structure.
> + * @param[out] error
> + *   Pointer to error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +static int
> +flow_dv_validate_action_age(uint64_t action_flags,
> +			    const struct rte_flow_action *action,
> +			    struct rte_eth_dev *dev,
> +			    struct rte_flow_error *error)
> +{
> +	struct mlx5_priv *priv = dev->data->dev_private;
> +	const struct rte_flow_action_age *age = action->conf;
> +
> +	if (!priv->config.devx)
> +		return rte_flow_error_set(error, ENOTSUP,
> +					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +					  NULL,
> +					  "age action not supported");
> +	if (!(action->conf))
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "configuration cannot be null");
> +	if (age->timeout >= UINT16_MAX / 2 / 10)
> +		return rte_flow_error_set(error, ENOTSUP,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "Max age time: 3275 seconds");
> +	if (action_flags & MLX5_FLOW_ACTION_AGE)
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, NULL,
> +					  "Duplicate age actions set");
> +	return 0;
> +}
> +
>   /**
>    * Validate the modify-header IPv4 DSCP actions.
>    *
> @@ -3896,14 +3941,16 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct mlx5_pools_container *cont;
>   	struct mlx5_flow_counter_pool *pool;
> -	uint32_t batch = 0;
> +	uint32_t batch = 0, age = 0;
>   
>   	idx--;
> +	age = MLX_CNT_IS_AGE(idx);
> +	idx = age ? idx - MLX5_CNT_AGE_OFFSET : idx;
>   	if (idx >= MLX5_CNT_BATCH_OFFSET) {
>   		idx -= MLX5_CNT_BATCH_OFFSET;
>   		batch = 1;
>   	}
> -	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
> +	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
>   	MLX5_ASSERT(idx / MLX5_COUNTERS_PER_POOL < cont->n);
>   	pool = cont->pools[idx / MLX5_COUNTERS_PER_POOL];
>   	MLX5_ASSERT(pool);
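
To make the index decoding above easier to follow, here is a minimal standalone sketch. It assumes the age flag is simply added on top of the index built by MLX5_MAKE_CNT_IDX(). The value used for the batch offset is an assumption, since MLX5_CNT_BATCH_OFFSET is not defined in this patch, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_AGE_OFFSET 0x80000000u    /* MLX5_CNT_AGE_OFFSET in the patch */
#define DEMO_BATCH_OFFSET 0x800000u    /* assumed value, not from this patch */
#define DEMO_COUNTERS_PER_POOL 512u    /* MLX5_COUNTERS_PER_POOL */

struct demo_loc {
	unsigned int age;   /* 1 if the counter lives in an aging pool */
	unsigned int batch; /* 1 if the counter was batch-allocated */
	uint32_t pool;      /* pool index inside the container */
	uint32_t offset;    /* counter offset inside the pool */
};

/* Decode a 1-based counter index in the same order as the quoted code. */
static struct demo_loc
demo_decode(uint32_t idx)
{
	struct demo_loc loc = { 0 };

	assert(idx != 0); /* 0 means "no counter" */
	idx--;
	loc.age = (idx & DEMO_AGE_OFFSET) ? 1 : 0;
	if (loc.age)
		idx -= DEMO_AGE_OFFSET;
	if (idx >= DEMO_BATCH_OFFSET) {
		idx -= DEMO_BATCH_OFFSET;
		loc.batch = 1;
	}
	loc.pool = idx / DEMO_COUNTERS_PER_POOL;
	loc.offset = idx % DEMO_COUNTERS_PER_POOL;
	return loc;
}
```

Note that the age bit has to be stripped before the batch comparison, otherwise every aging counter would look like a batch counter.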
> @@ -4023,18 +4070,21 @@ flow_dv_create_counter_stat_mem_mng(struct rte_eth_dev *dev, int raws_n)
>    *   Pointer to the Ethernet device structure.
>    * @param[in] batch
>    *   Whether the pool is for counter that was allocated by batch command.
> + * @param[in] age
> + *   Whether the pool is for Aging counter.
>    *
>    * @return
>    *   The new container pointer on success, otherwise NULL and rte_errno is set.
>    */
>   static struct mlx5_pools_container *
> -flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
> +flow_dv_container_resize(struct rte_eth_dev *dev,
> +				uint32_t batch, uint32_t age)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct mlx5_pools_container *cont =
> -			MLX5_CNT_CONTAINER(priv->sh, batch, 0);
> +			MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
>   	struct mlx5_pools_container *new_cont =
> -			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0);
> +			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0, age);
>   	struct mlx5_counter_stats_mem_mng *mem_mng = NULL;
>   	uint32_t resize = cont->n + MLX5_CNT_CONTAINER_RESIZE;
>   	uint32_t mem_size = sizeof(struct mlx5_flow_counter_pool *) * resize;
> @@ -4042,7 +4092,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
>   
>   	/* Fallback mode has no background thread. Skip the check. */
>   	if (!priv->counter_fallback &&
> -	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1)) {
> +	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1, age)) {
>   		/* The last resize still hasn't detected by the host thread. */
>   		rte_errno = EAGAIN;
>   		return NULL;
> @@ -4085,7 +4135,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
>   	new_cont->init_mem_mng = mem_mng;
>   	rte_cio_wmb();
>   	 /* Flip the master container. */
> -	priv->sh->cmng.mhi[batch] ^= (uint8_t)1;
> +	priv->sh->cmng.mhi[batch][age] ^= (uint8_t)1;
>   	return new_cont;
>   }
>   
> @@ -4117,7 +4167,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
>   	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
>   	MLX5_ASSERT(pool);
>   	if (counter < MLX5_CNT_BATCH_OFFSET) {
> -		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
> +		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
>   		if (priv->counter_fallback)
>   			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
>   					0, pkts, bytes, 0, NULL, NULL, 0);
> @@ -4150,6 +4200,8 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
>    *   The devX counter handle.
>    * @param[in] batch
>    *   Whether the pool is for counter that was allocated by batch command.
> + * @param[in] age
> + *   Whether the pool is for counter that was allocated for aging.
>    * @param[in/out] cont_cur
>    *   Pointer to the container pointer, it will be update in pool resize.
>    *
> @@ -4158,24 +4210,23 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
>    */
>   static struct mlx5_pools_container *
>   flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
> -		    uint32_t batch)
> +		    uint32_t batch, uint32_t age)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct mlx5_flow_counter_pool *pool;
>   	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
> -							       0);
> +							       0, age);
>   	int16_t n_valid = rte_atomic16_read(&cont->n_valid);
> -	uint32_t size;
> +	uint32_t size = sizeof(*pool);
>   
>   	if (cont->n == n_valid) {
> -		cont = flow_dv_container_resize(dev, batch);
> +		cont = flow_dv_container_resize(dev, batch, age);
>   		if (!cont)
>   			return NULL;
>   	}
> -	size = sizeof(*pool);
>   	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
> -	if (!batch)
> -		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
> +	size += (batch ? 0 : MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE);
> +	size += (!age ? 0 : MLX5_COUNTERS_PER_POOL * AGE_SIZE);
>   	pool = rte_calloc(__func__, 1, size, 0);
>   	if (!pool) {
>   		rte_errno = ENOMEM;
> @@ -4187,8 +4238,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
>   						     MLX5_CNT_CONTAINER_RESIZE;
>   	pool->raw_hw = NULL;
>   	pool->type = 0;
> -	if (!batch)
> -		pool->type |= CNT_POOL_TYPE_EXT;
> +	pool->type |= (batch ? 0 :  CNT_POOL_TYPE_EXT);
> +	pool->type |= (!age ? 0 :  CNT_POOL_TYPE_AGE);
>   	rte_spinlock_init(&pool->sl);
>   	/*
>   	 * The generation of the new allocated counters in this pool is 0, 2 in
> @@ -4215,6 +4266,39 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
>   	return cont;
>   }
>   
> +/**
> + * Update the minimum dcs-id for aged or no-aged counter pool.
> + *
> + * @param[in] dev
> + *   Pointer to the Ethernet device structure.
> + * @param[in] pool
> + *   Current counter pool.
> + * @param[in] batch
> + *   Whether the pool is for counter that was allocated by batch command.
> + * @param[in] age
> + *   Whether the counter is for aging.
> + */
> +static void
> +flow_dv_counter_update_min_dcs(struct rte_eth_dev *dev,
> +			struct mlx5_flow_counter_pool *pool,
> +			uint32_t batch, uint32_t age)
> +{
> +	struct mlx5_priv *priv = dev->data->dev_private;
> +	struct mlx5_flow_counter_pool *other;
> +	struct mlx5_pools_container *cont;
> +
> +	cont = MLX5_CNT_CONTAINER(priv->sh,	batch, 0, (age ^ 0x1));
Too much whitespace after the comma: there is a tab between "priv->sh,"
and "batch".
> +	other = flow_dv_find_pool_by_id(cont, pool->min_dcs->id);
> +	if (!other)
> +		return;
> +	if (pool->min_dcs->id < other->min_dcs->id) {
> +		rte_atomic64_set(&other->a64_dcs,
> +			rte_atomic64_read(&pool->a64_dcs));
> +	} else {
> +		rte_atomic64_set(&pool->a64_dcs,
> +			rte_atomic64_read(&other->a64_dcs));
> +	}
> +}
>   /**
>    * Prepare a new counter and/or a new counter pool.
>    *
> @@ -4224,6 +4308,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
>    *   Where to put the pointer of a new counter.
>    * @param[in] batch
>    *   Whether the pool is for counter that was allocated by batch command.
> + * @param[in] age
> + *   Whether the pool is for counter that was allocated for aging.
>    *
>    * @return
>    *   The counter container pointer and @p cnt_free is set on success,
> @@ -4232,7 +4318,7 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
>   static struct mlx5_pools_container *
>   flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
>   			     struct mlx5_flow_counter **cnt_free,
> -			     uint32_t batch)
> +			     uint32_t batch, uint32_t age)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct mlx5_pools_container *cont;
> @@ -4241,7 +4327,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
>   	struct mlx5_flow_counter *cnt;
>   	uint32_t i;
>   
> -	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
> +	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
>   	if (!batch) {
>   		/* bulk_bitmap must be 0 for single counter allocation. */
>   		dcs = mlx5_devx_cmd_flow_counter_alloc(priv->sh->ctx, 0);
> @@ -4249,7 +4335,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
>   			return NULL;
>   		pool = flow_dv_find_pool_by_id(cont, dcs->id);
>   		if (!pool) {
> -			cont = flow_dv_pool_create(dev, dcs, batch);
> +			cont = flow_dv_pool_create(dev, dcs, batch, age);
>   			if (!cont) {
>   				mlx5_devx_cmd_destroy(dcs);
>   				return NULL;
> @@ -4259,6 +4345,8 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
>   			rte_atomic64_set(&pool->a64_dcs,
>   					 (int64_t)(uintptr_t)dcs);
>   		}
> +		flow_dv_counter_update_min_dcs(dev,
> +						pool, batch, age);

The "else if" branch above already updates min_dcs, and this function's
name says it updates min_dcs as well, so it would be better to do both
updates in one function.

Alternatively, rename the function so that it clearly indicates it
updates the "other" pool with the same id.

I don't insist on this.

>   		i = dcs->id % MLX5_COUNTERS_PER_POOL;
>   		cnt = MLX5_POOL_GET_CNT(pool, i);
>   		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
> @@ -4273,7 +4361,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
>   		rte_errno = ENODATA;
>   		return NULL;
>   	}
> -	cont = flow_dv_pool_create(dev, dcs, batch);
> +	cont = flow_dv_pool_create(dev, dcs, batch, age);
>   	if (!cont) {
>   		mlx5_devx_cmd_destroy(dcs);
>   		return NULL;
> @@ -4334,13 +4422,15 @@ flow_dv_counter_shared_search(struct mlx5_pools_container *cont, uint32_t id,
>    *   Counter identifier.
>    * @param[in] group
>    *   Counter flow group.
> + * @param[in] age
> + *   Whether the counter was allocated for aging.
>    *
>    * @return
>    *   Index to flow counter on success, 0 otherwise and rte_errno is set.
>    */
>   static uint32_t
>   flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
> -		      uint16_t group)
> +		      uint16_t group, uint32_t age)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
>   	struct mlx5_flow_counter_pool *pool = NULL;
> @@ -4356,7 +4446,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
>   	 */
>   	uint32_t batch = (group && !shared && !priv->counter_fallback) ? 1 : 0;
>   	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
> -							       0);
> +							       0, age);
>   	uint32_t cnt_idx;
>   
>   	if (!priv->config.devx) {
> @@ -4395,13 +4485,13 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
>   		cnt_free = NULL;
>   	}
>   	if (!cnt_free) {
> -		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch);
> +		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch, age);
>   		if (!cont)
>   			return 0;
>   		pool = TAILQ_FIRST(&cont->pool_list);
>   	}
>   	if (!batch)
> -		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
> +		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
>   	/* Create a DV counter action only in the first time usage. */
>   	if (!cnt_free->action) {
>   		uint16_t offset;
> @@ -4424,6 +4514,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
>   	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
>   				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
>   	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
> +	cnt_idx += age * MLX5_CNT_AGE_OFFSET;
>   	/* Update the counter reset values. */
>   	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
>   				 &cnt_free->bytes))
> @@ -4445,6 +4536,62 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
>   	return cnt_idx;
>   }
>   
> +/**
> + * Get age param from counter index.
> + *
> + * @param[in] dev
> + *   Pointer to the Ethernet device structure.
> + * @param[in] counter
> + *   Index to the counter handler.
> + *
> + * @return
> + *   The aging parameter specified for the counter index.
> + */
> +static struct mlx5_age_param*
> +flow_dv_counter_idx_get_age(struct rte_eth_dev *dev,
> +				uint32_t counter)
> +{
> +	struct mlx5_flow_counter *cnt;
> +	struct mlx5_flow_counter_pool *pool = NULL;
> +
> +	flow_dv_counter_get_by_idx(dev, counter, &pool);
> +	counter = (counter - 1) % MLX5_COUNTERS_PER_POOL;
> +	cnt = MLX5_POOL_GET_CNT(pool, counter);
> +	return MLX5_CNT_TO_AGE(cnt);
> +}
> +
> +/**
> + * Remove a flow counter from aged counter list.
> + *
> + * @param[in] dev
> + *   Pointer to the Ethernet device structure.
> + * @param[in] counter
> + *   Index to the counter handler.
> + * @param[in] cnt
> + *   Pointer to the counter handler.
> + */
> +static void
> +flow_dv_counter_remove_from_age(struct rte_eth_dev *dev,
> +				uint32_t counter, struct mlx5_flow_counter *cnt)
> +{
> +	struct mlx5_age_param *age_param;
> +	struct mlx5_priv *priv = dev->data->dev_private;
> +
> +	age_param = flow_dv_counter_idx_get_age(dev, counter);
> +	if (rte_atomic16_cmpset((volatile uint16_t *)
> +			&age_param->state,
> +			AGE_CANDIDATE, AGE_FREE)
> +			!= AGE_CANDIDATE) {
> +		/**
> +		 * We need the lock even it is age timeout,
> +		 * since counter may still in process.
> +		 */
> +		rte_spinlock_lock(&priv->aged_sl);
> +		TAILQ_REMOVE(&priv->aged_counters, cnt, next);
> +		rte_spinlock_unlock(&priv->aged_sl);
> +	}
> +	rte_atomic16_set(&age_param->state, AGE_FREE);
> +}
>   /**
>    * Release a flow counter.
>    *
> @@ -4465,10 +4612,12 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
>   	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
>   	MLX5_ASSERT(pool);
>   	if (counter < MLX5_CNT_BATCH_OFFSET) {
> -		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
> +		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
>   		if (cnt_ext && --cnt_ext->ref_cnt)
>   			return;
>   	}
> +	if (IS_AGE_POOL(pool))
> +		flow_dv_counter_remove_from_age(dev, counter, cnt);
>   	/* Put the counter in the end - the last updated one. */
>   	TAILQ_INSERT_TAIL(&pool->counters, cnt, next);
>   	/*
> @@ -5243,6 +5392,15 @@ flow_dv_validate(struct rte_eth_dev *dev, const struct rte_flow_attr *attr,
>   			/* Meter action will add one more TAG action. */
>   			rw_act_num += MLX5_ACT_NUM_SET_TAG;
>   			break;
> +		case RTE_FLOW_ACTION_TYPE_AGE:
> +			ret = flow_dv_validate_action_age(action_flags,
> +							  actions, dev,
> +							  error);
> +			if (ret < 0)
> +				return ret;
> +			action_flags |= MLX5_FLOW_ACTION_AGE;
> +			++actions_n;
> +			break;
>   		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP:
>   			ret = flow_dv_validate_action_modify_ipv4_dscp
>   							 (action_flags,
> @@ -7281,6 +7439,54 @@ flow_dv_translate_action_port_id(struct rte_eth_dev *dev,
>   	return 0;
>   }
>   
> +/**
> + * Create a counter with aging configuration.
> + *
> + * @param[in] dev
> + *   Pointer to rte_eth_dev structure.
> + * @param[out] count
> + *   Pointer to the counter action configuration.
> + * @param[in] age
> + *   Pointer to the aging action configuration.
> + *
> + * @return
> + *   Index to flow counter on success, 0 otherwise.
> + */
> +static uint32_t
> +flow_dv_translate_create_counter(struct rte_eth_dev *dev,
> +				struct mlx5_flow *dev_flow,
> +				const struct rte_flow_action_count *count,
> +				const struct rte_flow_action_age *age)
> +{
> +	uint32_t counter;
> +	struct mlx5_age_param *age_param;
> +
> +	counter = flow_dv_counter_alloc(dev,
> +				count ? count->shared : 0,
> +				count ? count->id : 0,
> +				dev_flow->dv.group, !!age);
> +
> +	if (!counter || age == NULL)
> +		return counter;
> +	age_param  = flow_dv_counter_idx_get_age(dev, counter);
> +	/*
> +	 * The counter age accuracy may have a bit delay. Have 3/4
> +	 * second bias on the timeount in order to let it age in time.
> +	 */
> +	age_param->context = age->context ? age->context :
> +		(void *)(uintptr_t)(dev_flow->flow_idx);
> +	/*
> +	 * The counter age accuracy may have a bit delay. Have 3/4
> +	 * second bias on the timeount in order to let it age in time.
> +	 */
> +	age_param->timeout = age->timeout * 10 - 7;
> +	/* Set expire time in unit of 0.1 sec. */
> +	age_param->port_id = dev->data->port_id;
> +	age_param->expire = age_param->timeout +
> +			rte_rdtsc() / (rte_get_tsc_hz() / 10);
> +	rte_atomic16_set(&age_param->state, AGE_CANDIDATE);
> +	return counter;
> +}
>   /**
>    * Add Tx queue matcher
>    *
> @@ -7450,6 +7656,8 @@ __flow_dv_translate(struct rte_eth_dev *dev,
>   			    (MLX5_MAX_MODIFY_NUM + 1)];
>   	} mhdr_dummy;
>   	struct mlx5_flow_dv_modify_hdr_resource *mhdr_res = &mhdr_dummy.res;
> +	const struct rte_flow_action_count *count = NULL;
> +	const struct rte_flow_action_age *age = NULL;
>   	union flow_dv_attr flow_attr = { .attr = 0 };
>   	uint32_t tag_be;
>   	union mlx5_flow_tbl_key tbl_key;
> @@ -7478,7 +7686,6 @@ __flow_dv_translate(struct rte_eth_dev *dev,
>   		const struct rte_flow_action_queue *queue;
>   		const struct rte_flow_action_rss *rss;
>   		const struct rte_flow_action *action = actions;
> -		const struct rte_flow_action_count *count = action->conf;
>   		const uint8_t *rss_key;
>   		const struct rte_flow_action_jump *jump_data;
>   		const struct rte_flow_action_meter *mtr;
> @@ -7607,36 +7814,21 @@ __flow_dv_translate(struct rte_eth_dev *dev,
>   			action_flags |= MLX5_FLOW_ACTION_RSS;
>   			dev_flow->handle->fate_action = MLX5_FLOW_FATE_QUEUE;
>   			break;
> +		case RTE_FLOW_ACTION_TYPE_AGE:
>   		case RTE_FLOW_ACTION_TYPE_COUNT:
>   			if (!dev_conf->devx) {
> -				rte_errno = ENOTSUP;
> -				goto cnt_err;
> -			}
> -			flow->counter = flow_dv_counter_alloc(dev,
> -							count->shared,
> -							count->id,
> -							dev_flow->dv.group);
> -			if (!flow->counter)
> -				goto cnt_err;
> -			dev_flow->dv.actions[actions_n++] =
> -				  (flow_dv_counter_get_by_idx(dev,
> -				  flow->counter, NULL))->action;
> -			action_flags |= MLX5_FLOW_ACTION_COUNT;
> -			break;
> -cnt_err:
> -			if (rte_errno == ENOTSUP)
>   				return rte_flow_error_set
>   					      (error, ENOTSUP,
>   					       RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
>   					       NULL,
>   					       "count action not supported");
> +			}
> +			/* Save information first, will apply later. */
> +			if (actions->type == RTE_FLOW_ACTION_TYPE_COUNT)
> +				count = action->conf;
>   			else
> -				return rte_flow_error_set
> -						(error, rte_errno,
> -						 RTE_FLOW_ERROR_TYPE_ACTION,
> -						 action,
> -						 "cannot create counter"
> -						  " object.");
> +				age = action->conf;
> +			action_flags |= MLX5_FLOW_ACTION_COUNT;
>   			break;
>   		case RTE_FLOW_ACTION_TYPE_OF_POP_VLAN:
>   			dev_flow->dv.actions[actions_n++] =
> @@ -7909,6 +8101,22 @@ __flow_dv_translate(struct rte_eth_dev *dev,
>   				dev_flow->dv.actions[modify_action_position] =
>   					handle->dvh.modify_hdr->verbs_action;
>   			}
> +			if (action_flags & MLX5_FLOW_ACTION_COUNT) {
> +				flow->counter =
> +					flow_dv_translate_create_counter(dev,
> +						dev_flow, count, age);
> +
> +				if (!flow->counter)
> +					return rte_flow_error_set
> +						(error, rte_errno,
> +						RTE_FLOW_ERROR_TYPE_ACTION,
> +						NULL,
> +						"cannot create counter"
> +						" object.");
> +				dev_flow->dv.actions[actions_n++] =
> +					  (flow_dv_counter_get_by_idx(dev,
> +					  flow->counter, NULL))->action;
> +			}
>   			break;
>   		default:
>   			break;
> @@ -9169,6 +9377,58 @@ flow_dv_counter_query(struct rte_eth_dev *dev, uint32_t counter, bool clear,
>   	return 0;
>   }
>   
> +/**
> + * Get aged-out flows.
> + *
> + * @param[in] dev
> + *   Pointer to the Ethernet device structure.
> + * @param[in] context
> + *   The address of an array of pointers to the aged-out flows contexts.
> + * @param[in] nb_contexts
> + *   The length of context array pointers.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL. Initialized in case of
> + *   error only.
> + *
> + * @return
> + *   how many contexts get in success, otherwise negative errno value.
> + *   if nb_contexts is 0, return the amount of all aged contexts.
> + *   if nb_contexts is not 0 , return the amount of aged flows reported
> + *   in the context array.
> + * @note: only stub for now
> + */
> +static int
> +flow_get_aged_flows(struct rte_eth_dev *dev,
> +		    void **context,
> +		    uint32_t nb_contexts,
> +		    struct rte_flow_error *error)
> +{
> +	struct mlx5_priv *priv = dev->data->dev_private;
> +	struct mlx5_age_param *age_param;
> +	struct mlx5_flow_counter *counter;
> +	int nb_flows = 0;
> +
> +	if (nb_contexts && !context)
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +					  NULL,
> +					  "Should assign at least one flow or"
> +					  " context to get if nb_contexts != 0");
> +	rte_spinlock_lock(&priv->aged_sl);
> +	TAILQ_FOREACH(counter, &priv->aged_counters, next) {
> +		nb_flows++;
> +		if (nb_contexts) {
> +			age_param = MLX5_CNT_TO_AGE(counter);
> +			context[nb_flows - 1] = age_param->context;
> +			if (!(--nb_contexts))
> +				break;
> +		}
> +	}
> +	rte_spinlock_unlock(&priv->aged_sl);
> +	rte_atomic16_set(&priv->trigger_event, 1);
> +	return nb_flows;
> +}
> +
>   /*
>    * Mutex-protected thunk to lock-free  __flow_dv_translate().
>    */
> @@ -9235,7 +9495,7 @@ flow_dv_counter_allocate(struct rte_eth_dev *dev)
>   	uint32_t cnt;
>   
>   	flow_dv_shared_lock(dev);
> -	cnt = flow_dv_counter_alloc(dev, 0, 0, 1);
> +	cnt = flow_dv_counter_alloc(dev, 0, 0, 1, 0);
>   	flow_dv_shared_unlock(dev);
>   	return cnt;
>   }
> @@ -9266,6 +9526,7 @@ const struct mlx5_flow_driver_ops mlx5_flow_dv_drv_ops = {
>   	.counter_alloc = flow_dv_counter_allocate,
>   	.counter_free = flow_dv_counter_free,
>   	.counter_query = flow_dv_counter_query,
> +	.get_aged_flows = flow_get_aged_flows,
>   };
>   
>   #endif /* HAVE_IBV_FLOW_DV_SUPPORT */
> diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
> index 236d665852..7efd97f547 100644
> --- a/drivers/net/mlx5/mlx5_flow_verbs.c
> +++ b/drivers/net/mlx5/mlx5_flow_verbs.c
> @@ -56,7 +56,8 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
>   			      struct mlx5_flow_counter_pool **ppool)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
> +	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
> +									0);
>   	struct mlx5_flow_counter_pool *pool;
>   
>   	idx--;
> @@ -151,7 +152,8 @@ static uint32_t
>   flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
>   {
>   	struct mlx5_priv *priv = dev->data->dev_private;
> -	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
> +	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
> +									0);
>   	struct mlx5_flow_counter_pool *pool = NULL;
>   	struct mlx5_flow_counter_ext *cnt_ext = NULL;
>   	struct mlx5_flow_counter *cnt = NULL;
> @@ -251,7 +253,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
>   
>   	cnt = flow_verbs_counter_get_by_idx(dev, counter,
>   					    &pool);
> -	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
> +	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
>   	if (--cnt_ext->ref_cnt == 0) {
>   #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
>   		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
> @@ -282,7 +284,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
>   		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
>   						(dev, flow->counter, &pool);
>   		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
> -						(cnt);
> +						(pool, cnt);
>   		struct rte_flow_query_count *qc = data;
>   		uint64_t counters[2] = {0, 0};
>   #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
> @@ -1083,12 +1085,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
>   	}
>   #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
>   	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
> -	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
> +	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
>   	counter.counter_set_handle = cnt_ext->cs->handle;
>   	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
>   #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
>   	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
> -	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
> +	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
>   	counter.counters = cnt_ext->cs;
>   	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
>   #endif




* [dpdk-dev] [PATCH v3 0/2] net/mlx5: support flow aging
  2020-04-24 10:45     ` [dpdk-dev] [PATCH v2 0/2] " Bill Zhou
  2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
  2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: support flow aging Bill Zhou
@ 2020-04-29  2:25       ` Bill Zhou
  2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
                           ` (2 more replies)
  2 siblings, 3 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-29  2:25 UTC (permalink / raw)
  To: matan, orika, shahafs, viacheslavo, marko.kovacevic, john.mcnamara; +Cc: dev

These patches implement flow aging for the mlx5 driver. The first patch
changes how the additional per-counter memory is allocated, so that the
additional memory of any counter can be located with a simple offset. The
second patch implements the aging check and the age-out event callback
mechanism for the mlx5 driver.


Bill Zhou (2):
  net/mlx5: modify ext-counter memory allocation
  net/mlx5: support flow aging

 doc/guides/rel_notes/release_20_05.rst |   1 +
 drivers/net/mlx5/mlx5.c                |  93 ++++--
 drivers/net/mlx5/mlx5.h                |  79 +++++-
 drivers/net/mlx5/mlx5_flow.c           | 205 ++++++++++++--
 drivers/net/mlx5/mlx5_flow.h           |  16 +-
 drivers/net/mlx5/mlx5_flow_dv.c        | 373 +++++++++++++++++++++----
 drivers/net/mlx5/mlx5_flow_verbs.c     |  16 +-
 7 files changed, 655 insertions(+), 128 deletions(-)

-- 
2.21.0



* [dpdk-dev] [PATCH v3 1/2] net/mlx5: modify ext-counter memory allocation
  2020-04-29  2:25       ` [dpdk-dev] [PATCH v3 0/2] " Bill Zhou
@ 2020-04-29  2:25         ` Bill Zhou
  2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: support flow aging Bill Zhou
  2020-05-03  7:41         ` [dpdk-dev] [PATCH v3 0/2] " Matan Azrad
  2 siblings, 0 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-29  2:25 UTC (permalink / raw)
  To: matan, orika, shahafs, viacheslavo, marko.kovacevic, john.mcnamara; +Cc: dev

Currently, a counter pool for non-batch counters needs memory for 512
ext-counters. This memory is allocated in one block, placed behind the
memory of the 512 basic counters. That makes it hard to get an
ext-counter pointer from the corresponding basic-counter pointer, and
hard to add other types of per-counter memory in the future.

So, allocate each ext-counter together with its basic counter, as a
single piece of memory. Any further additional type of counter memory
will be handled the same way. One piece of memory then contains all the
memory types of one counter, and each type is easy to reach by using an
offset.
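
The layout described above can be sketched in a few lines of standalone C.
This is only an illustration under stated assumptions, not the driver
code: the demo_* types, fields, and helpers are hypothetical stand-ins
for mlx5_flow_counter, mlx5_flow_counter_ext, and the pool macros. Each
record holds the basic counter immediately followed by its optional
extension, so the extension is found by stepping past the counter and
the array index is recovered by dividing the byte offset by the record
length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures (names are assumptions). */
struct demo_cnt { uint64_t hits; uint64_t bytes; };
struct demo_cnt_ext { uint32_t ref_cnt; uint32_t id; };
struct demo_pool {
	uint32_t has_ext; /* 1 when every record carries an extension. */
	uint32_t n;       /* Number of records stored behind the header. */
};

/* Length of one record: the counter plus its optional extension. */
static size_t
demo_cnt_len(const struct demo_pool *pool)
{
	return sizeof(struct demo_cnt) +
	       (pool->has_ext ? sizeof(struct demo_cnt_ext) : 0);
}

/* Allocate the pool header and all records as one zeroed block. */
static struct demo_pool *
demo_pool_create(uint32_t n, uint32_t has_ext)
{
	struct demo_pool hdr = { .has_ext = has_ext, .n = n };
	struct demo_pool *pool;

	pool = calloc(1, sizeof(*pool) + n * demo_cnt_len(&hdr));
	if (pool)
		*pool = hdr;
	return pool;
}

/* Record idx starts idx record-lengths behind the pool header. */
static struct demo_cnt *
demo_pool_get_cnt(struct demo_pool *pool, uint32_t idx)
{
	assert(idx < pool->n);
	return (struct demo_cnt *)
	       ((uint8_t *)(pool + 1) + idx * demo_cnt_len(pool));
}

/* The extension sits directly behind its counter (ext pools only). */
static struct demo_cnt_ext *
demo_cnt_to_ext(struct demo_cnt *cnt)
{
	return (struct demo_cnt_ext *)(cnt + 1);
}

/* Recover the array index of a counter from its address. */
static uint32_t
demo_cnt_array_idx(struct demo_pool *pool, struct demo_cnt *cnt)
{
	return (uint32_t)(((uint8_t *)cnt - (uint8_t *)(pool + 1)) /
			  demo_cnt_len(pool));
}
```

With this arrangement, adding another per-counter memory type only
changes the record length; the index and pointer arithmetic stay the
same.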

Signed-off-by: Bill Zhou <dongz@mellanox.com>
---
v2: Update some comments for new adding fields.
v3: Update some macro definitions.
---
 drivers/net/mlx5/mlx5.c            |  4 ++--
 drivers/net/mlx5/mlx5.h            | 23 +++++++++++++++++------
 drivers/net/mlx5/mlx5_flow_dv.c    | 27 +++++++++++++++------------
 drivers/net/mlx5/mlx5_flow_verbs.c | 16 ++++++++--------
 4 files changed, 42 insertions(+), 28 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index cc13e447d6..57d76cb741 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -505,10 +505,10 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 					(mlx5_devx_cmd_destroy(pool->min_dcs));
 			}
 			for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
-				if (pool->counters_raw[j].action)
+				if (MLX5_POOL_GET_CNT(pool, j)->action)
 					claim_zero
 					(mlx5_glue->destroy_flow_action
-					       (pool->counters_raw[j].action));
+					 (MLX5_POOL_GET_CNT(pool, j)->action));
 				if (!batch && MLX5_GET_POOL_CNT_EXT
 				    (pool, j)->dcs)
 					claim_zero(mlx5_devx_cmd_destroy
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 50349abf34..4d9984f603 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -222,6 +222,19 @@ struct mlx5_drop {
 #define MLX5_COUNTERS_PER_POOL 512
 #define MLX5_MAX_PENDING_QUERIES 4
 #define MLX5_CNT_CONTAINER_RESIZE 64
+#define CNT_SIZE (sizeof(struct mlx5_flow_counter))
+#define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
+
+#define CNT_POOL_TYPE_EXT	(1 << 0)
+#define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
+#define MLX5_CNT_LEN(pool) \
+	(CNT_SIZE + (IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
+#define MLX5_POOL_GET_CNT(pool, index) \
+	((struct mlx5_flow_counter *) \
+	((uint8_t *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
+#define MLX5_CNT_ARRAY_IDX(pool, cnt) \
+	((int)(((uint8_t *)(cnt) - (uint8_t *)((pool) + 1)) / \
+	MLX5_CNT_LEN(pool)))
 /*
  * The pool index and offset of counter in the pool array makes up the
  * counter index. In case the counter is from pool 0 and offset 0, it
@@ -230,11 +243,10 @@ struct mlx5_drop {
  */
 #define MLX5_MAKE_CNT_IDX(pi, offset) \
 	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
-#define MLX5_CNT_TO_CNT_EXT(pool, cnt) (&((struct mlx5_flow_counter_ext *) \
-			    ((pool) + 1))[((cnt) - (pool)->counters_raw)])
+#define MLX5_CNT_TO_CNT_EXT(cnt) \
+	((struct mlx5_flow_counter_ext *)((cnt) + 1))
 #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
-			      (&((struct mlx5_flow_counter_ext *) \
-			      ((pool) + 1))[offset])
+	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
 
 struct mlx5_flow_counter_pool;
 
@@ -287,11 +299,10 @@ struct mlx5_flow_counter_pool {
 	rte_atomic64_t start_query_gen; /* Query start round. */
 	rte_atomic64_t end_query_gen; /* Query end round. */
 	uint32_t index; /* Pool index in container. */
+	uint32_t type: 2; /* Memory type behind the counter array. */
 	rte_spinlock_t sl; /* The pool lock. */
 	struct mlx5_counter_stats_raw *raw;
 	struct mlx5_counter_stats_raw *raw_hw; /* The raw on HW working. */
-	struct mlx5_flow_counter counters_raw[MLX5_COUNTERS_PER_POOL];
-	/* The pool counters memory. */
 };
 
 struct mlx5_counter_stats_raw;
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 6263ecc731..784a62c521 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -3909,7 +3909,7 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
 	MLX5_ASSERT(pool);
 	if (ppool)
 		*ppool = pool;
-	return &pool->counters_raw[idx % MLX5_COUNTERS_PER_POOL];
+	return MLX5_POOL_GET_CNT(pool, idx % MLX5_COUNTERS_PER_POOL);
 }
 
 /**
@@ -4117,7 +4117,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 		if (priv->counter_fallback)
 			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
 					0, pkts, bytes, 0, NULL, NULL, 0);
@@ -4133,7 +4133,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 		*pkts = 0;
 		*bytes = 0;
 	} else {
-		offset = cnt - &pool->counters_raw[0];
+		offset = MLX5_CNT_ARRAY_IDX(pool, cnt);
 		*pkts = rte_be_to_cpu_64(pool->raw->data[offset].hits);
 		*bytes = rte_be_to_cpu_64(pool->raw->data[offset].bytes);
 	}
@@ -4173,9 +4173,9 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 			return NULL;
 	}
 	size = sizeof(*pool);
+	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
 	if (!batch)
-		size += MLX5_COUNTERS_PER_POOL *
-			sizeof(struct mlx5_flow_counter_ext);
+		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
 	pool = rte_calloc(__func__, 1, size, 0);
 	if (!pool) {
 		rte_errno = ENOMEM;
@@ -4186,6 +4186,9 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 		pool->raw = cont->init_mem_mng->raws + n_valid %
 						     MLX5_CNT_CONTAINER_RESIZE;
 	pool->raw_hw = NULL;
+	pool->type = 0;
+	if (!batch)
+		pool->type |= CNT_POOL_TYPE_EXT;
 	rte_spinlock_init(&pool->sl);
 	/*
 	 * The generation of the new allocated counters in this pool is 0, 2 in
@@ -4257,7 +4260,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 					 (int64_t)(uintptr_t)dcs);
 		}
 		i = dcs->id % MLX5_COUNTERS_PER_POOL;
-		cnt = &pool->counters_raw[i];
+		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 		MLX5_GET_POOL_CNT_EXT(pool, i)->dcs = dcs;
 		*cnt_free = cnt;
@@ -4277,10 +4280,10 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 	}
 	pool = TAILQ_FIRST(&cont->pool_list);
 	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
-		cnt = &pool->counters_raw[i];
+		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 	}
-	*cnt_free = &pool->counters_raw[0];
+	*cnt_free = MLX5_POOL_GET_CNT(pool, 0);
 	return cont;
 }
 
@@ -4398,14 +4401,14 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		pool = TAILQ_FIRST(&cont->pool_list);
 	}
 	if (!batch)
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
 	/* Create a DV counter action only in the first time usage. */
 	if (!cnt_free->action) {
 		uint16_t offset;
 		struct mlx5_devx_obj *dcs;
 
 		if (batch) {
-			offset = cnt_free - &pool->counters_raw[0];
+			offset = MLX5_CNT_ARRAY_IDX(pool, cnt_free);
 			dcs = pool->min_dcs;
 		} else {
 			offset = 0;
@@ -4419,7 +4422,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		}
 	}
 	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
-				    (cnt_free - pool->counters_raw));
+				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
 	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
 	/* Update the counter reset values. */
 	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
@@ -4462,7 +4465,7 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 		if (cnt_ext && --cnt_ext->ref_cnt)
 			return;
 	}
diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
index d20098ce45..236d665852 100644
--- a/drivers/net/mlx5/mlx5_flow_verbs.c
+++ b/drivers/net/mlx5/mlx5_flow_verbs.c
@@ -64,7 +64,7 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
 	MLX5_ASSERT(pool);
 	if (ppool)
 		*ppool = pool;
-	return &pool->counters_raw[idx % MLX5_COUNTERS_PER_POOL];
+	return MLX5_POOL_GET_CNT(pool, idx % MLX5_COUNTERS_PER_POOL);
 }
 
 /**
@@ -207,16 +207,16 @@ flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
 		if (!pool)
 			return 0;
 		for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
-			cnt = &pool->counters_raw[i];
+			cnt = MLX5_POOL_GET_CNT(pool, i);
 			TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
 		}
-		cnt = &pool->counters_raw[0];
+		cnt = MLX5_POOL_GET_CNT(pool, 0);
 		cont->pools[n_valid] = pool;
 		pool_idx = n_valid;
 		rte_atomic16_add(&cont->n_valid, 1);
 		TAILQ_INSERT_HEAD(&cont->pool_list, pool, next);
 	}
-	i = cnt - pool->counters_raw;
+	i = MLX5_CNT_ARRAY_IDX(pool, cnt);
 	cnt_ext = MLX5_GET_POOL_CNT_EXT(pool, i);
 	cnt_ext->id = id;
 	cnt_ext->shared = shared;
@@ -251,7 +251,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 
 	cnt = flow_verbs_counter_get_by_idx(dev, counter,
 					    &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	if (--cnt_ext->ref_cnt == 0) {
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
@@ -282,7 +282,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
 		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
 						(dev, flow->counter, &pool);
 		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
-							(pool, cnt);
+						(cnt);
 		struct rte_flow_query_count *qc = data;
 		uint64_t counters[2] = {0, 0};
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
@@ -1083,12 +1083,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
 	}
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	counter.counter_set_handle = cnt_ext->cs->handle;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
 	counter.counters = cnt_ext->cs;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #endif
-- 
2.21.0



* [dpdk-dev] [PATCH v3 2/2] net/mlx5: support flow aging
  2020-04-29  2:25       ` [dpdk-dev] [PATCH v3 0/2] " Bill Zhou
  2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
@ 2020-04-29  2:25         ` Bill Zhou
  2020-05-03  7:41         ` [dpdk-dev] [PATCH v3 0/2] " Matan Azrad
  2 siblings, 0 replies; 50+ messages in thread
From: Bill Zhou @ 2020-04-29  2:25 UTC (permalink / raw)
  To: matan, orika, shahafs, viacheslavo, marko.kovacevic, john.mcnamara; +Cc: dev

The mlx5 driver currently has no flow aging check and no age-out event
callback mechanism. This patch implements both. It includes:
- Splitting the current counter container into aged and non-aged
  containers to reduce memory consumption. Only the aged container
  allocates extra memory to store the aging parameters from the user
  configuration.
- An aging check and age-out event callback mechanism based on the
  existing counters. When a flow is found to be aged out, the
  RTE_ETH_EVENT_FLOW_AGED event is reported to the application.
- An implementation of the new API rte_flow_get_aged_flows, which
  applications can use to retrieve the aged flows.
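
As an illustration of the aging check described above, here is a minimal
standalone sketch of the 16-bit wrap-around deadline test, assuming time
is counted in 0.1 sec ticks as in this patch. The names age_slot,
age_slot_arm and age_slot_check are hypothetical and are not part of the
driver; only the comparison against UINT16_MAX / 2 mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-counter aging parameters. */
struct age_slot {
	uint16_t timeout;   /* Idle timeout in 0.1 s ticks. */
	uint16_t expire;    /* Tick at which the flow ages out. */
	uint64_t last_hits; /* Hit count seen by the previous query. */
};

/* Arm the slot: the flow expires "timeout" ticks after "now". */
static void
age_slot_arm(struct age_slot *s, uint16_t now, uint16_t timeout,
	     uint64_t hits)
{
	/* Larger timeouts would make the wrap-around test ambiguous. */
	assert(timeout < UINT16_MAX / 2);
	s->timeout = timeout;
	s->expire = (uint16_t)(now + timeout);
	s->last_hits = hits;
}

/* Return true when the flow saw no traffic and its deadline has passed. */
static bool
age_slot_check(struct age_slot *s, uint16_t now, uint64_t hits)
{
	if (hits != s->last_hits) {
		/* Traffic was seen: remember it and push the deadline. */
		s->last_hits = hits;
		s->expire = (uint16_t)(now + s->timeout);
		return false;
	}
	/* A difference in the upper half means "expire" is still ahead. */
	return (uint16_t)(now - s->expire) < UINT16_MAX / 2;
}
```

Because both the current tick and the deadline are truncated to 16 bits,
the subtraction stays meaningful across a counter wrap as long as the
timeout is below half of the tick range, which is presumably why the
validation in this patch caps the age timeout.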

Signed-off-by: Bill Zhou <dongz@mellanox.com>
---
v2: Move the aging list from struct mlx5_ibv_shared to struct mlx5_priv
so that each port has its own aging list. Trigger the event only once
after the last call of rte_flow_get_aged_flows.
v3: Update the way the aging event callback is invoked and update some
comments.
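
To illustrate the "trigger once" behaviour mentioned in the v2 note, here
is a small standalone sketch of the two flag bits. It assumes that the
trigger bit is re-armed when the application drains the aged list; that
re-arm step is inferred from the note above and is not shown in this
part of the patch. All names (age_info, age_mark_new, age_round_end,
age_drain, events_sent) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AGE_EVENT_NEW 1 /* Bit position: new aged flows were queued. */
#define AGE_TRIGGER   2 /* Bit position: the application may be notified. */

struct age_info {
	uint8_t flags;
	unsigned int events_sent; /* Counts simulated event callbacks. */
};

static void
age_set(struct age_info *ai, int bit)
{
	assert(bit >= 0 && bit < 8);
	ai->flags |= (uint8_t)(1u << bit);
}

static bool
age_get(const struct age_info *ai, int bit)
{
	return (ai->flags & (1u << bit)) != 0;
}

/* Driver side: a query round found a newly aged flow. */
static void
age_mark_new(struct age_info *ai)
{
	age_set(ai, AGE_EVENT_NEW);
}

/* Driver side: end of a query round, notify at most once. */
static void
age_round_end(struct age_info *ai)
{
	if (!age_get(ai, AGE_EVENT_NEW))
		return;
	if (age_get(ai, AGE_TRIGGER))
		ai->events_sent++; /* Stands in for the ethdev callback. */
	ai->flags = 0; /* Disarm until the application drains the list. */
}

/* Application side: stands in for a call that drains the aged list. */
static void
age_drain(struct age_info *ai)
{
	age_set(ai, AGE_TRIGGER);
}
```

With this scheme an application that ignores the event is not flooded
with repeated notifications, since further aged flows stay queued
silently until the list is read again.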
---
 doc/guides/rel_notes/release_20_05.rst |   1 +
 drivers/net/mlx5/mlx5.c                |  93 ++++---
 drivers/net/mlx5/mlx5.h                |  66 ++++-
 drivers/net/mlx5/mlx5_flow.c           | 205 ++++++++++++--
 drivers/net/mlx5/mlx5_flow.h           |  16 +-
 drivers/net/mlx5/mlx5_flow_dv.c        | 364 +++++++++++++++++++++----
 drivers/net/mlx5/mlx5_flow_verbs.c     |  14 +-
 7 files changed, 636 insertions(+), 123 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index b124c3f287..a5ba8a4792 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -141,6 +141,7 @@ New Features
   * Added support for creating Relaxed Ordering Memory Regions.
   * Added support for jumbo frame size (9K MTU) in Multi-Packet RQ mode.
   * Optimized the memory consumption of flow.
+  * Added support for flow aging based on hardware counter.
 
 * **Updated the AESNI MB crypto PMD.**
 
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 57d76cb741..ad3d92bce2 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -437,6 +437,27 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
 	return 0;
 }
 
+/**
+ * Initialize the shared aging list information per port.
+ *
+ * @param[in] sh
+ *   Pointer to mlx5_ibv_shared object.
+ */
+static void
+mlx5_flow_aging_init(struct mlx5_ibv_shared *sh)
+{
+	uint32_t i;
+	struct mlx5_age_info *age_info;
+
+	for (i = 0; i < sh->max_port; i++) {
+		age_info = &sh->port[i].age_info;
+		age_info->flags = 0;
+		TAILQ_INIT(&age_info->aged_counters);
+		rte_spinlock_init(&age_info->aged_sl);
+		MLX5_AGE_SET(age_info, MLX5_AGE_TRIGGER);
+	}
+}
+
 /**
  * Initialize the counters management structure.
  *
@@ -446,11 +467,14 @@ mlx5_flow_id_release(struct mlx5_flow_id_pool *pool, uint32_t id)
 static void
 mlx5_flow_counters_mng_init(struct mlx5_ibv_shared *sh)
 {
-	uint8_t i;
+	uint8_t i, age;
 
+	sh->cmng.age = 0;
 	TAILQ_INIT(&sh->cmng.flow_counters);
-	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
-		TAILQ_INIT(&sh->cmng.ccont[i].pool_list);
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i)
+			TAILQ_INIT(&sh->cmng.ccont[i][age].pool_list);
+	}
 }
 
 /**
@@ -480,7 +504,7 @@ static void
 mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 {
 	struct mlx5_counter_stats_mem_mng *mng;
-	uint8_t i;
+	uint8_t i, age = 0;
 	int j;
 	int retries = 1024;
 
@@ -491,36 +515,42 @@ mlx5_flow_counters_mng_close(struct mlx5_ibv_shared *sh)
 			break;
 		rte_pause();
 	}
-	for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
-		struct mlx5_flow_counter_pool *pool;
-		uint32_t batch = !!(i % 2);
 
-		if (!sh->cmng.ccont[i].pools)
-			continue;
-		pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
-		while (pool) {
-			if (batch) {
-				if (pool->min_dcs)
-					claim_zero
-					(mlx5_devx_cmd_destroy(pool->min_dcs));
-			}
-			for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
-				if (MLX5_POOL_GET_CNT(pool, j)->action)
-					claim_zero
-					(mlx5_glue->destroy_flow_action
-					 (MLX5_POOL_GET_CNT(pool, j)->action));
-				if (!batch && MLX5_GET_POOL_CNT_EXT
-				    (pool, j)->dcs)
-					claim_zero(mlx5_devx_cmd_destroy
-						  (MLX5_GET_POOL_CNT_EXT
-						  (pool, j)->dcs));
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < RTE_DIM(sh->cmng.ccont); ++i) {
+			struct mlx5_flow_counter_pool *pool;
+			uint32_t batch = !!(i % 2);
+
+			if (!sh->cmng.ccont[i][age].pools)
+				continue;
+			pool = TAILQ_FIRST(&sh->cmng.ccont[i][age].pool_list);
+			while (pool) {
+				if (batch) {
+					if (pool->min_dcs)
+						claim_zero
+						(mlx5_devx_cmd_destroy
+						(pool->min_dcs));
+				}
+				for (j = 0; j < MLX5_COUNTERS_PER_POOL; ++j) {
+					if (MLX5_POOL_GET_CNT(pool, j)->action)
+						claim_zero
+						(mlx5_glue->destroy_flow_action
+						 (MLX5_POOL_GET_CNT
+						  (pool, j)->action));
+					if (!batch && MLX5_GET_POOL_CNT_EXT
+					    (pool, j)->dcs)
+						claim_zero(mlx5_devx_cmd_destroy
+							  (MLX5_GET_POOL_CNT_EXT
+							  (pool, j)->dcs));
+				}
+				TAILQ_REMOVE(&sh->cmng.ccont[i][age].pool_list,
+					pool, next);
+				rte_free(pool);
+				pool = TAILQ_FIRST
+					(&sh->cmng.ccont[i][age].pool_list);
 			}
-			TAILQ_REMOVE(&sh->cmng.ccont[i].pool_list, pool,
-				     next);
-			rte_free(pool);
-			pool = TAILQ_FIRST(&sh->cmng.ccont[i].pool_list);
+			rte_free(sh->cmng.ccont[i][age].pools);
 		}
-		rte_free(sh->cmng.ccont[i].pools);
 	}
 	mng = LIST_FIRST(&sh->cmng.mem_mngs);
 	while (mng) {
@@ -788,6 +818,7 @@ mlx5_alloc_shared_ibctx(const struct mlx5_dev_spawn_data *spawn,
 		err = rte_errno;
 		goto error;
 	}
+	mlx5_flow_aging_init(sh);
 	mlx5_flow_counters_mng_init(sh);
 	mlx5_flow_ipool_create(sh, config);
 	/* Add device to memory callback list. */
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 4d9984f603..1740d4ae89 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -222,13 +222,22 @@ struct mlx5_drop {
 #define MLX5_COUNTERS_PER_POOL 512
 #define MLX5_MAX_PENDING_QUERIES 4
 #define MLX5_CNT_CONTAINER_RESIZE 64
+#define MLX5_CNT_AGE_OFFSET 0x80000000
 #define CNT_SIZE (sizeof(struct mlx5_flow_counter))
 #define CNTEXT_SIZE (sizeof(struct mlx5_flow_counter_ext))
+#define AGE_SIZE (sizeof(struct mlx5_age_param))
+#define MLX5_AGING_TIME_DELAY	7
 
 #define CNT_POOL_TYPE_EXT	(1 << 0)
+#define CNT_POOL_TYPE_AGE	(1 << 1)
 #define IS_EXT_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_EXT)
+#define IS_AGE_POOL(pool) (((pool)->type) & CNT_POOL_TYPE_AGE)
+#define MLX_CNT_IS_AGE(counter) ((counter) & MLX5_CNT_AGE_OFFSET ? 1 : 0)
+
 #define MLX5_CNT_LEN(pool) \
-	(CNT_SIZE + (IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
+	(CNT_SIZE + \
+	(IS_AGE_POOL(pool) ? AGE_SIZE : 0) + \
+	(IS_EXT_POOL(pool) ? CNTEXT_SIZE : 0))
 #define MLX5_POOL_GET_CNT(pool, index) \
 	((struct mlx5_flow_counter *) \
 	((uint8_t *)((pool) + 1) + (index) * (MLX5_CNT_LEN(pool))))
@@ -243,13 +252,33 @@ struct mlx5_drop {
  */
 #define MLX5_MAKE_CNT_IDX(pi, offset) \
 	((pi) * MLX5_COUNTERS_PER_POOL + (offset) + 1)
-#define MLX5_CNT_TO_CNT_EXT(cnt) \
-	((struct mlx5_flow_counter_ext *)((cnt) + 1))
+#define MLX5_CNT_TO_CNT_EXT(pool, cnt) \
+	((struct mlx5_flow_counter_ext *)\
+	((uint8_t *)((cnt) + 1) + \
+	(IS_AGE_POOL(pool) ? AGE_SIZE : 0)))
 #define MLX5_GET_POOL_CNT_EXT(pool, offset) \
-	MLX5_CNT_TO_CNT_EXT(MLX5_POOL_GET_CNT((pool), (offset)))
+	MLX5_CNT_TO_CNT_EXT(pool, MLX5_POOL_GET_CNT((pool), (offset)))
+#define MLX5_CNT_TO_AGE(cnt) \
+	((struct mlx5_age_param *)((cnt) + 1))
 
 struct mlx5_flow_counter_pool;
 
+/* Age status. */
+enum {
+	AGE_FREE, /* Initialized state. */
+	AGE_CANDIDATE, /* Counter assigned to flows. */
+	AGE_TMOUT, /* Timeout, wait for rte_flow_get_aged_flows and destroy. */
+};
+
+/* Counter age parameter. */
+struct mlx5_age_param {
+	rte_atomic16_t state; /**< Age state. */
+	uint16_t port_id; /**< Port id of the counter. */
+	uint32_t timeout:15; /**< Age timeout in unit of 0.1sec. */
+	uint32_t expire:16; /**< Expire time(0.1sec) in the future. */
+	void *context; /**< Flow counter age context. */
+};
+
 struct flow_counter_stats {
 	uint64_t hits;
 	uint64_t bytes;
@@ -299,7 +328,7 @@ struct mlx5_flow_counter_pool {
 	rte_atomic64_t start_query_gen; /* Query start round. */
 	rte_atomic64_t end_query_gen; /* Query end round. */
 	uint32_t index; /* Pool index in container. */
-	uint32_t type: 2; /* Memory type behind the counter array. */
+	uint8_t type; /* Memory type behind the counter array. */
 	rte_spinlock_t sl; /* The pool lock. */
 	struct mlx5_counter_stats_raw *raw;
 	struct mlx5_counter_stats_raw *raw_hw; /* The raw on HW working. */
@@ -337,18 +366,33 @@ struct mlx5_pools_container {
 
 /* Counter global management structure. */
 struct mlx5_flow_counter_mng {
-	uint8_t mhi[2]; /* master \ host container index. */
-	struct mlx5_pools_container ccont[2 * 2];
-	/* 2 containers for single and for batch for double-buffer. */
+	uint8_t mhi[2][2]; /* master \ host and age \ no age container index. */
+	struct mlx5_pools_container ccont[2 * 2][2];
+	/* master \ host and age \ no age pools container. */
 	struct mlx5_counters flow_counters; /* Legacy flow counter list. */
 	uint8_t pending_queries;
 	uint8_t batch;
 	uint16_t pool_index;
+	uint8_t age;
 	uint8_t query_thread_on;
 	LIST_HEAD(mem_mngs, mlx5_counter_stats_mem_mng) mem_mngs;
 	LIST_HEAD(stat_raws, mlx5_counter_stats_raw) free_stat_raws;
 };
-
+#define MLX5_AGE_EVENT_NEW		1
+#define MLX5_AGE_TRIGGER		2
+#define MLX5_AGE_SET(age_info, BIT) \
+	((age_info)->flags |= (1 << (BIT)))
+#define MLX5_AGE_GET(age_info, BIT) \
+	((age_info)->flags & (1 << (BIT)))
+#define GET_PORT_AGE_INFO(priv) \
+	(&((priv)->sh->port[(priv)->ibv_port - 1].age_info))
+
+/* Per-port aging information. */
+struct mlx5_age_info {
+	uint8_t flags; /* Indicates a new event or that a trigger is needed. */
+	struct mlx5_counters aged_counters; /* Aged flow counter list. */
+	rte_spinlock_t aged_sl; /* Aged flow counter list lock. */
+};
 /* Per port data of shared IB device. */
 struct mlx5_ibv_shared_port {
 	uint32_t ih_port_id;
@@ -360,6 +404,8 @@ struct mlx5_ibv_shared_port {
 	 * RTE_MAX_ETHPORTS it means there is no subhandler
 	 * installed for specified IB port index.
 	 */
+	struct mlx5_age_info age_info;
+	/* Per-port aging information. */
 };
 
 /* Table key of the hash organization. */
@@ -765,6 +811,8 @@ int mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 int mlx5_flow_dev_dump(struct rte_eth_dev *dev, FILE *file,
 		       struct rte_flow_error *error);
 void mlx5_flow_rxq_dynf_metadata_set(struct rte_eth_dev *dev);
+int mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error);
 
 /* mlx5_mp.c */
 int mlx5_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer);
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index cba1f23e81..fb56c1eb8a 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -24,6 +24,7 @@
 #include <rte_ether.h>
 #include <rte_ethdev_driver.h>
 #include <rte_flow.h>
+#include <rte_cycles.h>
 #include <rte_flow_driver.h>
 #include <rte_malloc.h>
 #include <rte_ip.h>
@@ -242,6 +243,7 @@ static const struct rte_flow_ops mlx5_flow_ops = {
 	.isolate = mlx5_flow_isolate,
 	.query = mlx5_flow_query,
 	.dev_dump = mlx5_flow_dev_dump,
+	.get_aged_flows = mlx5_flow_get_aged_flows,
 };
 
 /* Convert FDIR request to Generic flow. */
@@ -2531,6 +2533,8 @@ flow_drv_validate(struct rte_eth_dev *dev,
  *   Pointer to the list of items.
  * @param[in] actions
  *   Pointer to the list of actions.
+ * @param[in] flow_idx
+ *   The memory pool index of the flow.
  * @param[out] error
  *   Pointer to the error structure.
  *
@@ -2543,14 +2547,19 @@ flow_drv_prepare(struct rte_eth_dev *dev,
 		 const struct rte_flow_attr *attr,
 		 const struct rte_flow_item items[],
 		 const struct rte_flow_action actions[],
+		 uint32_t flow_idx,
 		 struct rte_flow_error *error)
 {
 	const struct mlx5_flow_driver_ops *fops;
 	enum mlx5_flow_drv_type type = flow->drv_type;
+	struct mlx5_flow *mlx5_flow = NULL;
 
 	MLX5_ASSERT(type > MLX5_FLOW_TYPE_MIN && type < MLX5_FLOW_TYPE_MAX);
 	fops = flow_get_drv_ops(type);
-	return fops->prepare(dev, attr, items, actions, error);
+	mlx5_flow = fops->prepare(dev, attr, items, actions, error);
+	if (mlx5_flow)
+		mlx5_flow->flow_idx = flow_idx;
+	return mlx5_flow;
 }
 
 /**
@@ -3498,6 +3507,8 @@ flow_hairpin_split(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   The memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -3511,11 +3522,13 @@ flow_create_split_inner(struct rte_eth_dev *dev,
 			const struct rte_flow_attr *attr,
 			const struct rte_flow_item items[],
 			const struct rte_flow_action actions[],
-			bool external, struct rte_flow_error *error)
+			bool external, uint32_t flow_idx,
+			struct rte_flow_error *error)
 {
 	struct mlx5_flow *dev_flow;
 
-	dev_flow = flow_drv_prepare(dev, flow, attr, items, actions, error);
+	dev_flow = flow_drv_prepare(dev, flow, attr, items, actions,
+		flow_idx, error);
 	if (!dev_flow)
 		return -rte_errno;
 	dev_flow->flow = flow;
@@ -3876,6 +3889,8 @@ flow_mreg_tx_copy_prep(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   The memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -3888,7 +3903,8 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 			   const struct rte_flow_attr *attr,
 			   const struct rte_flow_item items[],
 			   const struct rte_flow_action actions[],
-			   bool external, struct rte_flow_error *error)
+			   bool external, uint32_t flow_idx,
+			   struct rte_flow_error *error)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_dev_config *config = &priv->config;
@@ -3908,7 +3924,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 	    !mlx5_flow_ext_mreg_supported(dev))
 		return flow_create_split_inner(dev, flow, NULL, prefix_layers,
 					       attr, items, actions, external,
-					       error);
+					       flow_idx, error);
 	actions_n = flow_parse_metadata_split_actions_info(actions, &qrss,
 							   &encap_idx);
 	if (qrss) {
@@ -3992,7 +4008,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 	/* Add the unmodified original or prefix subflow. */
 	ret = flow_create_split_inner(dev, flow, &dev_flow, prefix_layers, attr,
 				      items, ext_actions ? ext_actions :
-				      actions, external, error);
+				      actions, external, flow_idx, error);
 	if (ret < 0)
 		goto exit;
 	MLX5_ASSERT(dev_flow);
@@ -4055,7 +4071,7 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
 		ret = flow_create_split_inner(dev, flow, &dev_flow, layers,
 					      &q_attr, mtr_sfx ? items :
 					      q_items, q_actions,
-					      external, error);
+					      external, flow_idx, error);
 		if (ret < 0)
 			goto exit;
 		/* qrss ID should be freed if failed. */
@@ -4096,6 +4112,8 @@ flow_create_split_metadata(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   The memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -4107,7 +4125,8 @@ flow_create_split_meter(struct rte_eth_dev *dev,
 			   const struct rte_flow_attr *attr,
 			   const struct rte_flow_item items[],
 			   const struct rte_flow_action actions[],
-			   bool external, struct rte_flow_error *error)
+			   bool external, uint32_t flow_idx,
+			   struct rte_flow_error *error)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct rte_flow_action *sfx_actions = NULL;
@@ -4151,7 +4170,7 @@ flow_create_split_meter(struct rte_eth_dev *dev,
 		/* Add the prefix subflow. */
 		ret = flow_create_split_inner(dev, flow, &dev_flow, 0, attr,
 					      items, pre_actions, external,
-					      error);
+					      flow_idx, error);
 		if (ret) {
 			ret = -rte_errno;
 			goto exit;
@@ -4168,7 +4187,7 @@ flow_create_split_meter(struct rte_eth_dev *dev,
 					 0, &sfx_attr,
 					 sfx_items ? sfx_items : items,
 					 sfx_actions ? sfx_actions : actions,
-					 external, error);
+					 external, flow_idx, error);
 exit:
 	if (sfx_actions)
 		rte_free(sfx_actions);
@@ -4205,6 +4224,8 @@ flow_create_split_meter(struct rte_eth_dev *dev,
  *   Associated actions (list terminated by the END action).
  * @param[in] external
  *   This flow rule is created by request external to PMD.
+ * @param[in] flow_idx
+ *   The memory pool index of the flow.
  * @param[out] error
  *   Perform verbose error reporting if not NULL.
  * @return
@@ -4216,12 +4237,13 @@ flow_create_split_outer(struct rte_eth_dev *dev,
 			const struct rte_flow_attr *attr,
 			const struct rte_flow_item items[],
 			const struct rte_flow_action actions[],
-			bool external, struct rte_flow_error *error)
+			bool external, uint32_t flow_idx,
+			struct rte_flow_error *error)
 {
 	int ret;
 
 	ret = flow_create_split_meter(dev, flow, attr, items,
-					 actions, external, error);
+					 actions, external, flow_idx, error);
 	MLX5_ASSERT(ret <= 0);
 	return ret;
 }
@@ -4356,7 +4378,7 @@ flow_list_create(struct rte_eth_dev *dev, uint32_t *list,
 		 */
 		ret = flow_create_split_outer(dev, flow, attr,
 					      buf->entry[i].pattern,
-					      p_actions_rx, external,
+					      p_actions_rx, external, idx,
 					      error);
 		if (ret < 0)
 			goto error;
@@ -4367,7 +4389,8 @@ flow_list_create(struct rte_eth_dev *dev, uint32_t *list,
 		attr_tx.ingress = 0;
 		attr_tx.egress = 1;
 		dev_flow = flow_drv_prepare(dev, flow, &attr_tx, items_tx.items,
-					    actions_hairpin_tx.actions, error);
+					 actions_hairpin_tx.actions,
+					 idx, error);
 		if (!dev_flow)
 			goto error;
 		dev_flow->flow = flow;
@@ -5741,6 +5764,31 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 
 #define MLX5_POOL_QUERY_FREQ_US 1000000
 
+/**
+ * Get the number of all valid pools.
+ *
+ * @param[in] sh
+ *   Pointer to mlx5_ibv_shared object.
+ *
+ * @return
+ *   The number of all valid pools.
+ */
+static uint32_t
+mlx5_get_all_valid_pool_count(struct mlx5_ibv_shared *sh)
+{
+	uint8_t age, i;
+	uint32_t pools_n = 0;
+	struct mlx5_pools_container *cont;
+
+	for (age = 0; age < RTE_DIM(sh->cmng.ccont[0]); ++age) {
+		for (i = 0; i < 2 ; ++i) {
+			cont = MLX5_CNT_CONTAINER(sh, i, 0, age);
+			pools_n += rte_atomic16_read(&cont->n_valid);
+		}
+	}
+	return pools_n;
+}
+
 /**
  * Set the periodic procedure for triggering asynchronous batch queries for all
  * the counter pools.
@@ -5751,12 +5799,9 @@ mlx5_counter_query(struct rte_eth_dev *dev, uint32_t cnt,
 void
 mlx5_set_query_alarm(struct mlx5_ibv_shared *sh)
 {
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(sh, 0, 0);
-	uint32_t pools_n = rte_atomic16_read(&cont->n_valid);
-	uint32_t us;
+	uint32_t pools_n, us;
 
-	cont = MLX5_CNT_CONTAINER(sh, 1, 0);
-	pools_n += rte_atomic16_read(&cont->n_valid);
+	pools_n = mlx5_get_all_valid_pool_count(sh);
 	us = MLX5_POOL_QUERY_FREQ_US / pools_n;
 	DRV_LOG(DEBUG, "Set alarm for %u pools each %u us", pools_n, us);
 	if (rte_eal_alarm_set(us, mlx5_flow_query_alarm, sh)) {
@@ -5782,6 +5827,7 @@ mlx5_flow_query_alarm(void *arg)
 	uint16_t offset;
 	int ret;
 	uint8_t batch = sh->cmng.batch;
+	uint8_t age = sh->cmng.age;
 	uint16_t pool_index = sh->cmng.pool_index;
 	struct mlx5_pools_container *cont;
 	struct mlx5_pools_container *mcont;
@@ -5790,8 +5836,8 @@ mlx5_flow_query_alarm(void *arg)
 	if (sh->cmng.pending_queries >= MLX5_MAX_PENDING_QUERIES)
 		goto set_alarm;
 next_container:
-	cont = MLX5_CNT_CONTAINER(sh, batch, 1);
-	mcont = MLX5_CNT_CONTAINER(sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(sh, batch, 1, age);
+	mcont = MLX5_CNT_CONTAINER(sh, batch, 0, age);
 	/* Check if resize was done and need to flip a container. */
 	if (cont != mcont) {
 		if (cont->pools) {
@@ -5801,15 +5847,22 @@ mlx5_flow_query_alarm(void *arg)
 		}
 		rte_cio_wmb();
 		 /* Flip the host container. */
-		sh->cmng.mhi[batch] ^= (uint8_t)2;
+		sh->cmng.mhi[batch][age] ^= (uint8_t)2;
 		cont = mcont;
 	}
 	if (!cont->pools) {
 		/* 2 empty containers case is unexpected. */
-		if (unlikely(batch != sh->cmng.batch))
+		if (unlikely(batch != sh->cmng.batch) &&
+			unlikely(age != sh->cmng.age)) {
 			goto set_alarm;
+		}
 		batch ^= 0x1;
 		pool_index = 0;
+		if (batch == 0 && pool_index == 0) {
+			age ^= 0x1;
+			sh->cmng.batch = batch;
+			sh->cmng.age = age;
+		}
 		goto next_container;
 	}
 	pool = cont->pools[pool_index];
@@ -5852,13 +5905,80 @@ mlx5_flow_query_alarm(void *arg)
 	if (pool_index >= rte_atomic16_read(&cont->n_valid)) {
 		batch ^= 0x1;
 		pool_index = 0;
+		if (batch == 0 && pool_index == 0)
+			age ^= 0x1;
 	}
 set_alarm:
 	sh->cmng.batch = batch;
 	sh->cmng.pool_index = pool_index;
+	sh->cmng.age = age;
 	mlx5_set_query_alarm(sh);
 }
 
+/**
+ * Check the counter pool for newly aged flows and report the event.
+ *
+ * @param[in] sh
+ *   Pointer to mlx5_ibv_shared object.
+ * @param[in] pool
+ *   Pointer to the current counter pool.
+ */
+static void
+mlx5_flow_aging_check(struct mlx5_ibv_shared *sh,
+		   struct mlx5_flow_counter_pool *pool)
+{
+	struct mlx5_priv *priv;
+	struct mlx5_flow_counter *cnt;
+	struct mlx5_age_info *age_info;
+	struct mlx5_age_param *age_param;
+	struct mlx5_counter_stats_raw *cur = pool->raw_hw;
+	struct mlx5_counter_stats_raw *prev = pool->raw;
+	uint16_t curr = rte_rdtsc() / (rte_get_tsc_hz() / 10);
+	uint32_t i;
+
+	for (i = 0; i < MLX5_COUNTERS_PER_POOL; ++i) {
+		cnt = MLX5_POOL_GET_CNT(pool, i);
+		age_param = MLX5_CNT_TO_AGE(cnt);
+		if (rte_atomic16_read(&age_param->state) != AGE_CANDIDATE)
+			continue;
+		if (cur->data[i].hits != prev->data[i].hits) {
+			age_param->expire = curr + age_param->timeout;
+			continue;
+		}
+		if ((uint16_t)(curr - age_param->expire) >= (UINT16_MAX / 2))
+			continue;
+		/**
+		 * Hold the lock first. Otherwise, if a release
+		 * happens between setting the AGE_TMOUT state
+		 * and the tailq insertion, the release procedure
+		 * may delete a non-existent tailq node.
+		 */
+		priv = rte_eth_devices[age_param->port_id].data->dev_private;
+		age_info = GET_PORT_AGE_INFO(priv);
+		rte_spinlock_lock(&age_info->aged_sl);
+		/* If the cmpset fails, a release has happened. */
+		if (rte_atomic16_cmpset((volatile uint16_t *)
+					&age_param->state,
+					AGE_CANDIDATE,
+					AGE_TMOUT) ==
+					AGE_CANDIDATE) {
+			TAILQ_INSERT_TAIL(&age_info->aged_counters, cnt, next);
+			MLX5_AGE_SET(age_info, MLX5_AGE_EVENT_NEW);
+		}
+		rte_spinlock_unlock(&age_info->aged_sl);
+	}
+	for (i = 0; i < sh->max_port; i++) {
+		age_info = &sh->port[i].age_info;
+		if (!MLX5_AGE_GET(age_info, MLX5_AGE_EVENT_NEW))
+			continue;
+		if (MLX5_AGE_GET(age_info, MLX5_AGE_TRIGGER))
+			_rte_eth_dev_callback_process
+				(&rte_eth_devices[sh->port[i].devx_ih_port_id],
+				RTE_ETH_EVENT_FLOW_AGED, NULL);
+		age_info->flags = 0;
+	}
+}
+
 /**
  * Handler for the HW respond about ready values from an asynchronous batch
  * query. This function is probably called by the host thread.
@@ -5883,6 +6003,8 @@ mlx5_flow_async_pool_query_handle(struct mlx5_ibv_shared *sh,
 		raw_to_free = pool->raw_hw;
 	} else {
 		raw_to_free = pool->raw;
+		if (IS_AGE_POOL(pool))
+			mlx5_flow_aging_check(sh, pool);
 		rte_spinlock_lock(&pool->sl);
 		pool->raw = pool->raw_hw;
 		rte_spinlock_unlock(&pool->sl);
@@ -6034,3 +6156,40 @@ mlx5_flow_dev_dump(struct rte_eth_dev *dev,
 	return mlx5_devx_cmd_flow_dump(sh->fdb_domain, sh->rx_domain,
 				       sh->tx_domain, file);
 }
+
+/**
+ * Get aged-out flows.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] contexts
+ *   The address of an array of pointers to the aged-out flow contexts.
+ * @param[in] nb_contexts
+ *   The length of the contexts array.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   The number of contexts retrieved on success, a negative errno value
+ *   otherwise. If nb_contexts is 0, return the total number of aged
+ *   contexts. If nb_contexts is not 0, return the number of aged flows
+ *   reported in the contexts array.
+ */
+int
+mlx5_flow_get_aged_flows(struct rte_eth_dev *dev, void **contexts,
+			uint32_t nb_contexts, struct rte_flow_error *error)
+{
+	const struct mlx5_flow_driver_ops *fops;
+	struct rte_flow_attr attr = { .transfer = 0 };
+
+	if (flow_get_drv_type(dev, &attr) == MLX5_FLOW_TYPE_DV) {
+		fops = flow_get_drv_ops(MLX5_FLOW_TYPE_DV);
+		return fops->get_aged_flows(dev, contexts, nb_contexts,
+						    error);
+	}
+	DRV_LOG(ERR,
+		"port %u get aged flows is not supported.",
+		 dev->data->port_id);
+	return -ENOTSUP;
+}
diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 2a1f59698c..bf1d5beb9b 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -199,6 +199,7 @@ enum mlx5_feature_name {
 #define MLX5_FLOW_ACTION_METER (1ull << 31)
 #define MLX5_FLOW_ACTION_SET_IPV4_DSCP (1ull << 32)
 #define MLX5_FLOW_ACTION_SET_IPV6_DSCP (1ull << 33)
+#define MLX5_FLOW_ACTION_AGE (1ull << 34)
 
 #define MLX5_FLOW_FATE_ACTIONS \
 	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | \
@@ -650,6 +651,7 @@ struct mlx5_flow_verbs_workspace {
 /** Device flow structure. */
 struct mlx5_flow {
 	struct rte_flow *flow; /**< Pointer to the main flow. */
+	uint32_t flow_idx; /**< The memory pool index to the main flow. */
 	uint64_t hash_fields; /**< Verbs hash Rx queue hash fields. */
 	uint64_t act_flags;
 	/**< Bit-fields of detected actions, see MLX5_FLOW_ACTION_*. */
@@ -873,6 +875,11 @@ typedef int (*mlx5_flow_counter_query_t)(struct rte_eth_dev *dev,
 					 uint32_t cnt,
 					 bool clear, uint64_t *pkts,
 					 uint64_t *bytes);
+typedef int (*mlx5_flow_get_aged_flows_t)
+					(struct rte_eth_dev *dev,
+					 void **context,
+					 uint32_t nb_contexts,
+					 struct rte_flow_error *error);
 struct mlx5_flow_driver_ops {
 	mlx5_flow_validate_t validate;
 	mlx5_flow_prepare_t prepare;
@@ -888,13 +895,14 @@ struct mlx5_flow_driver_ops {
 	mlx5_flow_counter_alloc_t counter_alloc;
 	mlx5_flow_counter_free_t counter_free;
 	mlx5_flow_counter_query_t counter_query;
+	mlx5_flow_get_aged_flows_t get_aged_flows;
 };
 
 
-#define MLX5_CNT_CONTAINER(sh, batch, thread) (&(sh)->cmng.ccont \
-	[(((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
-#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread) (&(sh)->cmng.ccont \
-	[(~((sh)->cmng.mhi[batch] >> (thread)) & 0x1) * 2 + (batch)])
+#define MLX5_CNT_CONTAINER(sh, batch, thread, age) (&(sh)->cmng.ccont \
+	[(((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
+#define MLX5_CNT_CONTAINER_UNUSED(sh, batch, thread, age) (&(sh)->cmng.ccont \
+	[(~((sh)->cmng.mhi[batch][age] >> (thread)) & 0x1) * 2 + (batch)][age])
 
 /* mlx5_flow.c */
 
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 784a62c521..e4ab07f7c9 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -24,6 +24,7 @@
 #include <rte_flow.h>
 #include <rte_flow_driver.h>
 #include <rte_malloc.h>
+#include <rte_cycles.h>
 #include <rte_ip.h>
 #include <rte_gre.h>
 #include <rte_vxlan.h>
@@ -3719,6 +3720,50 @@ mlx5_flow_validate_action_meter(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/**
+ * Validate the age action.
+ *
+ * @param[in] action_flags
+ *   Holds the actions detected until now.
+ * @param[in] action
+ *   Pointer to the age action.
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[out] error
+ *   Pointer to error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_dv_validate_action_age(uint64_t action_flags,
+			    const struct rte_flow_action *action,
+			    struct rte_eth_dev *dev,
+			    struct rte_flow_error *error)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	const struct rte_flow_action_age *age = action->conf;
+
+	if (!priv->config.devx || priv->counter_fallback)
+		return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					  NULL,
+					  "age action not supported");
+	if (!(action->conf))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "configuration cannot be null");
+	if (age->timeout >= UINT16_MAX / 2 / 10)
+		return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "Max age time: 3275 seconds");
+	if (action_flags & MLX5_FLOW_ACTION_AGE)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, NULL,
+					  "Duplicate age actions set");
+	return 0;
+}
+
 /**
  * Validate the modify-header IPv4 DSCP actions.
  *
@@ -3896,14 +3941,16 @@ flow_dv_counter_get_by_idx(struct rte_eth_dev *dev,
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont;
 	struct mlx5_flow_counter_pool *pool;
-	uint32_t batch = 0;
+	uint32_t batch = 0, age = 0;
 
 	idx--;
+	age = MLX_CNT_IS_AGE(idx);
+	idx = age ? idx - MLX5_CNT_AGE_OFFSET : idx;
 	if (idx >= MLX5_CNT_BATCH_OFFSET) {
 		idx -= MLX5_CNT_BATCH_OFFSET;
 		batch = 1;
 	}
-	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	MLX5_ASSERT(idx / MLX5_COUNTERS_PER_POOL < cont->n);
 	pool = cont->pools[idx / MLX5_COUNTERS_PER_POOL];
 	MLX5_ASSERT(pool);
@@ -4023,18 +4070,21 @@ flow_dv_create_counter_stat_mem_mng(struct rte_eth_dev *dev, int raws_n)
  *   Pointer to the Ethernet device structure.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for Aging counter.
  *
  * @return
  *   The new container pointer on success, otherwise NULL and rte_errno is set.
  */
 static struct mlx5_pools_container *
-flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
+flow_dv_container_resize(struct rte_eth_dev *dev,
+				uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont =
-			MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+			MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	struct mlx5_pools_container *new_cont =
-			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0);
+			MLX5_CNT_CONTAINER_UNUSED(priv->sh, batch, 0, age);
 	struct mlx5_counter_stats_mem_mng *mem_mng = NULL;
 	uint32_t resize = cont->n + MLX5_CNT_CONTAINER_RESIZE;
 	uint32_t mem_size = sizeof(struct mlx5_flow_counter_pool *) * resize;
@@ -4042,7 +4092,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
 
 	/* Fallback mode has no background thread. Skip the check. */
 	if (!priv->counter_fallback &&
-	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1)) {
+	    cont != MLX5_CNT_CONTAINER(priv->sh, batch, 1, age)) {
 		/* The last resize still hasn't detected by the host thread. */
 		rte_errno = EAGAIN;
 		return NULL;
@@ -4085,7 +4135,7 @@ flow_dv_container_resize(struct rte_eth_dev *dev, uint32_t batch)
 	new_cont->init_mem_mng = mem_mng;
 	rte_cio_wmb();
 	 /* Flip the master container. */
-	priv->sh->cmng.mhi[batch] ^= (uint8_t)1;
+	priv->sh->cmng.mhi[batch][age] ^= (uint8_t)1;
 	return new_cont;
 }
 
@@ -4117,7 +4167,7 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 		if (priv->counter_fallback)
 			return mlx5_devx_cmd_flow_counter_query(cnt_ext->dcs, 0,
 					0, pkts, bytes, 0, NULL, NULL, 0);
@@ -4150,6 +4200,8 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
  *   The devX counter handle.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for counter that was allocated for aging.
  * @param[in/out] cont_cur
  *   Pointer to the container pointer, it will be update in pool resize.
  *
@@ -4158,24 +4210,23 @@ _flow_dv_query_count(struct rte_eth_dev *dev, uint32_t counter, uint64_t *pkts,
  */
 static struct mlx5_pools_container *
 flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
-		    uint32_t batch)
+		    uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_flow_counter_pool *pool;
 	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
-							       0);
+							       0, age);
 	int16_t n_valid = rte_atomic16_read(&cont->n_valid);
-	uint32_t size;
+	uint32_t size = sizeof(*pool);
 
 	if (cont->n == n_valid) {
-		cont = flow_dv_container_resize(dev, batch);
+		cont = flow_dv_container_resize(dev, batch, age);
 		if (!cont)
 			return NULL;
 	}
-	size = sizeof(*pool);
 	size += MLX5_COUNTERS_PER_POOL * CNT_SIZE;
-	if (!batch)
-		size += MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE;
+	size += (batch ? 0 : MLX5_COUNTERS_PER_POOL * CNTEXT_SIZE);
+	size += (!age ? 0 : MLX5_COUNTERS_PER_POOL * AGE_SIZE);
 	pool = rte_calloc(__func__, 1, size, 0);
 	if (!pool) {
 		rte_errno = ENOMEM;
@@ -4187,8 +4238,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 						     MLX5_CNT_CONTAINER_RESIZE;
 	pool->raw_hw = NULL;
 	pool->type = 0;
-	if (!batch)
-		pool->type |= CNT_POOL_TYPE_EXT;
+	pool->type |= (batch ? 0 :  CNT_POOL_TYPE_EXT);
+	pool->type |= (!age ? 0 :  CNT_POOL_TYPE_AGE);
 	rte_spinlock_init(&pool->sl);
 	/*
 	 * The generation of the new allocated counters in this pool is 0, 2 in
@@ -4215,6 +4266,39 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 	return cont;
 }
 
+/**
+ * Update the minimum dcs-id for the aged or non-aged counter pool.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] pool
+ *   Current counter pool.
+ * @param[in] batch
+ *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the counter is for aging.
+ */
+static void
+flow_dv_counter_update_min_dcs(struct rte_eth_dev *dev,
+			struct mlx5_flow_counter_pool *pool,
+			uint32_t batch, uint32_t age)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_flow_counter_pool *other;
+	struct mlx5_pools_container *cont;
+
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, (age ^ 0x1));
+	other = flow_dv_find_pool_by_id(cont, pool->min_dcs->id);
+	if (!other)
+		return;
+	if (pool->min_dcs->id < other->min_dcs->id) {
+		rte_atomic64_set(&other->a64_dcs,
+			rte_atomic64_read(&pool->a64_dcs));
+	} else {
+		rte_atomic64_set(&pool->a64_dcs,
+			rte_atomic64_read(&other->a64_dcs));
+	}
+}
 /**
  * Prepare a new counter and/or a new counter pool.
  *
@@ -4224,6 +4308,8 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
  *   Where to put the pointer of a new counter.
  * @param[in] batch
  *   Whether the pool is for counter that was allocated by batch command.
+ * @param[in] age
+ *   Whether the pool is for counter that was allocated for aging.
  *
  * @return
  *   The counter container pointer and @p cnt_free is set on success,
@@ -4232,7 +4318,7 @@ flow_dv_pool_create(struct rte_eth_dev *dev, struct mlx5_devx_obj *dcs,
 static struct mlx5_pools_container *
 flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			     struct mlx5_flow_counter **cnt_free,
-			     uint32_t batch)
+			     uint32_t batch, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_pools_container *cont;
@@ -4241,7 +4327,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 	struct mlx5_flow_counter *cnt;
 	uint32_t i;
 
-	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0);
+	cont = MLX5_CNT_CONTAINER(priv->sh, batch, 0, age);
 	if (!batch) {
 		/* bulk_bitmap must be 0 for single counter allocation. */
 		dcs = mlx5_devx_cmd_flow_counter_alloc(priv->sh->ctx, 0);
@@ -4249,7 +4335,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			return NULL;
 		pool = flow_dv_find_pool_by_id(cont, dcs->id);
 		if (!pool) {
-			cont = flow_dv_pool_create(dev, dcs, batch);
+			cont = flow_dv_pool_create(dev, dcs, batch, age);
 			if (!cont) {
 				mlx5_devx_cmd_destroy(dcs);
 				return NULL;
@@ -4259,6 +4345,8 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 			rte_atomic64_set(&pool->a64_dcs,
 					 (int64_t)(uintptr_t)dcs);
 		}
+		flow_dv_counter_update_min_dcs(dev,
+						pool, batch, age);
 		i = dcs->id % MLX5_COUNTERS_PER_POOL;
 		cnt = MLX5_POOL_GET_CNT(pool, i);
 		TAILQ_INSERT_HEAD(&pool->counters, cnt, next);
@@ -4273,7 +4361,7 @@ flow_dv_counter_pool_prepare(struct rte_eth_dev *dev,
 		rte_errno = ENODATA;
 		return NULL;
 	}
-	cont = flow_dv_pool_create(dev, dcs, batch);
+	cont = flow_dv_pool_create(dev, dcs, batch, age);
 	if (!cont) {
 		mlx5_devx_cmd_destroy(dcs);
 		return NULL;
@@ -4334,13 +4422,15 @@ flow_dv_counter_shared_search(struct mlx5_pools_container *cont, uint32_t id,
  *   Counter identifier.
  * @param[in] group
  *   Counter flow group.
+ * @param[in] age
+ *   Whether the counter was allocated for aging.
  *
  * @return
  *   Index to flow counter on success, 0 otherwise and rte_errno is set.
  */
 static uint32_t
 flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
-		      uint16_t group)
+		      uint16_t group, uint32_t age)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_flow_counter_pool *pool = NULL;
@@ -4356,7 +4446,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	 */
 	uint32_t batch = (group && !shared && !priv->counter_fallback) ? 1 : 0;
 	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, batch,
-							       0);
+							       0, age);
 	uint32_t cnt_idx;
 
 	if (!priv->config.devx) {
@@ -4395,13 +4485,13 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 		cnt_free = NULL;
 	}
 	if (!cnt_free) {
-		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch);
+		cont = flow_dv_counter_pool_prepare(dev, &cnt_free, batch, age);
 		if (!cont)
 			return 0;
 		pool = TAILQ_FIRST(&cont->pool_list);
 	}
 	if (!batch)
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt_free);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt_free);
 	/* Create a DV counter action only in the first time usage. */
 	if (!cnt_free->action) {
 		uint16_t offset;
@@ -4424,6 +4514,7 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	cnt_idx = MLX5_MAKE_CNT_IDX(pool->index,
 				MLX5_CNT_ARRAY_IDX(pool, cnt_free));
 	cnt_idx += batch * MLX5_CNT_BATCH_OFFSET;
+	cnt_idx += age * MLX5_CNT_AGE_OFFSET;
 	/* Update the counter reset values. */
 	if (_flow_dv_query_count(dev, cnt_idx, &cnt_free->hits,
 				 &cnt_free->bytes))
@@ -4445,6 +4536,64 @@ flow_dv_counter_alloc(struct rte_eth_dev *dev, uint32_t shared, uint32_t id,
 	return cnt_idx;
 }
 
+/**
+ * Get age param from counter index.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] counter
+ *   Index to the counter handler.
+ *
+ * @return
+ *   The aging parameter specified for the counter index.
+ */
+static struct mlx5_age_param*
+flow_dv_counter_idx_get_age(struct rte_eth_dev *dev,
+				uint32_t counter)
+{
+	struct mlx5_flow_counter *cnt;
+	struct mlx5_flow_counter_pool *pool = NULL;
+
+	flow_dv_counter_get_by_idx(dev, counter, &pool);
+	counter = (counter - 1) % MLX5_COUNTERS_PER_POOL;
+	cnt = MLX5_POOL_GET_CNT(pool, counter);
+	return MLX5_CNT_TO_AGE(cnt);
+}
+
+/**
+ * Remove a flow counter from aged counter list.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[in] counter
+ *   Index to the counter handler.
+ * @param[in] cnt
+ *   Pointer to the counter handler.
+ */
+static void
+flow_dv_counter_remove_from_age(struct rte_eth_dev *dev,
+				uint32_t counter, struct mlx5_flow_counter *cnt)
+{
+	struct mlx5_age_info *age_info;
+	struct mlx5_age_param *age_param;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
+	age_info = GET_PORT_AGE_INFO(priv);
+	age_param = flow_dv_counter_idx_get_age(dev, counter);
+	if (rte_atomic16_cmpset((volatile uint16_t *)
+			&age_param->state,
+			AGE_CANDIDATE, AGE_FREE)
+			!= AGE_CANDIDATE) {
+		/**
+		 * We need the lock even if the age has timed out,
+		 * since the counter may still be in process.
+		 */
+		rte_spinlock_lock(&age_info->aged_sl);
+		TAILQ_REMOVE(&age_info->aged_counters, cnt, next);
+		rte_spinlock_unlock(&age_info->aged_sl);
+	}
+	rte_atomic16_set(&age_param->state, AGE_FREE);
+}
 /**
  * Release a flow counter.
  *
@@ -4465,10 +4614,12 @@ flow_dv_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 	cnt = flow_dv_counter_get_by_idx(dev, counter, &pool);
 	MLX5_ASSERT(pool);
 	if (counter < MLX5_CNT_BATCH_OFFSET) {
-		cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+		cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 		if (cnt_ext && --cnt_ext->ref_cnt)
 			return;
 	}
+	if (IS_AGE_POOL(pool))
+		flow_dv_counter_remove_from_age(dev, counter, cnt);
 	/* Put the counter in the end - the last updated one. */
 	TAILQ_INSERT_TAIL(&pool->counters, cnt, next);
 	/*
@@ -5243,6 +5394,15 @@ flow_dv_validate(struct rte_eth_dev *dev, const struct rte_flow_attr *attr,
 			/* Meter action will add one more TAG action. */
 			rw_act_num += MLX5_ACT_NUM_SET_TAG;
 			break;
+		case RTE_FLOW_ACTION_TYPE_AGE:
+			ret = flow_dv_validate_action_age(action_flags,
+							  actions, dev,
+							  error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_FLOW_ACTION_AGE;
+			++actions_n;
+			break;
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP:
 			ret = flow_dv_validate_action_modify_ipv4_dscp
 							 (action_flags,
@@ -7281,6 +7441,53 @@ flow_dv_translate_action_port_id(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/**
+ * Create a counter with aging configuration.
+ *
+ * @param[in] dev
+ *   Pointer to rte_eth_dev structure.
+ * @param[in] count
+ *   Pointer to the counter action configuration.
+ * @param[in] age
+ *   Pointer to the aging action configuration.
+ *
+ * @return
+ *   Index to flow counter on success, 0 otherwise.
+ */
+static uint32_t
+flow_dv_translate_create_counter(struct rte_eth_dev *dev,
+				struct mlx5_flow *dev_flow,
+				const struct rte_flow_action_count *count,
+				const struct rte_flow_action_age *age)
+{
+	uint32_t counter;
+	struct mlx5_age_param *age_param;
+
+	counter = flow_dv_counter_alloc(dev,
+				count ? count->shared : 0,
+				count ? count->id : 0,
+				dev_flow->dv.group, !!age);
+	if (!counter || age == NULL)
+		return counter;
+	age_param  = flow_dv_counter_idx_get_age(dev, counter);
+	/*
+	 * Use the application context if one was given, otherwise
+	 * fall back to the flow index as the reported context.
+	 */
+	age_param->context = age->context ? age->context :
+		(void *)(uintptr_t)(dev_flow->flow_idx);
+	/*
+	 * The counter age accuracy may have a bit of delay. Apply a 3/4
+	 * second bias to the timeout in order to let it age in time.
+	 */
+	age_param->timeout = age->timeout * 10 - MLX5_AGING_TIME_DELAY;
+	age_param->port_id = dev->data->port_id;
+	/* Set expire time in unit of 0.1 sec. */
+	age_param->expire = age_param->timeout +
+			rte_rdtsc() / (rte_get_tsc_hz() / 10);
+	rte_atomic16_set(&age_param->state, AGE_CANDIDATE);
+	return counter;
+}
 /**
  * Add Tx queue matcher
  *
@@ -7450,6 +7657,8 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 			    (MLX5_MAX_MODIFY_NUM + 1)];
 	} mhdr_dummy;
 	struct mlx5_flow_dv_modify_hdr_resource *mhdr_res = &mhdr_dummy.res;
+	const struct rte_flow_action_count *count = NULL;
+	const struct rte_flow_action_age *age = NULL;
 	union flow_dv_attr flow_attr = { .attr = 0 };
 	uint32_t tag_be;
 	union mlx5_flow_tbl_key tbl_key;
@@ -7478,7 +7687,6 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 		const struct rte_flow_action_queue *queue;
 		const struct rte_flow_action_rss *rss;
 		const struct rte_flow_action *action = actions;
-		const struct rte_flow_action_count *count = action->conf;
 		const uint8_t *rss_key;
 		const struct rte_flow_action_jump *jump_data;
 		const struct rte_flow_action_meter *mtr;
@@ -7607,36 +7815,21 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 			action_flags |= MLX5_FLOW_ACTION_RSS;
 			dev_flow->handle->fate_action = MLX5_FLOW_FATE_QUEUE;
 			break;
+		case RTE_FLOW_ACTION_TYPE_AGE:
 		case RTE_FLOW_ACTION_TYPE_COUNT:
 			if (!dev_conf->devx) {
-				rte_errno = ENOTSUP;
-				goto cnt_err;
-			}
-			flow->counter = flow_dv_counter_alloc(dev,
-							count->shared,
-							count->id,
-							dev_flow->dv.group);
-			if (!flow->counter)
-				goto cnt_err;
-			dev_flow->dv.actions[actions_n++] =
-				  (flow_dv_counter_get_by_idx(dev,
-				  flow->counter, NULL))->action;
-			action_flags |= MLX5_FLOW_ACTION_COUNT;
-			break;
-cnt_err:
-			if (rte_errno == ENOTSUP)
 				return rte_flow_error_set
 					      (error, ENOTSUP,
 					       RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
 					       NULL,
 					       "count action not supported");
+			}
+			/* Save information first, will apply later. */
+			if (actions->type == RTE_FLOW_ACTION_TYPE_COUNT)
+				count = action->conf;
 			else
-				return rte_flow_error_set
-						(error, rte_errno,
-						 RTE_FLOW_ERROR_TYPE_ACTION,
-						 action,
-						 "cannot create counter"
-						  " object.");
+				age = action->conf;
+			action_flags |= MLX5_FLOW_ACTION_COUNT;
 			break;
 		case RTE_FLOW_ACTION_TYPE_OF_POP_VLAN:
 			dev_flow->dv.actions[actions_n++] =
@@ -7909,6 +8102,22 @@ __flow_dv_translate(struct rte_eth_dev *dev,
 				dev_flow->dv.actions[modify_action_position] =
 					handle->dvh.modify_hdr->verbs_action;
 			}
+			if (action_flags & MLX5_FLOW_ACTION_COUNT) {
+				flow->counter =
+					flow_dv_translate_create_counter(dev,
+						dev_flow, count, age);
+
+				if (!flow->counter)
+					return rte_flow_error_set
+						(error, rte_errno,
+						RTE_FLOW_ERROR_TYPE_ACTION,
+						NULL,
+						"cannot create counter"
+						" object.");
+				dev_flow->dv.actions[actions_n++] =
+					  (flow_dv_counter_get_by_idx(dev,
+					  flow->counter, NULL))->action;
+			}
 			break;
 		default:
 			break;
@@ -9169,6 +9378,60 @@ flow_dv_counter_query(struct rte_eth_dev *dev, uint32_t counter, bool clear,
 	return 0;
 }
 
+/**
+ * Get aged-out flows.
+ *
+ * @param[in] dev
+ *   Pointer to the Ethernet device structure.
+ * @param[out] context
+ *   The address of an array of pointers to the aged-out flows contexts.
+ * @param[in] nb_contexts
+ *   The length of context array pointers.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL. Initialized in case of
+ *   error only.
+ *
+ * @return
+ *   The number of contexts retrieved on success, otherwise a negative
+ *   errno value. If nb_contexts is 0, return the number of all aged
+ *   contexts. If nb_contexts is not 0, return the number of aged flows
+ *   reported in the context array.
+ * @note: only stub for now
+ */
+static int
+flow_get_aged_flows(struct rte_eth_dev *dev,
+		    void **context,
+		    uint32_t nb_contexts,
+		    struct rte_flow_error *error)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_age_info *age_info;
+	struct mlx5_age_param *age_param;
+	struct mlx5_flow_counter *counter;
+	int nb_flows = 0;
+
+	if (nb_contexts && !context)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					  NULL,
+					  "Should assign at least one flow or"
+					  " context to get if nb_contexts != 0");
+	age_info = GET_PORT_AGE_INFO(priv);
+	rte_spinlock_lock(&age_info->aged_sl);
+	TAILQ_FOREACH(counter, &age_info->aged_counters, next) {
+		nb_flows++;
+		if (nb_contexts) {
+			age_param = MLX5_CNT_TO_AGE(counter);
+			context[nb_flows - 1] = age_param->context;
+			if (!(--nb_contexts))
+				break;
+		}
+	}
+	rte_spinlock_unlock(&age_info->aged_sl);
+	MLX5_AGE_SET(age_info, MLX5_AGE_TRIGGER);
+	return nb_flows;
+}
+
 /*
  * Mutex-protected thunk to lock-free  __flow_dv_translate().
  */
@@ -9235,7 +9498,7 @@ flow_dv_counter_allocate(struct rte_eth_dev *dev)
 	uint32_t cnt;
 
 	flow_dv_shared_lock(dev);
-	cnt = flow_dv_counter_alloc(dev, 0, 0, 1);
+	cnt = flow_dv_counter_alloc(dev, 0, 0, 1, 0);
 	flow_dv_shared_unlock(dev);
 	return cnt;
 }
@@ -9266,6 +9529,7 @@ const struct mlx5_flow_driver_ops mlx5_flow_dv_drv_ops = {
 	.counter_alloc = flow_dv_counter_allocate,
 	.counter_free = flow_dv_counter_free,
 	.counter_query = flow_dv_counter_query,
+	.get_aged_flows = flow_get_aged_flows,
 };
 
 #endif /* HAVE_IBV_FLOW_DV_SUPPORT */
diff --git a/drivers/net/mlx5/mlx5_flow_verbs.c b/drivers/net/mlx5/mlx5_flow_verbs.c
index 236d665852..7efd97f547 100644
--- a/drivers/net/mlx5/mlx5_flow_verbs.c
+++ b/drivers/net/mlx5/mlx5_flow_verbs.c
@@ -56,7 +56,8 @@ flow_verbs_counter_get_by_idx(struct rte_eth_dev *dev,
 			      struct mlx5_flow_counter_pool **ppool)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
+	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
+									0);
 	struct mlx5_flow_counter_pool *pool;
 
 	idx--;
@@ -151,7 +152,8 @@ static uint32_t
 flow_verbs_counter_new(struct rte_eth_dev *dev, uint32_t shared, uint32_t id)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0);
+	struct mlx5_pools_container *cont = MLX5_CNT_CONTAINER(priv->sh, 0, 0,
+									0);
 	struct mlx5_flow_counter_pool *pool = NULL;
 	struct mlx5_flow_counter_ext *cnt_ext = NULL;
 	struct mlx5_flow_counter *cnt = NULL;
@@ -251,7 +253,7 @@ flow_verbs_counter_release(struct rte_eth_dev *dev, uint32_t counter)
 
 	cnt = flow_verbs_counter_get_by_idx(dev, counter,
 					    &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	if (--cnt_ext->ref_cnt == 0) {
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 		claim_zero(mlx5_glue->destroy_counter_set(cnt_ext->cs));
@@ -282,7 +284,7 @@ flow_verbs_counter_query(struct rte_eth_dev *dev __rte_unused,
 		struct mlx5_flow_counter *cnt = flow_verbs_counter_get_by_idx
 						(dev, flow->counter, &pool);
 		struct mlx5_flow_counter_ext *cnt_ext = MLX5_CNT_TO_CNT_EXT
-						(cnt);
+						(pool, cnt);
 		struct rte_flow_query_count *qc = data;
 		uint64_t counters[2] = {0, 0};
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
@@ -1083,12 +1085,12 @@ flow_verbs_translate_action_count(struct mlx5_flow *dev_flow,
 	}
 #if defined(HAVE_IBV_DEVICE_COUNTERS_SET_V42)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	counter.counter_set_handle = cnt_ext->cs->handle;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #elif defined(HAVE_IBV_DEVICE_COUNTERS_SET_V45)
 	cnt = flow_verbs_counter_get_by_idx(dev, flow->counter, &pool);
-	cnt_ext = MLX5_CNT_TO_CNT_EXT(cnt);
+	cnt_ext = MLX5_CNT_TO_CNT_EXT(pool, cnt);
 	counter.counters = cnt_ext->cs;
 	flow_verbs_spec_add(&dev_flow->verbs, &counter, size);
 #endif
-- 
2.21.0
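
For readers following the time arithmetic in
flow_dv_translate_create_counter() above: the rte_flow timeout is given in
seconds, the driver multiplies it by 10 to work in 0.1 s units, subtracts a
small bias, and adds the current time derived from the TSC. Below is a
minimal standalone sketch of that arithmetic only. The helper names are
hypothetical, and the bias value of 7 (0.7 s) is an assumption picked to
approximate the "3/4 second" mentioned in the comment; the real
MLX5_AGING_TIME_DELAY is defined elsewhere in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for MLX5_AGING_TIME_DELAY, in 0.1 s units. */
#define AGING_TIME_DELAY_TENTHS 7u

/* Convert a timeout in seconds to 0.1 s units, minus the bias. */
static uint32_t
age_timeout_tenths(uint32_t timeout_sec)
{
	/* A zero timeout would wrap around in unsigned arithmetic. */
	assert(timeout_sec >= 1);
	return timeout_sec * 10u - AGING_TIME_DELAY_TENTHS;
}

/* Current time in 0.1 s units from a cycle count and its frequency. */
static uint64_t
now_tenths(uint64_t tsc, uint64_t tsc_hz)
{
	/* Avoid dividing by zero for implausibly low frequencies. */
	assert(tsc_hz >= 10);
	return tsc / (tsc_hz / 10u);
}

/* Absolute expiry time in 0.1 s units, as in the patch's "expire" field. */
static uint64_t
age_expire_tenths(uint32_t timeout_sec, uint64_t tsc, uint64_t tsc_hz)
{
	return age_timeout_tenths(timeout_sec) + now_tenths(tsc, tsc_hz);
}
```

With a 10 s timeout, a 2 GHz counter and 5 s of elapsed cycles, the biased
timeout is 93 tenths and the expiry lands at 143 tenths. The sketch asserts
on a zero timeout rather than guessing how the driver handles it.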


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v4] ethdev: support flow aging
  2020-04-21 10:11         ` [dpdk-dev] [PATCH v4] " Bill Zhou
  2020-04-21 17:13           ` Ferruh Yigit
@ 2020-04-29 14:50           ` Tom Barbette
  2020-04-30  7:36             ` Matan Azrad
  1 sibling, 1 reply; 50+ messages in thread
From: Tom Barbette @ 2020-04-29 14:50 UTC (permalink / raw)
  To: Bill Zhou, orika, matan, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic, thomas,
	ferruh.yigit, arybchenko
  Cc: dev

Great news!

- I can understand why there is no timeout unit, but that is a recipe for
a user nightmare. E.g. I could only learn from the code (and not from the
documentation yet?) of the following mlx5 driver patch that the value
should be in tenths of a second. If I build an application that is
supposed to work with "any NIC", what can I do? We'd need a way to query
the timeout unit (probably in dev_info).
- It's not totally clear whether the rule is automatically removed or not.
Is this a helper or an OpenFlow-like notification?
- Find a typo and a grammar fix inline.
- Recently, Mellanox introduced the ability to create 330K flows/s. Are
there any performance considerations if those flows "expire" at the same
rate?


Hope it's helpful,

Tom

Le 21/04/2020 à 12:11, Bill Zhou a écrit :
> From: Dong Zhou <dongz@mellanox.com>
> 
> One of the reasons to destroy a flow is the fact that no packet matches the
> flow for "timeout" time.
> For example, when TCP\UDP sessions are suddenly closed.
> 
> Currently, there is not any DPDK mechanism for flow aging and the
> applications use their own ways to detect and destroy aged-out flows.
> 
> The flow aging implementation need include:
> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and
>    the application flow context for each flow.
> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to report
>    that there are new aged-out flows.
> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>    contexts from the port.
> - Support input flow aging command line in Testpmd.
> 
> The new event type addition in the enum is flagged as an ABI breakage, so
> an ignore rule is added for these reasons:
> - It is not changing value of existing types (except MAX)
> - The new value is not used by existing API if the event is not registered
> In general, it is safe adding new ethdev event types at the end of the
> enum, because of event callback registration mechanism.
> 
> Signed-off-by: Dong Zhou <dongz@mellanox.com>
> ---
> v2: Removing "* Added support for flow Aging mechanism base on counter."
> this line from doc/guides/rel_notes/release_20_05.rst, this patch does not
> include this support.
> 
> v3: Update file libabigail.abignore, add one new suppressed enumeration
> type for RTE_ETH_EVENT_MAX.
> 
> v4: Add justification in devtools/libabigail.abignore and in the commit
> log about the modification of v3.
> ---
>   app/test-pmd/cmdline_flow.c              | 26 ++++++++++
>   devtools/libabigail.abignore             |  6 +++
>   doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
>   doc/guides/rel_notes/release_20_05.rst   | 11 +++++
>   lib/librte_ethdev/rte_ethdev.h           |  1 +
>   lib/librte_ethdev/rte_ethdev_version.map |  3 ++
>   lib/librte_ethdev/rte_flow.c             | 18 +++++++
>   lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
>   lib/librte_ethdev/rte_flow_driver.h      |  6 +++
>   9 files changed, 155 insertions(+)
> 
> diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
> index e6ab8ff2f7..45bcff3cf5 100644
> --- a/app/test-pmd/cmdline_flow.c
> +++ b/app/test-pmd/cmdline_flow.c
> @@ -343,6 +343,8 @@ enum index {
>   	ACTION_SET_IPV4_DSCP_VALUE,
>   	ACTION_SET_IPV6_DSCP,
>   	ACTION_SET_IPV6_DSCP_VALUE,
> +	ACTION_AGE,
> +	ACTION_AGE_TIMEOUT,
>   };
>   
>   /** Maximum size for pattern in struct rte_flow_item_raw. */
> @@ -1145,6 +1147,7 @@ static const enum index next_action[] = {
>   	ACTION_SET_META,
>   	ACTION_SET_IPV4_DSCP,
>   	ACTION_SET_IPV6_DSCP,
> +	ACTION_AGE,
>   	ZERO,
>   };
>   
> @@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[] = {
>   	ZERO,
>   };
>   
> +static const enum index action_age[] = {
> +	ACTION_AGE,
> +	ACTION_AGE_TIMEOUT,
> +	ACTION_NEXT,
> +	ZERO,
> +};
> +
>   static int parse_set_raw_encap_decap(struct context *, const struct token *,
>   				     const char *, unsigned int,
>   				     void *, unsigned int);
> @@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
>   			     (struct rte_flow_action_set_dscp, dscp)),
>   		.call = parse_vc_conf,
>   	},
> +	[ACTION_AGE] = {
> +		.name = "age",
> +		.help = "set a specific metadata header",
> +		.next = NEXT(action_age),
> +		.priv = PRIV_ACTION(AGE,
> +			sizeof(struct rte_flow_action_age)),
> +		.call = parse_vc,
> +	},
> +	[ACTION_AGE_TIMEOUT] = {
> +		.name = "timeout",
> +		.help = "flow age timeout value",
> +		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
> +					   timeout, 24)),
> +		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
> +		.call = parse_vc_conf,
> +	},
>   };
>   
>   /** Remove and return last entry from argument stack. */
> diff --git a/devtools/libabigail.abignore b/devtools/libabigail.abignore
> index a59df8f135..c047adbd79 100644
> --- a/devtools/libabigail.abignore
> +++ b/devtools/libabigail.abignore
> @@ -11,3 +11,9 @@
>           type_kind = enum
>           name = rte_crypto_asym_xform_type
>           changed_enumerators = RTE_CRYPTO_ASYM_XFORM_TYPE_LIST_END
> +; Ignore ethdev event enum update because new event cannot be
> +; received if not registered
> +[suppress_type]
> +        type_kind = enum
> +        name = rte_eth_event_type
> +        changed_enumerators = RTE_ETH_EVENT_MAX
> diff --git a/doc/guides/prog_guide/rte_flow.rst b/doc/guides/prog_guide/rte_flow.rst
> index 41c147913c..cf4368e1c4 100644
> --- a/doc/guides/prog_guide/rte_flow.rst
> +++ b/doc/guides/prog_guide/rte_flow.rst
> @@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
>      | ``dscp``  | DSCP in low 6 bits, rest ignore |
>      +-----------+---------------------------------+
>   
> +Action: ``AGE``
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Set ageing timeout configuration to a flow.
> +
> +Event RTE_ETH_EVENT_FLOW_AGED will be reported if
> +timeout passed without any matching on the flow.
> +
> +.. _table_rte_flow_action_age:
> +
> +.. table:: AGE
> +
> +   +--------------+---------------------------------+
> +   | Field        | Value                           |
> +   +==============+=================================+
> +   | ``timeout``  | 24 bits timeout value           |
> +   +--------------+---------------------------------+
> +   | ``reserved`` | 8 bits reserved, must be zero   |
> +   +--------------+---------------------------------+
> +   | ``context``  | user input flow context         |
> +   +--------------+---------------------------------+
> +
>   Negative types
>   ~~~~~~~~~~~~~~
>   
> diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
> index bacd4c65a2..ff0cf9f1d6 100644
> --- a/doc/guides/rel_notes/release_20_05.rst
> +++ b/doc/guides/rel_notes/release_20_05.rst
> @@ -135,6 +135,17 @@ New Features
>     by making use of the event device capabilities. The event mode currently supports
>     only inline IPsec protocol offload.
>   
> +* **Added flow Aging Support.**
> +
> +  Added flow Aging support to detect and report aged-out flows, including:
> +
> +  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout and the
> +    application flow context for each flow.
> +  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to report that
> +    there are new aged-out flows.
> +  * Added new API: rte_flow_get_aged_flows to get the aged-out flows contexts
> +    from the port.
> +
>   
>   Removed Items
>   -------------
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 8d69b88f9e..00cc7b4052 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3018,6 +3018,7 @@ enum rte_eth_event_type {
>   	RTE_ETH_EVENT_NEW,      /**< port is probed */
>   	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>   	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
>   	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>   };
>   
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index 3f32fdecf7..fa4b5816be 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -230,4 +230,7 @@ EXPERIMENTAL {
>   
>   	# added in 20.02
>   	rte_flow_dev_dump;
> +
> +	# added in 20.05
> +	rte_flow_get_aged_flows;
>   };
> diff --git a/lib/librte_ethdev/rte_flow.c b/lib/librte_ethdev/rte_flow.c
> index a5ac1c7fbd..3699edce49 100644
> --- a/lib/librte_ethdev/rte_flow.c
> +++ b/lib/librte_ethdev/rte_flow.c
> @@ -172,6 +172,7 @@ static const struct rte_flow_desc_data rte_flow_desc_action[] = {
>   	MK_FLOW_ACTION(SET_META, sizeof(struct rte_flow_action_set_meta)),
>   	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct rte_flow_action_set_dscp)),
>   	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct rte_flow_action_set_dscp)),
> +	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
>   };
>   
>   int
> @@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file, struct rte_flow_error *error)
>   				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
>   				  NULL, rte_strerror(ENOSYS));
>   }
> +
> +int
> +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> +		    uint32_t nb_contexts, struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +
> +	if (unlikely(!ops))
> +		return -rte_errno;
> +	if (likely(!!ops->get_aged_flows))
> +		return flow_err(port_id, ops->get_aged_flows(dev, contexts,
> +				nb_contexts, error), error);
> +	return rte_flow_error_set(error, ENOTSUP,
> +				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +				  NULL, rte_strerror(ENOTSUP));
> +}
> diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h
> index 7f3e08fad3..fab44f6c0b 100644
> --- a/lib/librte_ethdev/rte_flow.h
> +++ b/lib/librte_ethdev/rte_flow.h
> @@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
>   	 * See struct rte_flow_action_set_dscp.
>   	 */
>   	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
> +
> +	/**
> +	 * Report as aged flow if timeout passed without any matching on the
> +	 * flow.
> +	 *
> +	 * See struct rte_flow_action_age.
> +	 * See function rte_flow_get_aged_flows
> +	 * see enum RTE_ETH_EVENT_FLOW_AGED
> +	 */
> +	RTE_FLOW_ACTION_TYPE_AGE,
>   };
>   
>   /**
> @@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
>   	uint16_t index; /**< Queue index to use. */
>   };
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this structure may change without prior notice
> + *
> + * RTE_FLOW_ACTION_TYPE_AGE
> + *
> + * Report flow as aged-out if timeout passed without any matching
> + * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
> + * port detects new aged-out flows.
> + *
> + * The flow context and the flow handle will be reported by the
> + * rte_flow_get_aged_flows API.
> + */
> +struct rte_flow_action_age {
> +	uint32_t timeout:24; /**< Time in seconds. */
> +	uint32_t reserved:8; /**< Reserved, must be zero. */
> +	void *context;
> +		/**< The user flow context, NULL means the rte_flow pointer. */
> +};
>   
>   /**
>    * @warning
> @@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
>   	      const void *src,
>   	      struct rte_flow_error *error);
>   
> +/**
> + * Get aged-out flows of a given port.
> + *
> + * RTE_ETH_EVENT_FLOW_AGED event will be triggered when at least one new aged
> + * out flow was detected after the last call to rte_flow_get_aged_flows.
> + * This function can be called to get the aged flows usynchronously from the
Typo: "usynchronously" should be "asynchronously".
> + * event callback or synchronously regardless the event.
> + * This is not safe to call rte_flow_get_aged_flows function with other flow
Grammar: "It is not safe to".
> + * functions from multiple threads simultaneously.
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[in, out] contexts
> + *   The address of an array of pointers to the aged-out flows contexts.
> + * @param[in] nb_contexts
> + *   The length of context array pointers.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL. Initialized in case of
> + *   error only.
> + *
> + * @return
> + *   if nb_contexts is 0, return the amount of all aged contexts.
> + *   if nb_contexts is not 0 , return the amount of aged flows reported
> + *   in the context array, otherwise negative errno value.
> + *
> + * @see rte_flow_action_age
> + * @see RTE_ETH_EVENT_FLOW_AGED
> + */
> +__rte_experimental
> +int
> +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> +			uint32_t nb_contexts, struct rte_flow_error *error);
> +
>   #ifdef __cplusplus
>   }
>   #endif
> diff --git a/lib/librte_ethdev/rte_flow_driver.h b/lib/librte_ethdev/rte_flow_driver.h
> index 51a9a57a0f..881cc469b7 100644
> --- a/lib/librte_ethdev/rte_flow_driver.h
> +++ b/lib/librte_ethdev/rte_flow_driver.h
> @@ -101,6 +101,12 @@ struct rte_flow_ops {
>   		(struct rte_eth_dev *dev,
>   		 FILE *file,
>   		 struct rte_flow_error *error);
> +	/** See rte_flow_get_aged_flows() */
> +	int (*get_aged_flows)
> +		(struct rte_eth_dev *dev,
> +		 void **context,
> +		 uint32_t nb_contexts,
> +		 struct rte_flow_error *err);
>   };
>   
>   /**
> 
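
One way to respect the thread-safety note in the rte_flow_get_aged_flows comment is to keep the RTE_ETH_EVENT_FLOW_AGED callback trivial and let the single thread that owns flow operations do the query. Below is a minimal sketch of that idea, assuming C11 atomics are available. note_aged_event() and take_aged_event() are hypothetical names, not part of the patch, and registering the callback (e.g. with rte_eth_dev_callback_register) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set when the port reports new aged-out flows, cleared when handled. */
static atomic_bool aged_pending = false;

/* To be called from the event callback: only records that work exists. */
static void
note_aged_event(void)
{
	atomic_store(&aged_pending, true);
}

/*
 * To be called from the thread that owns all flow operations.
 * Returns true once per batch of events, so that thread knows it
 * should query the aged-out flows now.
 */
static bool
take_aged_event(void)
{
	return atomic_exchange(&aged_pending, false);
}
```

Several events raised before the owner thread polls collapse into a single query, which matches the "at least one new aged-out flow" wording of the event description.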

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v4] ethdev: support flow aging
  2020-04-29 14:50           ` Tom Barbette
@ 2020-04-30  7:36             ` Matan Azrad
  2020-04-30  7:49               ` Tom Barbette
  0 siblings, 1 reply; 50+ messages in thread
From: Matan Azrad @ 2020-04-30  7:36 UTC (permalink / raw)
  To: Tom Barbette, Bill Zhou, Ori Kam, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, ferruh.yigit, arybchenko
  Cc: dev


Hi Tom

From: Tom Barbette
> Great news!
> 
> - I can understand why there is no timeout unit, but that is a nightmare for
> users. E.g. I could only work out from the code of the following mlx5 driver
> patch (and not from the documentation yet?) that the value should be in tenths
> of a second. If I build an application that is supposed to work with "any
> NIC", what can I do? We'd need a way to query the timeout unit (probably in
> dev_info).

Please see the new age action structure in rte_flow.h.
The comments there state that the timeout unit is seconds.
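
To make the unit and range concrete, here is a minimal sketch, not code from the patch. The struct is only a local mirror of the layout quoted from rte_flow.h (a real application would include rte_flow.h and use struct rte_flow_action_age directly). age_conf_init() and AGE_TIMEOUT_MAX are hypothetical names used for illustration. Since the field is 24 bits wide and counts seconds, the largest timeout is 16777215 seconds, roughly 194 days.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the age action layout quoted in the patch. */
struct age_action_conf {
	uint32_t timeout:24; /* Time in seconds. */
	uint32_t reserved:8; /* Reserved, must be zero. */
	void *context;       /* User flow context. */
};

/* Largest value that fits in the 24-bit timeout field. */
#define AGE_TIMEOUT_MAX ((1u << 24) - 1)

/*
 * Hypothetical helper: fill the configuration, rejecting timeouts that
 * would be silently truncated by the bit-field.
 * Returns 0 on success, -1 if timeout_sec is out of range.
 */
static int
age_conf_init(struct age_action_conf *conf, uint32_t timeout_sec, void *ctx)
{
	assert(conf != NULL);
	if (timeout_sec > AGE_TIMEOUT_MAX)
		return -1;
	conf->timeout = timeout_sec;
	conf->reserved = 0;
	conf->context = ctx;
	return 0;
}
```

The range check matters because assigning a larger value to a 24-bit field would wrap instead of failing.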

> - It's not totally clear whether the rule is automatically removed or not. Is
> this a helper or an OpenFlow-like notification?

It is only a notification; the application should destroy the aged-out flow (or take another action) according to its needs.
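
As a rough sketch of that flow from the application side, assuming the return convention documented for rte_flow_get_aged_flows (a call with nb_contexts == 0 returns the number of aged contexts, a call with an array returns how many contexts were written): the port is replaced by a mock so the example is self-contained. struct mock_port, mock_get_aged(), drain_aged_flows() and mark_destroyed() are all hypothetical names, and the mock forgetting contexts once they are reported is an assumption, not something the patch states.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mock port: holds the contexts of flows that have aged out. */
struct mock_port {
	void *aged[8];
	uint32_t nb_aged;
};

/*
 * Mock query following the documented return convention:
 * nb_contexts == 0 -> number of aged contexts,
 * otherwise        -> number of contexts written to the array.
 * Assumption: contexts are forgotten once reported.
 */
static int
mock_get_aged(struct mock_port *port, void **contexts, uint32_t nb_contexts)
{
	uint32_t i, n;

	if (nb_contexts == 0)
		return (int)port->nb_aged;
	n = nb_contexts < port->nb_aged ? nb_contexts : port->nb_aged;
	for (i = 0; i < n; i++)
		contexts[i] = port->aged[i];
	for (i = n; i < port->nb_aged; i++)
		port->aged[i - n] = port->aged[i];
	port->nb_aged -= n;
	return (int)n;
}

/* Application policy: here the context is an int flag set on "destroy". */
static void
mark_destroyed(void *ctx)
{
	*(int *)ctx = 1;
}

/* Query the count, fetch the contexts, let the application act on each. */
static int
drain_aged_flows(struct mock_port *port, void (*on_aged)(void *ctx))
{
	int total, n, i;
	void **ctxs;

	assert(port != NULL);
	total = mock_get_aged(port, NULL, 0);
	if (total <= 0)
		return total;
	ctxs = calloc((size_t)total, sizeof(*ctxs));
	if (ctxs == NULL)
		return -1;
	n = mock_get_aged(port, ctxs, (uint32_t)total);
	for (i = 0; i < n; i++)
		on_aged(ctxs[i]);
	free(ctxs);
	return n;
}
```

In a real application the callback would typically look up the flow from its context and call rte_flow_destroy, or re-arm the flow, depending on the application's policy.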

> - Please find a typo and a grammar fix inline.
> - Recently, Mellanox introduced the ability to create 330K flows/s. Are there
> any performance considerations if those flows "expire" at the same rate?

We didn't see any performance impact (the rate should be the same as for the count action).

> 
> Hope it's helpful,
> 
> Tom
> 
> On 21/04/2020 at 12:11, Bill Zhou wrote:
> > From: Dong Zhou <dongz@mellanox.com>
> >
> > One of the reasons to destroy a flow is the fact that no packet
> > matches the flow for "timeout" time.
> > For example, when TCP\UDP sessions are suddenly closed.
> >
> > Currently, there is not any DPDK mechanism for flow aging and the
> > applications use their own ways to detect and destroy aged-out flows.
> >
> > The flow aging implementation need include:
> > - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
> and
> >    the application flow context for each flow.
> > - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report
> >    that there are new aged-out flows.
> > - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
> >    contexts from the port.
> > - Support input flow aging command line in Testpmd.
> >
> > The new event type addition in the enum is flagged as an ABI breakage,
> > so an ignore rule is added for these reasons:
> > - It is not changing value of existing types (except MAX)
> > - The new value is not used by existing API if the event is not
> > registered In general, it is safe adding new ethdev event types at the
> > end of the enum, because of event callback registration mechanism.
> >
> > Signed-off-by: Dong Zhou <dongz@mellanox.com>
> > ---
> > v2: Removing "* Added support for flow Aging mechanism base on
> counter."
> > this line from doc/guides/rel_notes/release_20_05.rst, this patch does
> > not include this support.
> >
> > v3: Update file libabigail.abignore, add one new suppressed
> > enumeration type for RTE_ETH_EVENT_MAX.
> >
> > v4: Add justification in devtools/libabigail.abignore and in the
> > commit log about the modification of v3.
> > ---
> >   app/test-pmd/cmdline_flow.c              | 26 ++++++++++
> >   devtools/libabigail.abignore             |  6 +++
> >   doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
> >   doc/guides/rel_notes/release_20_05.rst   | 11 +++++
> >   lib/librte_ethdev/rte_ethdev.h           |  1 +
> >   lib/librte_ethdev/rte_ethdev_version.map |  3 ++
> >   lib/librte_ethdev/rte_flow.c             | 18 +++++++
> >   lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
> >   lib/librte_ethdev/rte_flow_driver.h      |  6 +++
> >   9 files changed, 155 insertions(+)
> >
> > diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
> > index e6ab8ff2f7..45bcff3cf5 100644
> > --- a/app/test-pmd/cmdline_flow.c
> > +++ b/app/test-pmd/cmdline_flow.c
> > @@ -343,6 +343,8 @@ enum index {
> >   	ACTION_SET_IPV4_DSCP_VALUE,
> >   	ACTION_SET_IPV6_DSCP,
> >   	ACTION_SET_IPV6_DSCP_VALUE,
> > +	ACTION_AGE,
> > +	ACTION_AGE_TIMEOUT,
> >   };
> >
> >   /** Maximum size for pattern in struct rte_flow_item_raw. */ @@
> > -1145,6 +1147,7 @@ static const enum index next_action[] = {
> >   	ACTION_SET_META,
> >   	ACTION_SET_IPV4_DSCP,
> >   	ACTION_SET_IPV6_DSCP,
> > +	ACTION_AGE,
> >   	ZERO,
> >   };
> >
> > @@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[]
> = {
> >   	ZERO,
> >   };
> >
> > +static const enum index action_age[] = {
> > +	ACTION_AGE,
> > +	ACTION_AGE_TIMEOUT,
> > +	ACTION_NEXT,
> > +	ZERO,
> > +};
> > +
> >   static int parse_set_raw_encap_decap(struct context *, const struct
> token *,
> >   				     const char *, unsigned int,
> >   				     void *, unsigned int);
> > @@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
> >   			     (struct rte_flow_action_set_dscp, dscp)),
> >   		.call = parse_vc_conf,
> >   	},
> > +	[ACTION_AGE] = {
> > +		.name = "age",
> > +		.help = "set a specific metadata header",
> > +		.next = NEXT(action_age),
> > +		.priv = PRIV_ACTION(AGE,
> > +			sizeof(struct rte_flow_action_age)),
> > +		.call = parse_vc,
> > +	},
> > +	[ACTION_AGE_TIMEOUT] = {
> > +		.name = "timeout",
> > +		.help = "flow age timeout value",
> > +		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
> > +					   timeout, 24)),
> > +		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
> > +		.call = parse_vc_conf,
> > +	},
> >   };
> >
> >   /** Remove and return last entry from argument stack. */ diff --git
> > a/devtools/libabigail.abignore b/devtools/libabigail.abignore index
> > a59df8f135..c047adbd79 100644
> > --- a/devtools/libabigail.abignore
> > +++ b/devtools/libabigail.abignore
> > @@ -11,3 +11,9 @@
> >           type_kind = enum
> >           name = rte_crypto_asym_xform_type
> >           changed_enumerators =
> RTE_CRYPTO_ASYM_XFORM_TYPE_LIST_END
> > +; Ignore ethdev event enum update because new event cannot be ;
> > +received if not registered [suppress_type]
> > +        type_kind = enum
> > +        name = rte_eth_event_type
> > +        changed_enumerators = RTE_ETH_EVENT_MAX
> > diff --git a/doc/guides/prog_guide/rte_flow.rst
> > b/doc/guides/prog_guide/rte_flow.rst
> > index 41c147913c..cf4368e1c4 100644
> > --- a/doc/guides/prog_guide/rte_flow.rst
> > +++ b/doc/guides/prog_guide/rte_flow.rst
> > @@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION
> error will be returned.
> >      | ``dscp``  | DSCP in low 6 bits, rest ignore |
> >      +-----------+---------------------------------+
> >
> > +Action: ``AGE``
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Set ageing timeout configuration to a flow.
> > +
> > +Event RTE_ETH_EVENT_FLOW_AGED will be reported if timeout passed
> > +without any matching on the flow.
> > +
> > +.. _table_rte_flow_action_age:
> > +
> > +.. table:: AGE
> > +
> > +   +--------------+---------------------------------+
> > +   | Field        | Value                           |
> > +   +==============+=================================+
> > +   | ``timeout``  | 24 bits timeout value           |
> > +   +--------------+---------------------------------+
> > +   | ``reserved`` | 8 bits reserved, must be zero   |
> > +   +--------------+---------------------------------+
> > +   | ``context``  | user input flow context         |
> > +   +--------------+---------------------------------+
> > +
> >   Negative types
> >   ~~~~~~~~~~~~~~
> >
> > diff --git a/doc/guides/rel_notes/release_20_05.rst
> > b/doc/guides/rel_notes/release_20_05.rst
> > index bacd4c65a2..ff0cf9f1d6 100644
> > --- a/doc/guides/rel_notes/release_20_05.rst
> > +++ b/doc/guides/rel_notes/release_20_05.rst
> > @@ -135,6 +135,17 @@ New Features
> >     by making use of the event device capabilities. The event mode currently
> supports
> >     only inline IPsec protocol offload.
> >
> > +* **Added flow Aging Support.**
> > +
> > +  Added flow Aging support to detect and report aged-out flows,
> including:
> > +
> > +  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
> and the
> > +    application flow context for each flow.
> > +  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to
> report that
> > +    there are new aged-out flows.
> > +  * Added new API: rte_flow_get_aged_flows to get the aged-out flows
> contexts
> > +    from the port.
> > +
> >
> >   Removed Items
> >   -------------
> > diff --git a/lib/librte_ethdev/rte_ethdev.h
> > b/lib/librte_ethdev/rte_ethdev.h index 8d69b88f9e..00cc7b4052 100644
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -3018,6 +3018,7 @@ enum rte_eth_event_type {
> >   	RTE_ETH_EVENT_NEW,      /**< port is probed */
> >   	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >   	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
> */
> >   	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >   };
> >
> > diff --git a/lib/librte_ethdev/rte_ethdev_version.map
> > b/lib/librte_ethdev/rte_ethdev_version.map
> > index 3f32fdecf7..fa4b5816be 100644
> > --- a/lib/librte_ethdev/rte_ethdev_version.map
> > +++ b/lib/librte_ethdev/rte_ethdev_version.map
> > @@ -230,4 +230,7 @@ EXPERIMENTAL {
> >
> >   	# added in 20.02
> >   	rte_flow_dev_dump;
> > +
> > +	# added in 20.05
> > +	rte_flow_get_aged_flows;
> >   };
> > diff --git a/lib/librte_ethdev/rte_flow.c
> > b/lib/librte_ethdev/rte_flow.c index a5ac1c7fbd..3699edce49 100644
> > --- a/lib/librte_ethdev/rte_flow.c
> > +++ b/lib/librte_ethdev/rte_flow.c
> > @@ -172,6 +172,7 @@ static const struct rte_flow_desc_data
> rte_flow_desc_action[] = {
> >   	MK_FLOW_ACTION(SET_META, sizeof(struct
> rte_flow_action_set_meta)),
> >   	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct
> rte_flow_action_set_dscp)),
> >   	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct
> > rte_flow_action_set_dscp)),
> > +	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
> >   };
> >
> >   int
> > @@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file,
> struct rte_flow_error *error)
> >   				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> >   				  NULL, rte_strerror(ENOSYS));
> >   }
> > +
> > +int
> > +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> > +		    uint32_t nb_contexts, struct rte_flow_error *error) {
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> > +
> > +	if (unlikely(!ops))
> > +		return -rte_errno;
> > +	if (likely(!!ops->get_aged_flows))
> > +		return flow_err(port_id, ops->get_aged_flows(dev,
> contexts,
> > +				nb_contexts, error), error);
> > +	return rte_flow_error_set(error, ENOTSUP,
> > +				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > +				  NULL, rte_strerror(ENOTSUP));
> > +}
> > diff --git a/lib/librte_ethdev/rte_flow.h
> > b/lib/librte_ethdev/rte_flow.h index 7f3e08fad3..fab44f6c0b 100644
> > --- a/lib/librte_ethdev/rte_flow.h
> > +++ b/lib/librte_ethdev/rte_flow.h
> > @@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
> >   	 * See struct rte_flow_action_set_dscp.
> >   	 */
> >   	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
> > +
> > +	/**
> > +	 * Report as aged flow if timeout passed without any matching on the
> > +	 * flow.
> > +	 *
> > +	 * See struct rte_flow_action_age.
> > +	 * See function rte_flow_get_aged_flows
> > +	 * see enum RTE_ETH_EVENT_FLOW_AGED
> > +	 */
> > +	RTE_FLOW_ACTION_TYPE_AGE,
> >   };
> >
> >   /**
> > @@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
> >   	uint16_t index; /**< Queue index to use. */
> >   };
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this structure may change without prior notice
> > + *
> > + * RTE_FLOW_ACTION_TYPE_AGE
> > + *
> > + * Report flow as aged-out if timeout passed without any matching
> > + * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
> > + * port detects new aged-out flows.
> > + *
> > + * The flow context and the flow handle will be reported by the
> > + * rte_flow_get_aged_flows API.
> > + */
> > +struct rte_flow_action_age {
> > +	uint32_t timeout:24; /**< Time in seconds. */
> > +	uint32_t reserved:8; /**< Reserved, must be zero. */
> > +	void *context;
> > +		/**< The user flow context, NULL means the rte_flow
> pointer. */ };
> >
> >   /**
> >    * @warning
> > @@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
> >   	      const void *src,
> >   	      struct rte_flow_error *error);
> >
> > +/**
> > + * Get aged-out flows of a given port.
> > + *
> > + * RTE_ETH_EVENT_FLOW_AGED event will be triggered when at least
> one
> > +new aged
> > + * out flow was detected after the last call to rte_flow_get_aged_flows.
> > + * This function can be called to get the aged flows usynchronously
> > +from the
> asynchronously
> > + * event callback or synchronously regardless the event.
> > + * This is not safe to call rte_flow_get_aged_flows function with
> > + other flow
> It is not safe to
> > + * functions from multiple threads simultaneously.
> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param[in, out] contexts
> > + *   The address of an array of pointers to the aged-out flows contexts.
> > + * @param[in] nb_contexts
> > + *   The length of context array pointers.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL. Initialized in case of
> > + *   error only.
> > + *
> > + * @return
> > + *   if nb_contexts is 0, return the amount of all aged contexts.
> > + *   if nb_contexts is not 0 , return the amount of aged flows reported
> > + *   in the context array, otherwise negative errno value.
> > + *
> > + * @see rte_flow_action_age
> > + * @see RTE_ETH_EVENT_FLOW_AGED
> > + */
> > +__rte_experimental
> > +int
> > +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
> > +			uint32_t nb_contexts, struct rte_flow_error *error);
> > +
> >   #ifdef __cplusplus
> >   }
> >   #endif
> > diff --git a/lib/librte_ethdev/rte_flow_driver.h
> > b/lib/librte_ethdev/rte_flow_driver.h
> > index 51a9a57a0f..881cc469b7 100644
> > --- a/lib/librte_ethdev/rte_flow_driver.h
> > +++ b/lib/librte_ethdev/rte_flow_driver.h
> > @@ -101,6 +101,12 @@ struct rte_flow_ops {
> >   		(struct rte_eth_dev *dev,
> >   		 FILE *file,
> >   		 struct rte_flow_error *error);
> > +	/** See rte_flow_get_aged_flows() */
> > +	int (*get_aged_flows)
> > +		(struct rte_eth_dev *dev,
> > +		 void **context,
> > +		 uint32_t nb_contexts,
> > +		 struct rte_flow_error *err);
> >   };
> >
> >   /**
> >

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v4] ethdev: support flow aging
  2020-04-30  7:36             ` Matan Azrad
@ 2020-04-30  7:49               ` Tom Barbette
  0 siblings, 0 replies; 50+ messages in thread
From: Tom Barbette @ 2020-04-30  7:49 UTC (permalink / raw)
  To: Matan Azrad, Bill Zhou, Ori Kam, wenzhuo.lu, jingjing.wu,
	bernard.iremonger, john.mcnamara, marko.kovacevic,
	Thomas Monjalon, ferruh.yigit, arybchenko
  Cc: dev



On 30/04/2020 at 09:36, Matan Azrad wrote:
> 
> Hi Tom
> 
> From: Tom Barbette
>> Great news!
>>
>> - I can understand why there is no timeout unit, but that is a nightmare for
>> users. E.g. I could only work out from the code of the following mlx5 driver
>> patch (and not from the documentation yet?) that the value should be in tenths
>> of a second. If I build an application that is supposed to work with "any
>> NIC", what can I do? We'd need a way to query the timeout unit (probably in
>> dev_info).
> 
> Please see the new age action structure in rte_flow.h.
> The comments there state that the timeout unit is seconds.
Oh okay, I did not catch that. Mentioning the unit in the AGE action
documentation in rte_flow.rst would be helpful.
> 
>> - It's not totally clear whether the rule is automatically removed or not. Is
>> this a helper or an OpenFlow-like notification?
> 
> It is only a notification; the application should destroy the aged-out flow (or take another action) according to its needs.
Makes sense.
> 
>> - Please find a typo and a grammar fix inline.
>> - Recently, Mellanox introduced the ability to create 330K flows/s. Are there
>> any performance considerations if those flows "expire" at the same rate?
> 
> We didn't see any performance impact (the rate should be the same as for the count action).
Ok great!

Thanks!
> 
>>
>> Hope it's helpful,
>>
>> Tom
>>
>> On 21/04/2020 at 12:11, Bill Zhou wrote:
>>> From: Dong Zhou <dongz@mellanox.com>
>>>
>>> One of the reasons to destroy a flow is the fact that no packet
>>> matches the flow for "timeout" time.
>>> For example, when TCP\UDP sessions are suddenly closed.
>>>
>>> Currently, there is not any DPDK mechanism for flow aging and the
>>> applications use their own ways to detect and destroy aged-out flows.
>>>
>>> The flow aging implementation need include:
>>> - A new rte_flow action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
>> and
>>>     the application flow context for each flow.
>>> - A new ethdev event: RTE_ETH_EVENT_FLOW_AGED for the driver to
>> report
>>>     that there are new aged-out flows.
>>> - A new rte_flow API: rte_flow_get_aged_flows to get the aged-out flows
>>>     contexts from the port.
>>> - Support input flow aging command line in Testpmd.
>>>
>>> The new event type addition in the enum is flagged as an ABI breakage,
>>> so an ignore rule is added for these reasons:
>>> - It is not changing value of existing types (except MAX)
>>> - The new value is not used by existing API if the event is not
>>> registered In general, it is safe adding new ethdev event types at the
>>> end of the enum, because of event callback registration mechanism.
>>>
>>> Signed-off-by: Dong Zhou <dongz@mellanox.com>
>>> ---
>>> v2: Removing "* Added support for flow Aging mechanism base on
>> counter."
>>> this line from doc/guides/rel_notes/release_20_05.rst, this patch does
>>> not include this support.
>>>
>>> v3: Update file libabigail.abignore, add one new suppressed
>>> enumeration type for RTE_ETH_EVENT_MAX.
>>>
>>> v4: Add justification in devtools/libabigail.abignore and in the
>>> commit log about the modification of v3.
>>> ---
>>>    app/test-pmd/cmdline_flow.c              | 26 ++++++++++
>>>    devtools/libabigail.abignore             |  6 +++
>>>    doc/guides/prog_guide/rte_flow.rst       | 22 +++++++++
>>>    doc/guides/rel_notes/release_20_05.rst   | 11 +++++
>>>    lib/librte_ethdev/rte_ethdev.h           |  1 +
>>>    lib/librte_ethdev/rte_ethdev_version.map |  3 ++
>>>    lib/librte_ethdev/rte_flow.c             | 18 +++++++
>>>    lib/librte_ethdev/rte_flow.h             | 62 ++++++++++++++++++++++++
>>>    lib/librte_ethdev/rte_flow_driver.h      |  6 +++
>>>    9 files changed, 155 insertions(+)
>>>
>>> diff --git a/app/test-pmd/cmdline_flow.c b/app/test-pmd/cmdline_flow.c
>>> index e6ab8ff2f7..45bcff3cf5 100644
>>> --- a/app/test-pmd/cmdline_flow.c
>>> +++ b/app/test-pmd/cmdline_flow.c
>>> @@ -343,6 +343,8 @@ enum index {
>>>    	ACTION_SET_IPV4_DSCP_VALUE,
>>>    	ACTION_SET_IPV6_DSCP,
>>>    	ACTION_SET_IPV6_DSCP_VALUE,
>>> +	ACTION_AGE,
>>> +	ACTION_AGE_TIMEOUT,
>>>    };
>>>
>>>    /** Maximum size for pattern in struct rte_flow_item_raw. */ @@
>>> -1145,6 +1147,7 @@ static const enum index next_action[] = {
>>>    	ACTION_SET_META,
>>>    	ACTION_SET_IPV4_DSCP,
>>>    	ACTION_SET_IPV6_DSCP,
>>> +	ACTION_AGE,
>>>    	ZERO,
>>>    };
>>>
>>> @@ -1370,6 +1373,13 @@ static const enum index action_set_ipv6_dscp[]
>> = {
>>>    	ZERO,
>>>    };
>>>
>>> +static const enum index action_age[] = {
>>> +	ACTION_AGE,
>>> +	ACTION_AGE_TIMEOUT,
>>> +	ACTION_NEXT,
>>> +	ZERO,
>>> +};
>>> +
>>>    static int parse_set_raw_encap_decap(struct context *, const struct
>> token *,
>>>    				     const char *, unsigned int,
>>>    				     void *, unsigned int);
>>> @@ -3694,6 +3704,22 @@ static const struct token token_list[] = {
>>>    			     (struct rte_flow_action_set_dscp, dscp)),
>>>    		.call = parse_vc_conf,
>>>    	},
>>> +	[ACTION_AGE] = {
>>> +		.name = "age",
>>> +		.help = "set a specific metadata header",
>>> +		.next = NEXT(action_age),
>>> +		.priv = PRIV_ACTION(AGE,
>>> +			sizeof(struct rte_flow_action_age)),
>>> +		.call = parse_vc,
>>> +	},
>>> +	[ACTION_AGE_TIMEOUT] = {
>>> +		.name = "timeout",
>>> +		.help = "flow age timeout value",
>>> +		.args = ARGS(ARGS_ENTRY_BF(struct rte_flow_action_age,
>>> +					   timeout, 24)),
>>> +		.next = NEXT(action_age, NEXT_ENTRY(UNSIGNED)),
>>> +		.call = parse_vc_conf,
>>> +	},
>>>    };
>>>
>>>    /** Remove and return last entry from argument stack. */ diff --git
>>> a/devtools/libabigail.abignore b/devtools/libabigail.abignore index
>>> a59df8f135..c047adbd79 100644
>>> --- a/devtools/libabigail.abignore
>>> +++ b/devtools/libabigail.abignore
>>> @@ -11,3 +11,9 @@
>>>            type_kind = enum
>>>            name = rte_crypto_asym_xform_type
>>>            changed_enumerators =
>> RTE_CRYPTO_ASYM_XFORM_TYPE_LIST_END
>>> +; Ignore ethdev event enum update because new event cannot be ;
>>> +received if not registered [suppress_type]
>>> +        type_kind = enum
>>> +        name = rte_eth_event_type
>>> +        changed_enumerators = RTE_ETH_EVENT_MAX
>>> diff --git a/doc/guides/prog_guide/rte_flow.rst
>>> b/doc/guides/prog_guide/rte_flow.rst
>>> index 41c147913c..cf4368e1c4 100644
>>> --- a/doc/guides/prog_guide/rte_flow.rst
>>> +++ b/doc/guides/prog_guide/rte_flow.rst
>>> @@ -2616,6 +2616,28 @@ Otherwise, RTE_FLOW_ERROR_TYPE_ACTION
>> error will be returned.
>>>       | ``dscp``  | DSCP in low 6 bits, rest ignore |
>>>       +-----------+---------------------------------+
>>>
>>> +Action: ``AGE``
>>> +^^^^^^^^^^^^^^^^^^^^^^^^^
>>> +
>>> +Set ageing timeout configuration to a flow.
>>> +
>>> +Event RTE_ETH_EVENT_FLOW_AGED will be reported if timeout passed
>>> +without any matching on the flow.
>>> +
>>> +.. _table_rte_flow_action_age:
>>> +
>>> +.. table:: AGE
>>> +
>>> +   +--------------+---------------------------------+
>>> +   | Field        | Value                           |
>>> +   +==============+=================================+
>>> +   | ``timeout``  | 24 bits timeout value           |
>>> +   +--------------+---------------------------------+
>>> +   | ``reserved`` | 8 bits reserved, must be zero   |
>>> +   +--------------+---------------------------------+
>>> +   | ``context``  | user input flow context         |
>>> +   +--------------+---------------------------------+
>>> +
>>>    Negative types
>>>    ~~~~~~~~~~~~~~
>>>
>>> diff --git a/doc/guides/rel_notes/release_20_05.rst
>>> b/doc/guides/rel_notes/release_20_05.rst
>>> index bacd4c65a2..ff0cf9f1d6 100644
>>> --- a/doc/guides/rel_notes/release_20_05.rst
>>> +++ b/doc/guides/rel_notes/release_20_05.rst
>>> @@ -135,6 +135,17 @@ New Features
>>>      by making use of the event device capabilities. The event mode currently
>> supports
>>>      only inline IPsec protocol offload.
>>>
>>> +* **Added flow Aging Support.**
>>> +
>>> +  Added flow Aging support to detect and report aged-out flows,
>> including:
>>> +
>>> +  * Added new action: RTE_FLOW_ACTION_TYPE_AGE to set the timeout
>> and the
>>> +    application flow context for each flow.
>>> +  * Added new event: RTE_ETH_EVENT_FLOW_AGED for the driver to
>> report that
>>> +    there are new aged-out flows.
>>> +  * Added new API: rte_flow_get_aged_flows to get the aged-out flows
>> contexts
>>> +    from the port.
>>> +
>>>
>>>    Removed Items
>>>    -------------
>>> diff --git a/lib/librte_ethdev/rte_ethdev.h
>>> b/lib/librte_ethdev/rte_ethdev.h index 8d69b88f9e..00cc7b4052 100644
>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>> @@ -3018,6 +3018,7 @@ enum rte_eth_event_type {
>>>    	RTE_ETH_EVENT_NEW,      /**< port is probed */
>>>    	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>>    	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>> +	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
>> */
>>>    	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>>    };
>>>
>>> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
>>> b/lib/librte_ethdev/rte_ethdev_version.map
>>> index 3f32fdecf7..fa4b5816be 100644
>>> --- a/lib/librte_ethdev/rte_ethdev_version.map
>>> +++ b/lib/librte_ethdev/rte_ethdev_version.map
>>> @@ -230,4 +230,7 @@ EXPERIMENTAL {
>>>
>>>    	# added in 20.02
>>>    	rte_flow_dev_dump;
>>> +
>>> +	# added in 20.05
>>> +	rte_flow_get_aged_flows;
>>>    };
>>> diff --git a/lib/librte_ethdev/rte_flow.c
>>> b/lib/librte_ethdev/rte_flow.c index a5ac1c7fbd..3699edce49 100644
>>> --- a/lib/librte_ethdev/rte_flow.c
>>> +++ b/lib/librte_ethdev/rte_flow.c
>>> @@ -172,6 +172,7 @@ static const struct rte_flow_desc_data
>> rte_flow_desc_action[] = {
>>>    	MK_FLOW_ACTION(SET_META, sizeof(struct
>> rte_flow_action_set_meta)),
>>>    	MK_FLOW_ACTION(SET_IPV4_DSCP, sizeof(struct
>> rte_flow_action_set_dscp)),
>>>    	MK_FLOW_ACTION(SET_IPV6_DSCP, sizeof(struct
>>> rte_flow_action_set_dscp)),
>>> +	MK_FLOW_ACTION(AGE, sizeof(struct rte_flow_action_age)),
>>>    };
>>>
>>>    int
>>> @@ -1232,3 +1233,20 @@ rte_flow_dev_dump(uint16_t port_id, FILE *file,
>> struct rte_flow_error *error)
>>>    				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
>>>    				  NULL, rte_strerror(ENOSYS));
>>>    }
>>> +
>>> +int
>>> +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
>>> +		    uint32_t nb_contexts, struct rte_flow_error *error) {
>>> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
>>> +
>>> +	if (unlikely(!ops))
>>> +		return -rte_errno;
>>> +	if (likely(!!ops->get_aged_flows))
>>> +		return flow_err(port_id, ops->get_aged_flows(dev,
>> contexts,
>>> +				nb_contexts, error), error);
>>> +	return rte_flow_error_set(error, ENOTSUP,
>>> +				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
>>> +				  NULL, rte_strerror(ENOTSUP));
>>> +}
>>> diff --git a/lib/librte_ethdev/rte_flow.h
>>> b/lib/librte_ethdev/rte_flow.h index 7f3e08fad3..fab44f6c0b 100644
>>> --- a/lib/librte_ethdev/rte_flow.h
>>> +++ b/lib/librte_ethdev/rte_flow.h
>>> @@ -2081,6 +2081,16 @@ enum rte_flow_action_type {
>>>    	 * See struct rte_flow_action_set_dscp.
>>>    	 */
>>>    	RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP,
>>> +
>>> +	/**
>>> +	 * Report as aged flow if timeout passed without any matching on the
>>> +	 * flow.
>>> +	 *
>>> +	 * See struct rte_flow_action_age.
>>> +	 * See function rte_flow_get_aged_flows
>>> +	 * see enum RTE_ETH_EVENT_FLOW_AGED
>>> +	 */
>>> +	RTE_FLOW_ACTION_TYPE_AGE,
>>>    };
>>>
>>>    /**
>>> @@ -2122,6 +2132,25 @@ struct rte_flow_action_queue {
>>>    	uint16_t index; /**< Queue index to use. */
>>>    };
>>>
>>> +/**
>>> + * @warning
>>> + * @b EXPERIMENTAL: this structure may change without prior notice
>>> + *
>>> + * RTE_FLOW_ACTION_TYPE_AGE
>>> + *
>>> + * Report flow as aged-out if timeout passed without any matching
>>> + * on the flow. RTE_ETH_EVENT_FLOW_AGED event is triggered when a
>>> + * port detects new aged-out flows.
>>> + *
>>> + * The flow context and the flow handle will be reported by the
>>> + * rte_flow_get_aged_flows API.
>>> + */
>>> +struct rte_flow_action_age {
>>> +	uint32_t timeout:24; /**< Time in seconds. */
>>> +	uint32_t reserved:8; /**< Reserved, must be zero. */
>>> +	void *context;
>>> +		/**< The user flow context, NULL means the rte_flow
>> pointer. */ };
>>>
>>>    /**
>>>     * @warning
>>> @@ -3254,6 +3283,39 @@ rte_flow_conv(enum rte_flow_conv_op op,
>>>    	      const void *src,
>>>    	      struct rte_flow_error *error);
>>>
>>> +/**
>>> + * Get aged-out flows of a given port.
>>> + *
>>> + * RTE_ETH_EVENT_FLOW_AGED event will be triggered when at least
>> one
>>> +new aged
>>> + * out flow was detected after the last call to rte_flow_get_aged_flows.
>>> + * This function can be called to get the aged flows usynchronously
>>> +from the
>> asynchronously
>>> + * event callback or synchronously regardless the event.
>>> + * This is not safe to call rte_flow_get_aged_flows function with
>>> + other flow
>> It is not safe to
>>> + * functions from multiple threads simultaneously.
>>> + *
>>> + * @param port_id
>>> + *   Port identifier of Ethernet device.
>>> + * @param[in, out] contexts
>>> + *   The address of an array of pointers to the aged-out flows contexts.
>>> + * @param[in] nb_contexts
>>> + *   The length of context array pointers.
>>> + * @param[out] error
>>> + *   Perform verbose error reporting if not NULL. Initialized in case of
>>> + *   error only.
>>> + *
>>> + * @return
>>> + *   if nb_contexts is 0, return the amount of all aged contexts.
>>> + *   if nb_contexts is not 0 , return the amount of aged flows reported
>>> + *   in the context array, otherwise negative errno value.
>>> + *
>>> + * @see rte_flow_action_age
>>> + * @see RTE_ETH_EVENT_FLOW_AGED
>>> + */
>>> +__rte_experimental
>>> +int
>>> +rte_flow_get_aged_flows(uint16_t port_id, void **contexts,
>>> +			uint32_t nb_contexts, struct rte_flow_error *error);
>>> +
>>>    #ifdef __cplusplus
>>>    }
>>>    #endif
>>> diff --git a/lib/librte_ethdev/rte_flow_driver.h
>>> b/lib/librte_ethdev/rte_flow_driver.h
>>> index 51a9a57a0f..881cc469b7 100644
>>> --- a/lib/librte_ethdev/rte_flow_driver.h
>>> +++ b/lib/librte_ethdev/rte_flow_driver.h
>>> @@ -101,6 +101,12 @@ struct rte_flow_ops {
>>>    		(struct rte_eth_dev *dev,
>>>    		 FILE *file,
>>>    		 struct rte_flow_error *error);
>>> +	/** See rte_flow_get_aged_flows() */
>>> +	int (*get_aged_flows)
>>> +		(struct rte_eth_dev *dev,
>>> +		 void **context,
>>> +		 uint32_t nb_contexts,
>>> +		 struct rte_flow_error *err);
>>>    };
>>>
>>>    /**
>>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/2] net/mlx5: support flow aging
  2020-04-29  2:25       ` [dpdk-dev] [PATCH v3 0/2] " Bill Zhou
  2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
  2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: support flow aging Bill Zhou
@ 2020-05-03  7:41         ` Matan Azrad
  2020-05-03 11:47           ` Raslan Darawsheh
  2 siblings, 1 reply; 50+ messages in thread
From: Matan Azrad @ 2020-05-03  7:41 UTC (permalink / raw)
  To: Bill Zhou, Ori Kam, Shahaf Shuler, Slava Ovsiienko,
	marko.kovacevic, john.mcnamara
  Cc: dev



From: Bill Zhou:
> These patches implement flow aging for the mlx5 driver. The first patch
> changes the additional memory allocation for counters, so that each
> counter's additional memory can be located by offset. The second patch
> implements the aging check and the age-out event callback mechanism for
> the mlx5 driver.
> 
> 
> Bill Zhou (2):
>   net/mlx5: modify ext-counter memory allocation
>   net/mlx5: support flow aging


Series-acked-by: Matan Azrad <matan@mellanox.com>

>  doc/guides/rel_notes/release_20_05.rst |   1 +
>  drivers/net/mlx5/mlx5.c                |  93 ++++--
>  drivers/net/mlx5/mlx5.h                |  79 +++++-
>  drivers/net/mlx5/mlx5_flow.c           | 205 ++++++++++++--
>  drivers/net/mlx5/mlx5_flow.h           |  16 +-
>  drivers/net/mlx5/mlx5_flow_dv.c        | 373 +++++++++++++++++++++----
>  drivers/net/mlx5/mlx5_flow_verbs.c     |  16 +-
>  7 files changed, 655 insertions(+), 128 deletions(-)
> 
> --
> 2.21.0



* Re: [dpdk-dev] [PATCH v3 0/2] net/mlx5: support flow aging
  2020-05-03  7:41         ` [dpdk-dev] [PATCH v3 0/2] " Matan Azrad
@ 2020-05-03 11:47           ` Raslan Darawsheh
  0 siblings, 0 replies; 50+ messages in thread
From: Raslan Darawsheh @ 2020-05-03 11:47 UTC (permalink / raw)
  To: Matan Azrad, Bill Zhou, Ori Kam, Shahaf Shuler, Slava Ovsiienko,
	marko.kovacevic, john.mcnamara
  Cc: dev

Hi,

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Matan Azrad
> Sent: Sunday, May 3, 2020 10:42 AM
> To: Bill Zhou <dongz@mellanox.com>; Ori Kam <orika@mellanox.com>;
> Shahaf Shuler <shahafs@mellanox.com>; Slava Ovsiienko
> <viacheslavo@mellanox.com>; marko.kovacevic@intel.com;
> john.mcnamara@intel.com
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/2] net/mlx5: support flow aging
> 
> 
> 
> From: Bill Zhou:
> > These patches implement flow aging for the mlx5 driver. The first patch
> > changes the additional memory allocation for counters, so that each
> > counter's additional memory can be located by offset. The second patch
> > implements the aging check and the age-out event callback mechanism for
> > the mlx5 driver.
> >
> >
> > Bill Zhou (2):
> >   net/mlx5: modify ext-counter memory allocation
> >   net/mlx5: support flow aging
> 
> 
> Series-acked-by: Matan Azrad <matan@mellanox.com>
> 
> >  doc/guides/rel_notes/release_20_05.rst |   1 +
> >  drivers/net/mlx5/mlx5.c                |  93 ++++--
> >  drivers/net/mlx5/mlx5.h                |  79 +++++-
> >  drivers/net/mlx5/mlx5_flow.c           | 205 ++++++++++++--
> >  drivers/net/mlx5/mlx5_flow.h           |  16 +-
> >  drivers/net/mlx5/mlx5_flow_dv.c        | 373 +++++++++++++++++++++----
> >  drivers/net/mlx5/mlx5_flow_verbs.c     |  16 +-
> >  7 files changed, 655 insertions(+), 128 deletions(-)
> >
> > --
> > 2.21.0


Series applied to next-net-mlx,

Kindest regards,
Raslan Darawsheh



Thread overview: 50+ messages
2019-05-26 10:18 [dpdk-dev] [PATCH] [RFC] ethdev: support flow aging Matan Azrad
2019-06-06 10:24 ` Jerin Jacob Kollanukkaran
2019-06-06 10:51   ` Matan Azrad
2019-06-06 12:15     ` Jerin Jacob Kollanukkaran
2019-06-18  5:56       ` Matan Azrad
2019-06-24  6:26         ` Jerin Jacob Kollanukkaran
2019-06-27  8:26           ` Matan Azrad
2020-03-16 16:13       ` Stephen Hemminger
2020-03-16 10:22 ` [dpdk-dev] [PATCH v2] " BillZhou
2020-03-16 12:52 ` BillZhou
2020-03-20  6:59   ` Jerin Jacob
2020-03-24 10:18   ` Andrew Rybchenko
2020-04-10  9:46   ` [dpdk-dev] [PATCH] " BillZhou
2020-04-10 10:14     ` Thomas Monjalon
2020-04-13  4:02       ` Bill Zhou
2020-04-10 12:07     ` Andrew Rybchenko
2020-04-10 12:41       ` Jerin Jacob
2020-04-12  9:13     ` Ori Kam
2020-04-12  9:48       ` Matan Azrad
2020-04-14  8:32     ` [dpdk-dev] [PATCH v2] " Dong Zhou
2020-04-14  8:49       ` Ori Kam
2020-04-14  9:23         ` Bill Zhou
2020-04-16 13:32         ` Ferruh Yigit
2020-04-17 22:00       ` Ferruh Yigit
2020-04-17 22:07         ` Stephen Hemminger
2020-04-18  5:04         ` Bill Zhou
2020-04-18  9:44           ` Thomas Monjalon
2020-04-20 14:06             ` Ferruh Yigit
2020-04-20 16:10               ` Thomas Monjalon
2020-04-21 10:04                 ` Ferruh Yigit
2020-04-21 10:09                   ` Thomas Monjalon
2020-04-21 15:59                   ` Andrew Rybchenko
2020-04-21  6:22       ` [dpdk-dev] [PATCH v3] " Bill Zhou
2020-04-21 10:11         ` [dpdk-dev] [PATCH v4] " Bill Zhou
2020-04-21 17:13           ` Ferruh Yigit
2020-04-29 14:50           ` Tom Barbette
2020-04-30  7:36             ` Matan Azrad
2020-04-30  7:49               ` Tom Barbette
2020-04-13 14:53   ` [dpdk-dev] [PATCH 0/2] " Dong Zhou
2020-04-13 14:53     ` [dpdk-dev] [PATCH 1/2] net/mlx5: modify ext-counter memory allocation Dong Zhou
2020-04-13 14:53     ` [dpdk-dev] [PATCH 2/2] net/mlx5: support flow aging Dong Zhou
2020-04-24 10:45     ` [dpdk-dev] [PATCH v2 0/2] " Bill Zhou
2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
2020-04-24 10:45       ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: support flow aging Bill Zhou
2020-04-26  7:07         ` Suanming Mou
2020-04-29  2:25       ` [dpdk-dev] [PATCH v3 0/2] " Bill Zhou
2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 1/2] net/mlx5: modify ext-counter memory allocation Bill Zhou
2020-04-29  2:25         ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: support flow aging Bill Zhou
2020-05-03  7:41         ` [dpdk-dev] [PATCH v3 0/2] " Matan Azrad
2020-05-03 11:47           ` Raslan Darawsheh
