DPDK patches and discussions
* [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
@ 2017-12-21 22:21 Doherty, Declan
  2017-12-24 17:30 ` Shahaf Shuler
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Doherty, Declan @ 2017-12-21 22:21 UTC (permalink / raw)
  To: dev

This RFC contains a proposal to add a new tunnel endpoint API to DPDK that, when used
in conjunction with rte_flow, enables the configuration of inline data path encapsulation
and decapsulation of tunnel endpoint network overlays on accelerated IO devices.

The proposed new API would provide for the creation, destruction, and
monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs to allow the
acceleration features to be discovered by applications.

/** Tunnel Endpoint context, opaque structure */
struct rte_tep;

enum rte_tep_type {
               RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
               RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
               ...
};

/** Tunnel Endpoint Attributes */
struct rte_tep_attr {
               enum rte_tep_type type;

               /* other endpoint attributes here */
};

/**
* Create a tunnel end-point context as specified by the flow attribute and pattern
*
* @param   port_id     Port identifier of Ethernet device.
* @param   attr        Flow rule attributes.
* @param   pattern     Pattern specification by list of rte_flow_items.
* @return
*  - On success returns pointer to TEP context
*  - On failure returns NULL
*/
struct rte_tep *rte_tep_create(uint16_t port_id,
                              struct rte_tep_attr *attr, struct rte_flow_item pattern[]);

/**
* Destroy an existing tunnel end-point context. All the end-point's context
* will be destroyed, so all active flows using the tep should be freed before
* destroying the context.
* @param   port_id    Port identifier of Ethernet device.
* @param   tep        Tunnel endpoint context
* @return
*  - On success returns 0
*  - On failure returns 1
*/
int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep);

/**
* Get tunnel endpoint statistics
*
* @param   port_id    Port identifier of Ethernet device.
* @param   tep        Tunnel endpoint context
* @param   stats      Tunnel endpoint statistics
*
* @return
*  - On success returns 0
*  - On failure returns 1
*/
int
rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
                              struct rte_tep_stats *stats);

/**
* Get ports tunnel endpoint capabilities
*
* @param   port_id    Port identifier of Ethernet device.
* @param   capabilities        Tunnel endpoint capabilities
*
* @return
*  - On success returns 0
*  - On failure returns 1
*/
int
rte_tep_capabilities_get(uint16_t port_id,
                              struct rte_tep_capabilities *capabilities);


To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
enhanced to add a new flow item type. This contains a pointer to the
TEP context as well as the overlay flow id to which the traffic flow is
associated.

struct rte_flow_item_tep {
               struct rte_tep *tep;
               uint32_t flow_id;
};

Two new generic action types are also added: encapsulation and decapsulation.

RTE_FLOW_ACTION_TYPE_ENCAP
RTE_FLOW_ACTION_TYPE_DECAP

struct rte_flow_action_encap {
               struct rte_flow_item *item;
};

struct rte_flow_action_decap {
               struct rte_flow_item *item;
};

The following section outlines the intended usage of the new APIs and then how
they are combined with the existing rte_flow APIs.

Tunnel endpoints are created with rte_tep_create() on logical ports which support
the capability, using a combination of TEP attributes and rte_flow_items. In the
example below a new IPv4 VxLAN endpoint is defined. The attrs parameter sets the
TEP type, and could be used for other possible attributes.

struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };

The values for the headers which make up the tunnel endpoint are then
defined using the spec parameter of the rte_flow items (IPv4, UDP and
VxLAN in this case):

struct rte_flow_item_ipv4 ipv4_item = {
               .hdr = { .src_addr = saddr, .dst_addr = daddr }
};

struct rte_flow_item_udp udp_item = {
               .hdr = { .src_port = sport, .dst_port = dport }
};

struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };

struct rte_flow_item pattern[] = {
               { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
               { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
               { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
               { .type = RTE_FLOW_ITEM_TYPE_END }
};

The tunnel endpoint can then be created on the port. Whether or not any hw
configuration is required at this point is hw dependent, but in either case the
context for the TEP is then available for use when programming flows, so the
application is not forced to redefine the TEP parameters on each flow
addition.

struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
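A minimal lifecycle sketch (not part of the proposal itself) of checking the result
and eventually releasing the endpoint, assuming only the rte_tep_create() and
rte_tep_destroy() prototypes above:

/* Hedged sketch: error handling and teardown for the tep created above.
 * All rte_flow rules referencing the tep must be destroyed before the
 * endpoint itself is destroyed, per the rte_tep_destroy() description. */
if (tep == NULL)
        rte_exit(EXIT_FAILURE, "VxLAN TEP offload unavailable on port %u\n",
                 port_id);

/* ... create and later destroy the rte_flow rules that reference tep ... */

if (rte_tep_destroy(port_id, tep) != 0)
        printf("warning: failed to destroy TEP on port %u\n", port_id);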

Once the tep context is created, flows can then be directed to that endpoint for
processing. The following sections outline how the author envisages flow
programming will work and how TEP acceleration can be combined with other
accelerations.


Ingress TEP decapsulation, mark and forward to queue:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The flow definition for TEP decapsulation actions should specify the full
outer packet to be matched at a minimum. The outer packet definition should
match the tunnel definition in the tep context and the tep flow id. This
example describes matching on the outer headers, marking the packet with the
VXLAN VNI and directing it to a specified queue of the port.

Source Packet

       Decapsulate Outer Hdr
     /                       \                                    decap outer crc
    /                         \                                    /          \
    +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
    | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
    +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+

/* Flow Attributes/Items Definitions */

struct rte_flow_attr attr = { .ingress = 1 };

struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
struct rte_flow_item_tep tep_item = { .tep = tep, .flow_id = vni };

struct rte_flow_item pattern[] = {
               { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
               { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
               { .type = RTE_FLOW_ITEM_TYPE_END }
};

/* Flow Actions Definitions */

struct rte_flow_action_decap decap_eth = {
               .type = RTE_FLOW_ITEM_TYPE_ETH,
               .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
};

struct rte_flow_action_decap decap_tep = {
               .type = RTE_FLOW_ITEM_TYPE_TEP,
               .item = &tep_item
};

struct rte_flow_action_queue queue_action = { .index = qid };

struct rte_flow_action_mark mark_action = { .id = vni };

struct rte_flow_action actions[] = {
               { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
               { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
               { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
               { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
               { .type = RTE_FLOW_ACTION_TYPE_END }
};

/** VERY IMPORTANT NOTE **/
One of the core concepts of this proposal is that actions which modify the
packet are defined in the order in which they are to be processed: first decap
the outer Ethernet header, then the outer TEP headers.
I think this is not only logical from a usability point of view, it should also
simplify the logic required in PMDs to parse the desired actions.
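
To make the ordering concrete, a hedged sketch (not part of the proposal) of how
a PMD might walk the ordered action list; the first DECAP found refers to the
outermost header, the next to the header beneath it, and so on:

/* Hedged sketch: counts DECAP actions in processing order. Only the
 * RTE_FLOW_ACTION_TYPE_DECAP value proposed in this RFC is assumed; a
 * real PMD would programme one level of hw header removal per DECAP
 * action encountered, outermost first. */
static int
count_ordered_decaps(const struct rte_flow_action actions[])
{
        int depth = 0;

        for (; actions->type != RTE_FLOW_ACTION_TYPE_END; actions++)
                if (actions->type == RTE_FLOW_ACTION_TYPE_DECAP)
                        depth++;
        return depth;
}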

struct rte_flow *flow =
                              rte_flow_create(port_id, &attr, pattern, actions, &err);

The processed packets are delivered to the specified queue with mbuf metadata
denoting the marked flow id and with the mbuf ol_flags PKT_RX_TEP_OFFLOAD set.

    +-----+------+-----+---------+-----+
    | ETH | IPv4 | TCP | PAYLOAD | CRC |
    +-----+------+-----+---------+-----+
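
A hedged RX-side sketch (not part of the proposal) of how an application might
consume these packets, assuming the proposed PKT_RX_TEP_OFFLOAD flag and that the
MARK value is delivered in the usual mbuf location (hash.fdir.hi with
PKT_RX_FDIR_ID set):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hedged sketch: drain the queue the decap rule above directs to.
 * PKT_RX_TEP_OFFLOAD is the new ol_flag proposed in this RFC, not an
 * existing DPDK flag. */
static uint16_t
drain_tep_queue(uint16_t port_id, uint16_t qid)
{
        struct rte_mbuf *bufs[32];
        uint16_t i, nb = rte_eth_rx_burst(port_id, qid, bufs, 32);

        for (i = 0; i < nb; i++) {
                struct rte_mbuf *m = bufs[i];

                if ((m->ol_flags & PKT_RX_TEP_OFFLOAD) &&
                    (m->ol_flags & PKT_RX_FDIR_ID)) {
                        uint32_t vni = m->hash.fdir.hi;
                        /* hw has stripped the outer headers; packet data
                         * now starts at the inner Ethernet header and vni
                         * identifies the overlay flow */
                        (void)vni;
                }
                rte_pktmbuf_free(m);
        }
        return nb;
}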


Ingress TEP decapsulation switch to port:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is intended to represent how a TEP decapsulation could be configured
in a switching offload case. It assumes that there is a logical
port representation for all ports on the hw switch in the DPDK application,
but similar functionality could be achieved by specifying something like a
VF ID of the device.

As in the previous scenario, the flow definition for TEP decapsulation actions
should specify the full outer packet to be matched at a minimum, but should also
define the elements of the inner packet to match against, including masks if
required.

struct rte_flow_attr attr = { .ingress = 1 };

struct rte_flow_item pattern[] = {
               { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
               { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item, .mask = &tep_mask },
               { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item, .mask = &eth_mask },
               { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &inner_ipv4_item, .mask = &ipv4_mask },
               { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item, .mask = &tcp_mask },
               { .type = RTE_FLOW_ITEM_TYPE_END }
};

/* Flow Actions Definitions */

struct rte_flow_action_decap decap_eth = {
               .type = RTE_FLOW_ITEM_TYPE_ETH,
               .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
};

struct rte_flow_action_decap decap_tep = {
               .type = RTE_FLOW_ITEM_TYPE_TEP,
               .item = &outer_tep_item
};

struct rte_flow_action_port port_action = { .index = port_id };

struct rte_flow_action actions[] = {
               { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
               { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
               { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
               { .type = RTE_FLOW_ACTION_TYPE_END }
};

struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);

This action will forward the decapsulated packets to another port of the switch
fabric, but no information on the tunnel, or on the fact that the packet was
decapsulated, will be passed with it, thereby enabling segregation of the
infrastructure.


Egress TEP encapsulation:
~~~~~~~~~~~~~~~~~~~~~~~~~

Encapsulation TEP actions require the flow definition for the source packet
and then the actions to apply to it; this example shows an IPv4/TCP packet.

Source Packet

    +-----+------+-----+---------+-----+
    | ETH | IPv4 | TCP | PAYLOAD | CRC |
    +-----+------+-----+---------+-----+

struct rte_flow_attr attr = { .egress = 1 };

struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr = src_addr, .dst_addr = dst_addr } };
struct rte_flow_item_tcp tcp_item = { .hdr = { .src_port = src_port, .dst_port = dst_port } };

struct rte_flow_item pattern[] = {
               { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
               { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
               { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
               { .type = RTE_FLOW_ITEM_TYPE_END }
};

/* Flow Actions Definitions */

struct rte_flow_action_encap encap_eth = {
               .type = RTE_FLOW_ITEM_TYPE_ETH,
               .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
};

struct rte_flow_action_encap encap_tep = {
               .type = RTE_FLOW_ITEM_TYPE_TEP,
               .item = { .tep = tep, .flow_id = vni }
};
struct rte_flow_action_port port_action = { .index = port_id };

struct rte_flow_action actions[] = {
               { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
               { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
               { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
               { .type = RTE_FLOW_ACTION_TYPE_END }
};
struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);


      encapsulating Outer Hdr
     /                       \                                      outer crc
    /                         \                                   /          \
    +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
    | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
    +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+



Chaining multiple modification actions, e.g. IPsec and TEP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For example, the definition for full hw acceleration of an IPsec ESP/Transport
SA encapsulated in a VxLAN tunnel would look something like:

struct rte_flow_action actions[] = {
               { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
               { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session },
               { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
               { .type = RTE_FLOW_ACTION_TYPE_END }
};

1. Source Packet
                           +-----+------+-----+---------+-----+
                           | ETH | IPv4 | TCP | PAYLOAD | CRC |
                           +-----+------+-----+---------+-----+

2. First Action - Tunnel Endpoint Encapsulation

      +------+-----+-------+-----+------+-----+---------+-----+
      | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
      +------+-----+-------+-----+------+-----+---------+-----+

3. Second Action - IPsec ESP/Transport Security Processing

      +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
      | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
      +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+

4. Third Action - Outer Ethernet Encapsulation

+-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
| ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER | OUTER CRC |
+-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+

This example demonstrates the importance of having the actions be ordered: as
shown above, a security action can be defined on both the inner and outer packet
by simply placing another security action at the beginning of the action list.

It also demonstrates the rationale for not collapsing the Ethernet header into
the TEP definition: when there are multiple encapsulating actions, any of them
could potentially be the one where the Ethernet header needs to be defined.


* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2017-12-21 22:21 [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement Doherty, Declan
@ 2017-12-24 17:30 ` Shahaf Shuler
  2018-01-09 17:30   ` Doherty, Declan
       [not found] ` <3560e76a-c99b-4dc3-9678-d7975acf67c9@mellanox.com>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Shahaf Shuler @ 2017-12-24 17:30 UTC (permalink / raw)
  To: Doherty, Declan, dev

Hi Declan,

Friday, December 22, 2017 12:21 AM, Doherty, Declan:
> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that
> when used in conjunction with rte_flow enables the configuration of inline
> data path encapsulation and decapsulation of tunnel endpoint network
> overlays on accelerated IO devices.
> 
> The proposed new API would provide for the creation, destruction, and
> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs
> to allow the acceleration features to be discovered by applications.
> 
> /** Tunnel Endpoint context, opaque structure */ struct rte_tep;
> 
> enum rte_tep_type {
>                RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>                RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>                ...
> };
> 
> /** Tunnel Endpoint Attributes */
> struct rte_tep_attr {
>                enum rte_type_type type;
> 
>                /* other endpoint attributes here */ }
> 
> /**
> * Create a tunnel end-point context as specified by the flow attribute and
> pattern
> *
> * @param   port_id     Port identifier of Ethernet device.
> * @param   attr        Flow rule attributes.
> * @param   pattern     Pattern specification by list of rte_flow_items.
> * @return
> *  - On success returns pointer to TEP context
> *  - On failure returns NULL
> */
> struct rte_tep *rte_tep_create(uint16_t port_id,
>                               struct rte_tep_attr *attr, struct rte_flow_item pattern[])
> 
> /**
> * Destroy an existing tunnel end-point context. All the end-points context
> * will be destroyed, so all active flows using tep should be freed before
> * destroying context.
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
> 
> /**
> * Get tunnel endpoint statistics
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @param   stats      Tunnel endpoint statistics
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> Int
> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>                               struct rte_tep_stats *stats)
> 
> /**
> * Get ports tunnel endpoint capabilities
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   capabilities        Tunnel endpoint capabilities
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int
> rte_tep_capabilities_get(uint16_t port_id,
>                               struct rte_tep_capabilities *capabilities)


I am not sure I understand why there is a need for the above control methods.
Are you introducing a new "tep device"?
As the tunnel endpoint is sending and receiving Ethernet packets from the network, I think it should still be counted as an Ethernet device, but with more capabilities (for example it supports encap/decap etc.); therefore it should use the ethdev layer API to query statistics (for example).
As for the capabilities - what specifically did you have in mind? The current usage you show with tep is with rte_flow rules. There are currently no capabilities for rte_flow supported actions/patterns; to check such capabilities an application uses rte_flow_validate (see the sketch below).
Regarding the creation/destroy of tep - why not simply use the rte_flow API and avoid this extra control?
For example, with the 17.11 APIs an application can put the port in isolate mode and insert a flow rule to catch only IPv4 VXLAN traffic and direct it to some queue/do RSS. Such an operation, per my understanding, will create a tunnel endpoint. What are the downsides of doing it with the current APIs?
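
A sketch of the validate-based probe referred to above, using the existing
rte_flow_validate() entry point; the attr/pattern/actions arrays are whatever
rule the application intends to install (e.g. the VXLAN match plus queue/RSS
actions mentioned):

struct rte_flow_error err;

/* rte_flow_validate() reports whether the port could accept the rule
 * without actually creating it. */
if (rte_flow_validate(port_id, &attr, pattern, actions, &err) == 0)
        printf("port %u can offload this tunnel rule\n", port_id);
else
        printf("rule not supported: %s\n",
               err.message ? err.message : "(no details)");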

> 
> 
> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
> enhanced to add a new flow item type. This contains a pointer to the TEP
> context as well as the overlay flow id to which the traffic flow is associated.
> 
> struct rte_flow_item_tep {
>                struct rte_tep *tep;
>                uint32_t flow_id;
> }

Can you provide a more detailed definition of the flow id? To which field from the packet headers does it refer?
In your examples below it looks like it matches the VXLAN vni in the VXLAN case; what about the other protocols? And also, why not use the already existing VXLAN item?

Generally I like the idea of separating the encap/decap context from the action. However, it looks like the rte_flow_item has a double meaning in this RFC, once for the classification and once for the action.
From the top of my head I would think of an API which separates those, and re-uses the existing flow items. Something like:

 struct rte_flow_item pattern[] = {
                { set of already existing pattern items },
                { ... },
                { .type = RTE_FLOW_ITEM_TYPE_END } };

encap_ctx = create_encap_context(pattern);

struct rte_flow_action actions[] = {
	{ .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = encap_ctx }
};
 
> Also 2 new generic actions types are added encapsulation and decapsulation.
> 
> RTE_FLOW_ACTION_TYPE_ENCAP
> RTE_FLOW_ACTION_TYPE_DECAP
> 
> struct rte_flow_action_encap {
>                struct rte_flow_item *item; }
> 
> struct rte_flow_action_decap {
>                struct rte_flow_item *item; }
> 
> The following section outlines the intended usage of the new APIs and then
> how they are combined with the existing rte_flow APIs.
> 
> Tunnel endpoints are created on logical ports which support the capability
> using rte_tep_create() using a combination of TEP attributes and
> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is being
> defined.
> The attrs parameter sets the TEP type, and could be used for other possible
> attributes.
> 
> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
> 
> The values for the headers which make up the tunnel endpointr are then
> defined using spec parameter in the rte flow items (IPv4, UDP and VxLAN in
> this case)
> 
> struct rte_flow_item_ipv4 ipv4_item = {
>                .hdr = { .src_addr = saddr, .dst_addr = daddr } };
> 
> struct rte_flow_item_udp udp_item = {
>                .hdr = { .src_port = sport, .dst_port = dport } };
> 
> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>                { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> The tunnel endpoint can then be create on the port. Whether or not any hw
> configuration is required at this point would be hw dependent, but if not the
> context for the TEP is available for use in programming flow, so the
> application is not forced to redefine the TEP parameters on each flow
> addition.
> 
> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
> 
> Once the tep context is created flows can then be directed to that endpoint
> for processing. The following sections will outline how the author envisage
> flow programming will work and also how TEP acceleration can be combined
> with other accelerations.
> 
> 
> Ingress TEP decapsulation, mark and forward to queue:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The flows definition for TEP decapsulation actions should specify the full
> outer packet to be matched at a minimum. The outer packet definition
> should match the tunnel definition in the tep context and the tep flow id.
> This example shows describes matching on the outer, marking the packet
> with the VXLAN VNI and directing to a specified queue of the port.
> 
> Source Packet
> 
>        Decapsulate Outer Hdr
>      /                       \                                    decap outer crc
>     /                         \                                    /          \
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER
> CRC |
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> /* Flow Attributes/Items Definitions */
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type =
> ether_type }; struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
> 
> struct rte_flow_action_decap decap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item };
> 
> struct rte_flow_action_queue queue_action = { .index = qid };
> 
> struct rte_flow_action_port mark_action = { .index = vni };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
>                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END } };
> 
> /** VERY IMPORTANT NOTE **/
> One of the core concepts of this proposal is that actions which modify the
> packet are defined in the order which they are to be processed. So first
> decap outer ethernet header, then the outer TEP headers.
> I think this is not only logical from a usability point of view, it should also
> simplify the logic required in PMDs to parse the desired actions.
> 
> struct rte_flow *flow =
>                               rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> The processed packets are delivered to specifed queue with mbuf metadata
> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
> 
>     +-----+------+-----+---------+-----+
>     | ETH | IPv4 | TCP | PAYLOAD | CRC |
>     +-----+------+-----+---------+-----+
> 
> 
> Ingress TEP decapsulation switch to port:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> This is intended to represent how a TEP decapsulation could be configured in
> a switching offload case, it makes an assumption that there is a logical port
> representation for all ports on the hw switch in the DPDK application, but
> similar functionality could be achieved by specifying something like a VF ID of
> the device.
> 
> Like the previous scenario the flows definition for TEP decapsulation actions
> should specify the full outer packet to be matched at a minimum but also
> define the elements of the inner match to match against including masks if
> required.
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item,
> .mask = &tep_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item,
> .mask = &eth_mask }
>                { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = &inner_ipv4_item,
> .mask = &ipv4_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item,
> .mask = &tcp_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
> 
> struct rte_flow_action_decap decap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
>                .item = &outer_tep_item
> };
> 
> struct rte_flow_action_port port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END } };
> 
> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions,
> &err);
> 
> This action will forward the decapsulated packets to another port of the
> switch fabric but no information will on the tunnel or the fact that the packet
> was decapsulated will be passed with it, thereby enable segregation of the
> infrastructure and
> 
> 
> Egress TEP encapsulation:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Encapulsation TEP actions require the flow definitions for the source packet
> and then the actions to do on that, this example shows a ipv4/tcp packet
> action.
> 
> Source Packet
> 
>     +-----+------+-----+---------+-----+
>     | ETH | IPv4 | TCP | PAYLOAD | CRC |
>     +-----+------+-----+---------+-----+
> 
> struct rte_flow_attr attr = { .egress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type =
> ether_type }; struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr =
> src_addr, .dst_addr = dst_addr } }; struct rte_flow_item_udp tcp_item = {
> .hdr = { .src_port = src_port, .dst_port = dst_port } };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_encap encap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
> 
> struct rte_flow_action_encap encap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
>                .item = { .tep = tep, .id = vni } }; struct rte_flow_action_mark
> port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END } } struct rte_flow *flow =
> rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> 
>       encapsulating Outer Hdr
>      /                       \                                      outer crc
>     /                         \                                   /          \
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER
> CRC |
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> 
> 
> Chaining multiple modification actions eg IPsec and TEP
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> For example the definition for full hw acceleration for an IPsec ESP/Transport
> SA encapsulated in a vxlan tunnel would look something like:
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session
> },
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_END } }
> 
> 1. Source Packet
>                            +-----+------+-----+---------+-----+
>                            | ETH | IPv4 | TCP | PAYLOAD | CRC |
>                            +-----+------+-----+---------+-----+
> 
> 2. First Action - Tunnel Endpoint Encapsulation
> 
>       +------+-----+-------+-----+------+-----+---------+-----+
>       | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>       +------+-----+-------+-----+------+-----+---------+-----+
> 
> 3. Second Action - IPsec ESP/Transport Security Processing
> 
>       +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>       | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>       +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
> 
> 4. Third Action - Outer Ethernet Encapsulation
> 
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+---
> --------+
> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
> OUTER CRC |
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+---
> --------+
> 
> This example demonstrates the importance of making the interoperation of
> actions to be ordered, as in the above example, a security action can be
> defined on both the inner and outer packet by simply placing another
> security action at the beginning of the action list.
> 
> It also demonstrates the rationale for not collapsing the Ethernet into the TEP
> definition as when you have multiple encapsulating actions, all could
> potentially be the place where the Ethernet header needs to be defined.
> 


* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
       [not found] ` <3560e76a-c99b-4dc3-9678-d7975acf67c9@mellanox.com>
@ 2018-01-02 10:50   ` Boris Pismenny
  2018-01-10 16:04   ` Doherty, Declan
  1 sibling, 0 replies; 15+ messages in thread
From: Boris Pismenny @ 2018-01-02 10:50 UTC (permalink / raw)
  To: dev; +Cc: Doherty, Declan <declan.doherty@intel.com>; Shahaf Shuler


Hi Declan,

On 12/22/2017 12:21 AM, Doherty, Declan wrote:
> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that when used
> in conjunction with rte_flow enables the configuration of inline data path encapsulation
> and decapsulation of tunnel endpoint network overlays on accelerated IO devices.
> 
> The proposed new API would provide for the creation, destruction, and
> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs to allow the
> acceleration features to be discovered by applications.
> 
> /** Tunnel Endpoint context, opaque structure */
> struct rte_tep;
> 
> enum rte_tep_type {
>                 RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>                 RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>                 ...
> };
> 
> /** Tunnel Endpoint Attributes */
> struct rte_tep_attr {
>                 enum rte_type_type type;
> 
>                 /* other endpoint attributes here */
> }
> 
> /**
> * Create a tunnel end-point context as specified by the flow attribute and pattern
> *
> * @param   port_id     Port identifier of Ethernet device.
> * @param   attr        Flow rule attributes.
> * @param   pattern     Pattern specification by list of rte_flow_items.
> * @return
> *  - On success returns pointer to TEP context
> *  - On failure returns NULL
> */
> struct rte_tep *rte_tep_create(uint16_t port_id,
>                                struct rte_tep_attr *attr, struct rte_flow_item pattern[])
> 
> /**
> * Destroy an existing tunnel end-point context. All the end-points context
> * will be destroyed, so all active flows using tep should be freed before
> * destroying context.
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
> 
> /**
> * Get tunnel endpoint statistics
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @param   stats      Tunnel endpoint statistics
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> Int
> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>                                struct rte_tep_stats *stats)
> 
> /**
> * Get ports tunnel endpoint capabilities
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   capabilities        Tunnel endpoint capabilities
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int
> rte_tep_capabilities_get(uint16_t port_id,
>                                struct rte_tep_capabilities *capabilities)
> 
> 
> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
> enhanced to add a new flow item type. This contains a pointer to the
> TEP context as well as the overlay flow id to which the traffic flow is
> associated.
> 
> struct rte_flow_item_tep {
>                 struct rte_tep *tep;
>                 uint32_t flow_id;
> }
> 
> Also 2 new generic actions types are added encapsulation and decapsulation.
> 
> RTE_FLOW_ACTION_TYPE_ENCAP
> RTE_FLOW_ACTION_TYPE_DECAP
> 
> struct rte_flow_action_encap {
>                 struct rte_flow_item *item;
> }
> 
> struct rte_flow_action_decap {
>                 struct rte_flow_item *item;
> }
> 
> The following section outlines the intended usage of the new APIs and then how
> they are combined with the existing rte_flow APIs.
> 
> Tunnel endpoints are created on logical ports which support the capability
> using rte_tep_create() using a combination of TEP attributes and
> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is being defined.
> The attrs parameter sets the TEP type, and could be used for other possible
> attributes.
> 
> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
> 
> The values for the headers which make up the tunnel endpointr are then
> defined using spec parameter in the rte flow items (IPv4, UDP and
> VxLAN in this case)
> 
> struct rte_flow_item_ipv4 ipv4_item = {
>                 .hdr = { .src_addr = saddr, .dst_addr = daddr }
> };
> 
> struct rte_flow_item_udp udp_item = {
>                 .hdr = { .src_port = sport, .dst_port = dport }
> };
> 
> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
> 
> struct rte_flow_item pattern[] = {
>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> The tunnel endpoint can then be create on the port. Whether or not any hw
> configuration is required at this point would be hw dependent, but if not
> the context for the TEP is available for use in programming flow, so the
> application is not forced to redefine the TEP parameters on each flow
> addition.
> 
> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
> 
> Once the tep context is created flows can then be directed to that endpoint for
> processing. The following sections will outline how the author envisage flow
> programming will work and also how TEP acceleration can be combined with other
> accelerations.
> 
> 
> Ingress TEP decapsulation, mark and forward to queue:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The flows definition for TEP decapsulation actions should specify the full
> outer packet to be matched at a minimum. The outer packet definition should
> match the tunnel definition in the tep context and the tep flow id. This
> example shows describes matching on the outer, marking the packet with the
> VXLAN VNI and directing to a specified queue of the port.
> 
> Source Packet
> 
>         Decapsulate Outer Hdr
>       /                       \                                    decap outer crc
>      /                         \                                    /          \
>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>      | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> /* Flow Attributes/Items Definitions */
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
> struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
> 
> struct rte_flow_item pattern[] = {
>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>                 { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
> };
> 
> struct rte_flow_action_decap decap_tep = {
>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
> .spec = &tep_item
> };
> 
> struct rte_flow_action_queue queue_action = { .index = qid };
> 
> struct rte_flow_action_port mark_action = { .index = vni };
> 
> struct rte_flow_action actions[] = {
>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                 { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
>                 { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
>                 { .type = RTE_FLOW_ACTION_TYPE_END }
> };

I guess the Ethernet header is kept separate so that it would be
possible to update it separately?
But I don't know of any way to update a specific rte_flow pattern.
Maybe it would be best to combine it with the rest of the TEP and add an
update TEP command?

> 
> /** VERY IMPORTANT NOTE **/
> One of the core concepts of this proposal is that actions which modify the
> packet are defined in the order which they are to be processed. So first decap
> outer ethernet header, then the outer TEP headers.
> I think this is not only logical from a usability point of view, it should also
> simplify the logic required in PMDs to parse the desired actions.

This makes a lot of sense when dealing with encap/decap.
Maybe it would be best to add a new bit from the reserved field in 
rte_flow_attr to express this. Something like this:

struct rte_flow_attr {
         uint32_t group; /**< Priority group. */
         uint32_t priority; /**< Priority level within group. */
         uint32_t ingress:1; /**< Rule applies to ingress traffic. */
         uint32_t egress:1; /**< Rule applies to egress traffic. */
         uint32_t inorder:1; /**< Actions are applied in order. */
         uint32_t reserved:29; /**< Reserved, must be zero. */
};

> 
> struct rte_flow *flow =
>                                rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> The processed packets are delivered to specifed queue with mbuf metadata
> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
> 
>      +-----+------+-----+---------+-----+
>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>      +-----+------+-----+---------+-----+
> 
> 
> Ingress TEP decapsulation switch to port:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> This is intended to represent how a TEP decapsulation could be configured
> in a switching offload case, it makes an assumption that there is a logical
> port representation for all ports on the hw switch in the DPDK application,
> but similar functionality could be achieved by specifying something like a
> VF ID of the device.
> 
> Like the previous scenario the flows definition for TEP decapsulation actions
> should specify the full outer packet to be matched at a minimum but also
> define the elements of the inner match to match against including masks if
> required.

Why is the inner specification necessary?

What if I'd like to decapsulate all VXLAN traffic of some specification?

> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item pattern[] = {
>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item, .mask = &tep_mask },
>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item, .mask = &eth_mask }
>                 { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = &inner_ipv4_item, .mask = &ipv4_mask },
>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item, .mask = &tcp_mask },
>                 { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
> };
> 
> struct rte_flow_action_decap decap_tep = {
>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>                 .item = &outer_tep_item
> };
> 
> struct rte_flow_action_port port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                 { .type = RTE_FLOW_ACTION_TYPE_END }
> }; >
> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> This action will forward the decapsulated packets to another port of the switch
> fabric but no information will on the tunnel or the fact that the packet was
> decapsulated will be passed with it, thereby enable segregation of the
> infrastructure and
> 
> 
> Egress TEP encapsulation:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Encapulsation TEP actions require the flow definitions for the source packet
> and then the actions to do on that, this example shows a ipv4/tcp packet
> action.
> 
> Source Packet
> 
>      +-----+------+-----+---------+-----+
>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>      +-----+------+-----+---------+-----+
> 
> struct rte_flow_attr attr = { .egress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
> struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr = src_addr, .dst_addr = dst_addr } };
> struct rte_flow_item_udp tcp_item = { .hdr = { .src_port = src_port, .dst_port = dst_port } };
> 
> struct rte_flow_item pattern[] = {
>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>                 { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_encap encap_eth = {
>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
> };
> 
> struct rte_flow_action_encap encap_tep = {
>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>                 .item = { .tep = tep, .id = vni }
> };
> struct rte_flow_action_mark port_action = { .index = port_id };

This is the source port_id, where previously it was the destination 
port_id, right?

> 
> struct rte_flow_action actions[] = {
>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                 { .type = RTE_FLOW_ACTION_TYPE_END }
> }
> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> 
>        encapsulating Outer Hdr
>       /                       \                                      outer crc
>      /                         \                                   /          \
>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>      | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> 
> 
> Chaining multiple modification actions eg IPsec and TEP
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> For example the definition for full hw acceleration for an IPsec ESP/Transport
> SA encapsulated in a vxlan tunnel would look something like:
> 
> struct rte_flow_action actions[] = {
>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                 { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session },
>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                 { .type = RTE_FLOW_ACTION_TYPE_END }
> }

Assuming the actions are ordered..
The order here suggests that the packet looks like:
[ETH | IP | UDP | VXLAN | ETH | IP | ESP | payload | ESP TRAILER | CRC]

But, the packet below has the ESP header as the outer header.
Also, shouldn't the encap_eth action come before the encap_tep action?

> 
> 1. Source Packet
>                             +-----+------+-----+---------+-----+
>                             | ETH | IPv4 | TCP | PAYLOAD | CRC |
>                             +-----+------+-----+---------+-----+
> 
> 2. First Action - Tunnel Endpoint Encapsulation
> 
>        +------+-----+-------+-----+------+-----+---------+-----+
>        | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>        +------+-----+-------+-----+------+-----+---------+-----+
> 
> 3. Second Action - IPsec ESP/Transport Security Processing
> 
>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>        | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
> 
> 4. Third Action - Outer Ethernet Encapsulation
> 
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER | OUTER CRC |
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
> 
> This example demonstrates the importance of making the interoperation of
> actions to be ordered, as in the above example, a security
> action can be defined on both the inner and outer packet by simply placing
> another security action at the beginning of the action list.
> 
> It also demonstrates the rationale for not collapsing the Ethernet into
> the TEP definition as when you have multiple encapsulating actions, all
> could potentially be the place where the Ethernet header needs to be
> defined.
> 
> 

With rte_security full protocol offload as presented here we still need
some way to provide and update the Ethernet header. Maybe there should be
two encap_eth actions in this case, one for the outer and another for
the inner?


* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2017-12-24 17:30 ` Shahaf Shuler
@ 2018-01-09 17:30   ` Doherty, Declan
  2018-01-11 21:45     ` John Daley (johndale)
  0 siblings, 1 reply; 15+ messages in thread
From: Doherty, Declan @ 2018-01-09 17:30 UTC (permalink / raw)
  To: Shahaf Shuler, dev

On 24/12/2017 5:30 PM, Shahaf Shuler wrote:
> Hi Declan,
> 

Hey Shahaf, apologies for the delay in responding, I have been out of 
office for the last 2 weeks.

> Friday, December 22, 2017 12:21 AM, Doherty, Declan:
>> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that
>> when used in conjunction with rte_flow enables the configuration of inline
>> data path encapsulation and decapsulation of tunnel endpoint network
>> overlays on accelerated IO devices.
>>
>> The proposed new API would provide for the creation, destruction, and
>> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs
>> to allow the acceleration features to be discovered by applications.
>>
....
> 
> 
> Am not sure I understand why there is a need for the above control methods.
> Are you introducing a new "tep device" ?
> As the tunnel endpoint is sending and receiving Ethernet packets from
> the network I think it should still be counted as Ethernet device but
> with more capabilities (for example it supported encap/decap etc..),
> therefore it should use the Ethdev layer API to query statistics (for
> example).

No, the new APIs are only intended to be a method of creating,
monitoring and deleting tunnel-endpoints on an existing ethdev. The
rationale for APIs separate from rte_flow is the same as that for
rte_security: there is not a 1:1 mapping of TEPs to flows. Many flows
(VNIs in VxLAN for example) can originate/terminate on the same TEP,
therefore managing the TEP independently of the flows being transmitted
on it is important to allow visibility of that endpoint's stats, for
example. I can't see how the existing ethdev API could be used for
statistics, as a single ethdev could be supporting many concurrent TEPs;
we would therefore either need to use the extended stats with many
entries, one for each TEP, or, if we treat a TEP as an attribute of a
port in a similar manner to the way rte_security manages an IPsec SA,
the state of each TEP can be monitored and managed independently of both
the overall port and the flows being transported on that endpoint.

> As for the capabilities - what specifically you had in mind? The current usage you show with tep is with rte_flow rules. There are no capabilities currently for rte_flow supported actions/pattern. To check such capabilities application uses rte_flow_validate.

I envisaged that the application should be able to see if an ethdev can 
support TEP in the rx/tx offloads, and then the rte_tep_capabilities 
would allow applications to query what tunnel endpoint protocols are 
supported etc. I would like a simple mechanism to allow users to see if 
a particular tunnel endpoint type is supported without having to build 
actual flows to validate.
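
A purely hypothetical sketch of the kind of query envisaged; the RFC does not
define the contents of struct rte_tep_capabilities, so the supported_types
bitmask used below is an assumption for illustration only:

struct rte_tep_capabilities caps;

/* Hypothetical: supported_types is not defined anywhere in this RFC. */
if (rte_tep_capabilities_get(port_id, &caps) == 0 &&
    (caps.supported_types & (1u << RTE_TEP_TYPE_VXLAN)))
        printf("port %u supports VxLAN TEP offload\n", port_id);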

> Regarding the creation/destroy of tep. Why not simply use rte_flow API and avoid this extra control?
> For example - with 17.11 APIs, application can put the port in isolate mode, and insert a flow_rule to catch only IPv4 VXLAN traffic and direct to some queue/do RSS. Such operation, per my understanding, will create a tunnel endpoint. What are the down sides of doing it with the current APIs?

That doesn't enable encapsulation and decapsulation of the outer tunnel
endpoint in the hw as far as I know, quite apart from the inability to
monitor the endpoint statistics I mentioned above. It would also require
that you redefine the endpoint's parameters every time you wish to add a
new flow to it. I think having the rte_tep object semantics should also
simplify the ability to enable a full vswitch offload of TEP where the
hw is handling both encap/decap and switching to a particular port.

> 
>>
>>
>> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
>> enhanced to add a new flow item type. This contains a pointer to the TEP
>> context as well as the overlay flow id to which the traffic flow is associated.
>>
>> struct rte_flow_item_tep {
>>                 struct rte_tep *tep;
>>                 uint32_t flow_id;
>> }
> 
> Can you provide more detailed definition about the flow id ? to which field from the packet headers it refers to?
> On your below examples it looks like it is to match the VXLAN vni in case of VXLAN, what about the other protocols? And also, why not using the already exists VXLAN item?

I have initially only been looking at a couple of the tunnel endpoint
protocols, namely Geneve, NvGRE, and VxLAN, but the idea here is to
allow the user to define the VNI in the case of Geneve and VxLAN and the
VSID in the case of NvGRE on a per flow basis, as per my understanding
these are used to identify the source/destination hosts on the overlay
network independently of the endpoint they are transported across.

The VxLAN item is used in the creation of the TEP object. Using the TEP
object just removes the need for the user to constantly redefine all the
tunnel parameters, and also, depending on the hw implementation, it may
simplify the driver's work if it knows the exact endpoint the action is
for instead of having to look it up on each flow addition.

> 
> Generally I like the idea of separating the encap/decap context from the action. However looks like the rte_flow_item has double meaning on this RFC, once for the classification and once for the action.
>  From the top of my head I would think of an API which separate those, and re-use the existing flow items. Something like:
> 
>   struct rte_flow_item pattern[] = {
>                  { set of already exists pattern  },
>                  { ... },
>                  { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> encap_ctx = create_enacap_context(pattern)
> 
> rte_flow_action actions[] = {
> 	{ .type RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx}
> }

I'm not sure I fully understand what you're asking here, but in general
for encap you would only define the inner part of the packet in the
match pattern criteria and the actual outer tunnel headers would be
defined in the action.

I guess there is some replication on the decap side as proposed, as the
TEP object is used in both the pattern and the action. Possibly you
could get away with having no TEP object defined in the action data, but
I prefer keeping the API symmetrical for encap/decap actions at the
cost of some extra verbosity.

>   
...
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
       [not found] ` <3560e76a-c99b-4dc3-9678-d7975acf67c9@mellanox.com>
  2018-01-02 10:50   ` Boris Pismenny
@ 2018-01-10 16:04   ` Doherty, Declan
  1 sibling, 0 replies; 15+ messages in thread
From: Doherty, Declan @ 2018-01-10 16:04 UTC (permalink / raw)
  To: Boris Pismenny, dev
  Cc: Rony Efraim <ronye@mellanox.com>; Aviad Yehezkel
	<aviadye@mellanox.com>; Shahaf Shuler

Adding discussion back to list.

On 24/12/2017 2:18 PM, Boris Pismenny wrote:
> Hi Declan
> 
> On 12/22/2017 12:21 AM, Doherty, Declan wrote:
>> This RFC contains a proposal to add a new tunnel endpoint API to DPDK 
>> that when used
>> in conjunction with rte_flow enables the configuration of inline data 
>> path encapsulation
>> and decapsulation of tunnel endpoint network overlays on accelerated 
>> IO devices.
>>
>> The proposed new API would provide for the creation, destruction, and
>> monitoring of a tunnel endpoint in supporting hw, as well as 
>> capabilities APIs to allow the
>> acceleration features to be discovered by applications.
>>
>> /** Tunnel Endpoint context, opaque structure */
>> struct rte_tep;
>>
>> enum rte_tep_type {
>>                 RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>>                 RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>>                 ...
>> };
>>
>> /** Tunnel Endpoint Attributes */
>> struct rte_tep_attr {
>>                 enum rte_type_type type;
>>
>>                 /* other endpoint attributes here */
>> }
>>
>> /**
>> * Create a tunnel end-point context as specified by the flow attribute 
>> and pattern
>> *
>> * @param   port_id     Port identifier of Ethernet device.
>> * @param   attr        Flow rule attributes.
>> * @param   pattern     Pattern specification by list of rte_flow_items.
>> * @return
>> *  - On success returns pointer to TEP context
>> *  - On failure returns NULL
>> */
>> struct rte_tep *rte_tep_create(uint16_t port_id,
>>                                struct rte_tep_attr *attr, struct 
>> rte_flow_item pattern[])
>>
>> /**
>> * Destroy an existing tunnel end-point context. All the end-points 
>> context
>> * will be destroyed, so all active flows using tep should be freed before
>> * destroying context.
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   tep        Tunnel endpoint context
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
>>
>> /**
>> * Get tunnel endpoint statistics
>> *
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   tep        Tunnel endpoint context
>> * @param   stats      Tunnel endpoint statistics
>> *
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> Int
>> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>>                                struct rte_tep_stats *stats)
>>
>> /**
>> * Get ports tunnel endpoint capabilities
>> *
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   capabilities        Tunnel endpoint capabilities
>> *
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> int
>> rte_tep_capabilities_get(uint16_t port_id,
>>                                struct rte_tep_capabilities *capabilities)
>>
>>
>> To direct traffic flows to hw terminated tunnel endpoint the rte_flow 
>> API is
>> enhanced to add a new flow item type. This contains a pointer to the
>> TEP context as well as the overlay flow id to which the traffic flow is
>> associated.
>>
>> struct rte_flow_item_tep {
>>                 struct rte_tep *tep;
>>                 uint32_t flow_id;
>> }
>>
>> Also 2 new generic actions types are added encapsulation and 
>> decapsulation.
>>
>> RTE_FLOW_ACTION_TYPE_ENCAP
>> RTE_FLOW_ACTION_TYPE_DECAP
>>
>> struct rte_flow_action_encap {
>>                 struct rte_flow_item *item;
>> }
>>
>> struct rte_flow_action_decap {
>>                 struct rte_flow_item *item;
>> }
>>
>> The following section outlines the intended usage of the new APIs and 
>> then how
>> they are combined with the existing rte_flow APIs.
>>
>> Tunnel endpoints are created on logical ports which support the 
>> capability
>> using rte_tep_create() using a combination of TEP attributes and
>> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is 
>> being defined.
>> The attrs parameter sets the TEP type, and could be used for other 
>> possible
>> attributes.
>>
>> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
>>
>> The values for the headers which make up the tunnel endpointr are then
>> defined using spec parameter in the rte flow items (IPv4, UDP and
>> VxLAN in this case)
>>
>> struct rte_flow_item_ipv4 ipv4_item = {
>>                 .hdr = { .src_addr = saddr, .dst_addr = daddr }
>> };
>>
>> struct rte_flow_item_udp udp_item = {
>>                 .hdr = { .src_port = sport, .dst_port = dport }
>> };
>>
>> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = 
>> &vxlan_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> The tunnel endpoint can then be create on the port. Whether or not any hw
>> configuration is required at this point would be hw dependent, but if not
>> the context for the TEP is available for use in programming flow, so the
>> application is not forced to redefine the TEP parameters on each flow
>> addition.
>>
>> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
>>
>> Once the tep context is created flows can then be directed to that 
>> endpoint for
>> processing. The following sections will outline how the author 
>> envisage flow
>> programming will work and also how TEP acceleration can be combined 
>> with other
>> accelerations.
>>
>>
>> Ingress TEP decapsulation, mark and forward to queue:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> The flows definition for TEP decapsulation actions should specify the 
>> full
>> outer packet to be matched at a minimum. The outer packet definition 
>> should
>> match the tunnel definition in the tep context and the tep flow id. This
>> example shows describes matching on the outer, marking the packet with 
>> the
>> VXLAN VNI and directing to a specified queue of the port.
>>
>> Source Packet
>>
>>        Decapsulate Outer Hdr
>>      /                       \                                    decap outer crc
>>     /                         \                                    /          \
>>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>
>> /* Flow Attributes/Items Definitions */
>>
>> struct rte_flow_attr attr = { .ingress = 1 };
>>
>> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, 
>> .type = ether_type };
>> struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_decap decap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = 
>> ether_type }
>> };
>>
>> struct rte_flow_action_decap decap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>> .spec = &tep_item
>> };
>>
>> struct rte_flow_action_queue queue_action = { .index = qid };
>>
>> struct rte_flow_action_port mark_action = { .index = vni };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = 
>> &decap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = 
>> &decap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = 
>> &mark_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = 
>> &queue_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> };
> 
> I guess the Ethernet header is kept separate so that it would be 
> possible to update it separately?
> But, I don't know of anyway to update a specific rte_flow pattern.
> Maybe it would be best to combine it with the rest of the TEP and add an 
> update TEP command?

The main reason I had for proposing the Ethernet header as a 
separate entity from the TEP was to minimize replication of fields 
when multiple encapsulation actions are chained together. For example, 
if a tunnel IPsec action is chained with a TEP action, which of the 
actions should contain the Ethernet header information? Both could 
possibly define it, but it would depend on which is the last 
encapsulation. That said, it may make sense to have an update function 
and just live with the small replication.
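
Purely to illustrate the update-function idea, something along these lines
(a hypothetical prototype, not part of the RFC as posted):

/* Replace the outer header definition (e.g. the Ethernet addresses) of an
 * existing TEP without recreating it or the flows using it. */
int rte_tep_update(uint16_t port_id, struct rte_tep *tep,
                   struct rte_flow_item pattern[]);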

> 
>>
>> /** VERY IMPORTANT NOTE **/
>> One of the core concepts of this proposal is that actions which modify 
>> the
>> packet are defined in the order which they are to be processed. So 
>> first decap
>> outer ethernet header, then the outer TEP headers.
>> I think this is not only logical from a usability point of view, it 
>> should also
>> simplify the logic required in PMDs to parse the desired actions.
> 
> This makes a lot of sense when dealing with encap/decap.
> Maybe it would be best to add a new bit from the reserved field in 
> rte_flow_attr to express this. Something like this:
> 
> struct rte_flow_attr {
>          uint32_t group; /**< Priority group. */
>          uint32_t priority; /**< Priority level within group. */
>          uint32_t ingress:1; /**< Rule applies to ingress traffic. */
>          uint32_t egress:1; /**< Rule applies to egress traffic. */
>      uint32_t inorder:1; /**< Actions are applied in order. */
>          uint32_t reserved:29; /**< Reserved, must be zero. */
> };
> 

That makes sense to me.
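
For example, assuming the proposed bit were added, the earlier decap rule
attributes would simply become:

/* 'inorder' is the suggested new bit, it does not exist today */
struct rte_flow_attr attr = { .ingress = 1, .inorder = 1 };
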

>>
>> struct rte_flow *flow =
>>                                rte_flow_create(port_id, &attr, 
>> pattern, actions, &err);
>>
>> The processed packets are delivered to specifed queue with mbuf metadata
>> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
>>
>>      +-----+------+-----+---------+-----+
>>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>      +-----+------+-----+---------+-----+
>>
>>
>> Ingress TEP decapsulation switch to port:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> This is intended to represent how a TEP decapsulation could be configured
>> in a switching offload case, it makes an assumption that there is a 
>> logical
>> port representation for all ports on the hw switch in the DPDK 
>> application,
>> but similar functionality could be achieved by specifying something 
>> like a
>> VF ID of the device.
>>
>> Like the previous scenario the flows definition for TEP decapsulation 
>> actions
>> should specify the full outer packet to be matched at a minimum but also
>> define the elements of the inner match to match against including 
>> masks if
>> required.
> 
> Why is the inner specification necessary?

This example is for an OvS-like use case where you want to match on a 
specific flow, so you are matching against both the outer and the inner 
headers and only decapsulating the outer for that particular flow.

>  > What if I'd like to decapsulate all VXLAN traffic of some specification?

I think the previous example to this shows that case.

> 
>>
>> struct rte_flow_attr attr = { .ingress = 1 };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = 
>> &outer_eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = 
>> &outer_tep_item, .mask = &tep_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = 
>> &inner_eth_item, .mask = &eth_mask }
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = 
>> &inner_ipv4_item, .mask = &ipv4_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = 
>> &inner_tcp_item, .mask = &tcp_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_decap decap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = 
>> ether_type }
>> };
>>
>> struct rte_flow_action_decap decap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>>                 .item = &outer_tep_item
>> };
>>
>> struct rte_flow_action_port port_action = { .index = port_id };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = 
>> &decap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = 
>> &decap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = 
>> &port_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> }; >
>> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, 
>> actions, &err);
>>
>> This action will forward the decapsulated packets to another port of 
>> the switch
>> fabric but no information will on the tunnel or the fact that the 
>> packet was
>> decapsulated will be passed with it, thereby enable segregation of the
>> infrastructure and
>>
>>
>> Egress TEP encapsulation:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Encapulsation TEP actions require the flow definitions for the source 
>> packet
>> and then the actions to do on that, this example shows a ipv4/tcp packet
>> action.
>>
>> Source Packet
>>
>>      +-----+------+-----+---------+-----+
>>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>      +-----+------+-----+---------+-----+
>>
>> struct rte_flow_attr attr = { .egress = 1 };
>>
>> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, 
>> .type = ether_type };
>> struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr = src_addr, 
>> .dst_addr = dst_addr } };
>> struct rte_flow_item_udp tcp_item = { .hdr = { .src_port = src_port, 
>> .dst_port = dst_port } };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_encap encap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = 
>> ether_type }
>> };
>>
>> struct rte_flow_action_encap encap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>>                 .item = { .tep = tep, .id = vni }
>> };
>> struct rte_flow_action_mark port_action = { .index = port_id };
> 
> This is the source port_id, where previously it was the destination 
> port_id, right?
>
Apologies, there is a typo in the above action definition; it should be:

struct rte_flow_action_port port_action = { .index = port_id };

So it should also be read as the destination port id.


SW       +-+ +-+ +-+ +-+ +-+
ETHDEV   |0| |1| |2| |3| |4|
          +++ +-+ +++ +++ +++
           ^       ^   ^   ^
----------|-------|---|---|-
           |       v   v   v
HW        |      +-+ +-+ +-+
           |      |A| |B| |C|  Host Ports
           |      +++ +++ +++
           |       |   |   |
           |      ++---+---++
           |      |   PPP   |  Packet Processing Pipeline
           |      +-+-----+-+    (including switching)
           |        |     |
           |       +++   +++
           +------>|D|   |E|   Physical Ports
                   +-+   +-+

So for the above example, if the traffic originates on port B of the 
switch from the hw perspective and after encapsulation will be 
transmitted on port D, the flow rule would be created as an egress rule 
on ethdev port_id=3, and the port_action would direct it to ethdev 
port_id=0.
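
A minimal sketch of that mapping, reusing the egress encap pattern/actions
from the example above (the ethdev numbering follows the diagram):

/* Host port B is ethdev port 3, physical port D is ethdev port 0 here */
struct rte_flow_attr attr = { .egress = 1 };
struct rte_flow_action_port port_action = { .index = 0 }; /* to port D */

struct rte_flow *flow = rte_flow_create(3 /* port B */, &attr,
                                        pattern, actions, &err);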


>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = 
>> &encap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = 
>> &encap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = 
>> &port_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> }
>> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, 
>> actions, &err);
>>
>>
>>       encapsulating Outer Hdr
>>      /                       \                                      outer crc
>>     /                         \                                   /          \
>>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>
>>
>>
>> Chaining multiple modification actions eg IPsec and TEP
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> For example the definition for full hw acceleration for an IPsec 
>> ESP/Transport
>> SA encapsulated in a vxlan tunnel would look something like:
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = 
>> &encap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = 
>> &sec_session },
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = 
>> &encap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> }
> 
> Assuming the actions are ordered..
> The order here suggests that the packet looks like:
> [ETH | IP | UDP | VXLAN | ETH | IP | ESP | payload | ESP TRAILER | CRC]
> 
> But, the packet below has the ESP header as the outer header.
> Also, shouldn't the encap_eth action come before the encap_tep action >

Maybe un-intuitively :) I have the actions ordered as they are performed 
on the packet, first to last, so I think the opposite of the order you 
are reading them in.

So the first thing is to do the encapsulation of the TEP, which adds 
the IP|UDP|VxLAN headers. I don't have an Ethernet encapsulation after 
this, as in this example we are applying a security action which could 
possibly change the outer IP. Then comes the security action, which is an 
ESP/Transport modification, and lastly the Ethernet encapsulation.

  [ ETH | IP | ESP | ** UDP | VXLAN | ETH | IP | TCP | PAYLOAD | CRC | ESP TRAILER ** | CRC ]

Packet encrypted between ** **
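
For clarity, the same action list from the example above, annotated with
what each stage does (sketch only, objects as in the RFC examples):

struct rte_flow_action actions[] = {
        /* 1. add the outer IPv4 | UDP | VxLAN headers (no Ethernet yet,
         *    the security action below may still modify the outer IP) */
        { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
        /* 2. IPsec ESP/Transport processing on the outer IP packet */
        { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session },
        /* 3. finally prepend the outer Ethernet header */
        { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
        { .type = RTE_FLOW_ACTION_TYPE_END }
};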

>>
>> 1. Source Packet
>>                             +-----+------+-----+---------+-----+
>>                             | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>                             +-----+------+-----+---------+-----+
>>
>> 2. First Action - Tunnel Endpoint Encapsulation
>>
>>        +------+-----+-------+-----+------+-----+---------+-----+
>>        | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>        +------+-----+-------+-----+------+-----+---------+-----+
>>
>> 3. Second Action - IPsec ESP/Transport Security Processing
>>
>>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>>        | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>>
>> 4. Third Action - Outer Ethernet Encapsulation
>>
>> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
>> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER | OUTER CRC |
>> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
>>
>> This example demonstrates the importance of making the interoperation of
>> actions to be ordered, as in the above example, a security
>> action can be defined on both the inner and outer packet by simply 
>> placing
>> another security action at the beginning of the action list.
>>
>> It also demonstrates the rationale for not collapsing the Ethernet into
>> the TEP definition as when you have multiple encapsulating actions, all
>> could potentially be the place where the Ethernet header needs to be
>> defined.
>>
>>
> 
> With rte_security full protocol offload as presented here we still need 
> someway to provide and update the Ethernet header. Maybe there should be 
> two encap_eth actions in this case. One for the outer and another for 
> the inner?

Yes, it's the same issue that you raised above. I think it possibly 
makes sense that both rte_security and rte_tep support update functions 
and allow definition of an Ethernet header, and that we have a mechanism 
for defining whether the Ethernet header is required for the 
rte_tep/rte_security action.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2017-12-21 22:21 [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement Doherty, Declan
  2017-12-24 17:30 ` Shahaf Shuler
       [not found] ` <3560e76a-c99b-4dc3-9678-d7975acf67c9@mellanox.com>
@ 2018-01-11 21:44 ` John Daley (johndale)
  2018-01-23 15:50   ` Doherty, Declan
  2018-02-13 17:05 ` Adrien Mazarguil
  3 siblings, 1 reply; 15+ messages in thread
From: John Daley (johndale) @ 2018-01-11 21:44 UTC (permalink / raw)
  To: Doherty, Declan, dev

Hi,
One comment on DECAP action and a "feature request".  I'll also reply to the top of thread discussion separately. Thanks for the RFC Declan!

Feature request associated with ENCAP action:

VPP (and probably other apps) would like the ability to simply specify an independent tunnel ID as part of egress match criteria in an rte_flow rule. Then egress packets could specify a tunnel ID and valid flag in the mbuf. If it matched the rte_flow tunnel ID item, a simple lookup in the nic could be done and the associated actions (particularly ENCAP) executed. The application already knows the tunnel that the packet is associated with, so there is no need to have the nic do matching on a header pattern. Plus it's possible that packet headers alone are not enough to determine the correct encap action (the bridge where the packet came from might be required).

This would require a new mbuf field to specify the tunnel ID (maybe in tx_offload) and a valid flag. It would also require a new rte_flow item type for matching the tunnel ID (like RTE_FLOW_ITEM_TYPE_META_TUNNEL_ID).
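
A rough sketch of what I mean (the item type, struct and mbuf fields below are all hypothetical, nothing like them exists in rte_flow/rte_mbuf today):

/* hypothetical metadata item matching an application-assigned tunnel ID */
struct rte_flow_item_meta_tunnel_id {
        uint32_t id;
};

struct rte_flow_item_meta_tunnel_id tid = { .id = 42 };

struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_META_TUNNEL_ID, .spec = &tid },
        { .type = RTE_FLOW_ITEM_TYPE_END }
};

/* On transmit the application would tag the mbuf, e.g. (hypothetical field
 * and flag):
 *     mbuf->tunnel_id = 42;
 *     mbuf->ol_flags |= PKT_TX_TUNNEL_ID_VALID;
 */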

Is something like this being considered by others? If not, should it be part of this RFC or a new one? I think this would be the first metadata match criterion in rte_flow, but I could see others following. 

-johnd

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Doherty, Declan
> Sent: Thursday, December 21, 2017 2:21 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
> 
> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that
> when used in conjunction with rte_flow enables the configuration of inline
> data path encapsulation and decapsulation of tunnel endpoint network
> overlays on accelerated IO devices.
> 
> The proposed new API would provide for the creation, destruction, and
> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs
> to allow the acceleration features to be discovered by applications.
> 
> /** Tunnel Endpoint context, opaque structure */ struct rte_tep;
> 
> enum rte_tep_type {
>                RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>                RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>                ...
> };
> 
> /** Tunnel Endpoint Attributes */
> struct rte_tep_attr {
>                enum rte_type_type type;
> 
>                /* other endpoint attributes here */ }
> 
> /**
> * Create a tunnel end-point context as specified by the flow attribute and
> pattern
> *
> * @param   port_id     Port identifier of Ethernet device.
> * @param   attr        Flow rule attributes.
> * @param   pattern     Pattern specification by list of rte_flow_items.
> * @return
> *  - On success returns pointer to TEP context
> *  - On failure returns NULL
> */
> struct rte_tep *rte_tep_create(uint16_t port_id,
>                               struct rte_tep_attr *attr, struct rte_flow_item pattern[])
> 
> /**
> * Destroy an existing tunnel end-point context. All the end-points context
> * will be destroyed, so all active flows using tep should be freed before
> * destroying context.
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
> 
> /**
> * Get tunnel endpoint statistics
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @param   stats      Tunnel endpoint statistics
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> Int
> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>                               struct rte_tep_stats *stats)
> 
> /**
> * Get ports tunnel endpoint capabilities
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   capabilities        Tunnel endpoint capabilities
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int
> rte_tep_capabilities_get(uint16_t port_id,
>                               struct rte_tep_capabilities *capabilities)
> 
> 
> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
> enhanced to add a new flow item type. This contains a pointer to the TEP
> context as well as the overlay flow id to which the traffic flow is associated.
> 
> struct rte_flow_item_tep {
>                struct rte_tep *tep;
>                uint32_t flow_id;
> }
> 
> Also 2 new generic actions types are added encapsulation and decapsulation.
> 
> RTE_FLOW_ACTION_TYPE_ENCAP
> RTE_FLOW_ACTION_TYPE_DECAP
> 
> struct rte_flow_action_encap {
>                struct rte_flow_item *item; }
> 
> struct rte_flow_action_decap {
>                struct rte_flow_item *item; }
> 
> The following section outlines the intended usage of the new APIs and then
> how they are combined with the existing rte_flow APIs.
> 
> Tunnel endpoints are created on logical ports which support the capability
> using rte_tep_create() using a combination of TEP attributes and
> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is being
> defined.
> The attrs parameter sets the TEP type, and could be used for other possible
> attributes.
> 
> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
> 
> The values for the headers which make up the tunnel endpointr are then
> defined using spec parameter in the rte flow items (IPv4, UDP and VxLAN in
> this case)
> 
> struct rte_flow_item_ipv4 ipv4_item = {
>                .hdr = { .src_addr = saddr, .dst_addr = daddr } };
> 
> struct rte_flow_item_udp udp_item = {
>                .hdr = { .src_port = sport, .dst_port = dport } };
> 
> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>                { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> The tunnel endpoint can then be create on the port. Whether or not any hw
> configuration is required at this point would be hw dependent, but if not the
> context for the TEP is available for use in programming flow, so the
> application is not forced to redefine the TEP parameters on each flow
> addition.
> 
> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
> 
> Once the tep context is created flows can then be directed to that endpoint
> for processing. The following sections will outline how the author envisage
> flow programming will work and also how TEP acceleration can be combined
> with other accelerations.
> 
> 
> Ingress TEP decapsulation, mark and forward to queue:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The flows definition for TEP decapsulation actions should specify the full
> outer packet to be matched at a minimum. The outer packet definition
> should match the tunnel definition in the tep context and the tep flow id.
> This example shows describes matching on the outer, marking the packet
> with the VXLAN VNI and directing to a specified queue of the port.
> 
> Source Packet
> 
>        Decapsulate Outer Hdr
>      /                       \                                    decap outer crc
>     /                         \                                    /          \
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER
> CRC |
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> /* Flow Attributes/Items Definitions */
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type =
> ether_type }; struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
> 
> struct rte_flow_action_decap decap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item };
> 
> struct rte_flow_action_queue queue_action = { .index = qid };
> 
> struct rte_flow_action_port mark_action = { .index = vni };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
>                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END } };
> 
Does the conf for the RTE_FLOW_ACTION_TYPE_DECAP action specify the first pattern to decap up to? In the above, is the 1st decap action needed? Wouldn't the 2nd action decap up to the matching VNI?

On our nic, we would have to translate the decap actions into a (level, offset) pair, which requires a lot of effort. Since the packet is already matched, perhaps 'struct rte_flow_item' is not the right thing to pass to the decap action and a simple (layer, offset) could be used instead? E.g. to decap up to the inner Ethernet header of a VxLAN packet:

struct rte_flow_action_decap {
               uint32_t level;
               uint8_t offset;
};

struct rte_flow_action_decap decap_tep = {
               .level = RTE_PTYPE_L4_UDP,
               .offset = sizeof(struct vxlan_hdr)
};

Using RTE_PTYPE... is just for illustration - we might want to define our own layers in rte_flow.h. You could specify inner packet layers, and the offset need not be restricted to the size of the header, so that decap to an absolute offset could be allowed, e.g.:

struct rte_flow_action_decap decap_42 = {
               .level = RTE_PTYPE_L2_ETHER,
               .offset = 42
};

> /** VERY IMPORTANT NOTE **/
> One of the core concepts of this proposal is that actions which modify the
> packet are defined in the order which they are to be processed. So first
> decap outer ethernet header, then the outer TEP headers.
> I think this is not only logical from a usability point of view, it should also
> simplify the logic required in PMDs to parse the desired actions.
> 
> struct rte_flow *flow =
>                               rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> The processed packets are delivered to specifed queue with mbuf metadata
> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
> 
>     +-----+------+-----+---------+-----+
>     | ETH | IPv4 | TCP | PAYLOAD | CRC |
>     +-----+------+-----+---------+-----+
> 
> 
> Ingress TEP decapsulation switch to port:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> This is intended to represent how a TEP decapsulation could be configured in
> a switching offload case, it makes an assumption that there is a logical port
> representation for all ports on the hw switch in the DPDK application, but
> similar functionality could be achieved by specifying something like a VF ID of
> the device.
> 
> Like the previous scenario the flows definition for TEP decapsulation actions
> should specify the full outer packet to be matched at a minimum but also
> define the elements of the inner match to match against including masks if
> required.
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item,
> .mask = &tep_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item,
> .mask = &eth_mask }
>                { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = &inner_ipv4_item,
> .mask = &ipv4_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item,
> .mask = &tcp_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
> 
> struct rte_flow_action_decap decap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
>                .item = &outer_tep_item
> };
> 
> struct rte_flow_action_port port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END } };
> 
> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions,
> &err);
> 
> This action will forward the decapsulated packets to another port of the
> switch fabric but no information will on the tunnel or the fact that the packet
> was decapsulated will be passed with it, thereby enable segregation of the
> infrastructure and
> 
> 
> Egress TEP encapsulation:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Encapulsation TEP actions require the flow definitions for the source packet
> and then the actions to do on that, this example shows a ipv4/tcp packet
> action.
> 
> Source Packet
> 
>     +-----+------+-----+---------+-----+
>     | ETH | IPv4 | TCP | PAYLOAD | CRC |
>     +-----+------+-----+---------+-----+
> 
> struct rte_flow_attr attr = { .egress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type =
> ether_type }; struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr =
> src_addr, .dst_addr = dst_addr } }; struct rte_flow_item_udp tcp_item = {
> .hdr = { .src_port = src_port, .dst_port = dst_port } };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>                { .type = RTE_FLOW_ITEM_TYPE_END } };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_encap encap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
> 
> struct rte_flow_action_encap encap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
>                .item = { .tep = tep, .id = vni } }; struct rte_flow_action_mark
> port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END } } struct rte_flow *flow =
> rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> 
>       encapsulating Outer Hdr
>      /                       \                                      outer crc
>     /                         \                                   /          \
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER
> CRC |
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> 
> 
> Chaining multiple modification actions eg IPsec and TEP
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> For example the definition for full hw acceleration for an IPsec ESP/Transport
> SA encapsulated in a vxlan tunnel would look something like:
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session
> },
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_END } }
> 
> 1. Source Packet
>                            +-----+------+-----+---------+-----+
>                            | ETH | IPv4 | TCP | PAYLOAD | CRC |
>                            +-----+------+-----+---------+-----+
> 
> 2. First Action - Tunnel Endpoint Encapsulation
> 
>       +------+-----+-------+-----+------+-----+---------+-----+
>       | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>       +------+-----+-------+-----+------+-----+---------+-----+
> 
> 3. Second Action - IPsec ESP/Transport Security Processing
> 
>       +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>       | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>       +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
> 
> 4. Third Action - Outer Ethernet Encapsulation
> 
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+---
> --------+
> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
> OUTER CRC |
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+---
> --------+
> 
> This example demonstrates the importance of making the interoperation of
> actions to be ordered, as in the above example, a security action can be
> defined on both the inner and outer packet by simply placing another
> security action at the beginning of the action list.
> 
> It also demonstrates the rationale for not collapsing the Ethernet into the TEP
> definition as when you have multiple encapsulating actions, all could
> potentially be the place where the Ethernet header needs to be defined.
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-01-09 17:30   ` Doherty, Declan
@ 2018-01-11 21:45     ` John Daley (johndale)
  2018-01-16  8:22       ` Shahaf Shuler
  2018-01-23 14:46       ` Doherty, Declan
  0 siblings, 2 replies; 15+ messages in thread
From: John Daley (johndale) @ 2018-01-11 21:45 UTC (permalink / raw)
  To: Doherty, Declan, Shahaf Shuler, dev

Hi Declan and Shahaf,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Doherty, Declan
> Sent: Tuesday, January 09, 2018 9:31 AM
> To: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
> 
> On 24/12/2017 5:30 PM, Shahaf Shuler wrote:
> > Hi Declan,
> >
> 
> Hey Shahaf, apologies for the delay in responding, I have been out of office
> for the last 2 weeks.
> 
> > Friday, December 22, 2017 12:21 AM, Doherty, Declan:
> >> This RFC contains a proposal to add a new tunnel endpoint API to DPDK
> >> that when used in conjunction with rte_flow enables the configuration
> >> of inline data path encapsulation and decapsulation of tunnel
> >> endpoint network overlays on accelerated IO devices.
> >>
> >> The proposed new API would provide for the creation, destruction, and
> >> monitoring of a tunnel endpoint in supporting hw, as well as
> >> capabilities APIs to allow the acceleration features to be discovered by
> applications.
> >>
> ....
> >
> >
> > Am not sure I understand why there is a need for the above control
> methods.
> > Are you introducing a new "tep device" ? > As the tunnel endpoint is
> > sending and receiving Ethernet packets from
> the network I think it should still be counted as Ethernet device but with
> more capabilities (for example it supported encap/decap etc..), therefore it
> should use the Ethdev layer API to query statistics (for example).
> 
> No, the new APIs are only intended to be a method of creating, monitoring
> and deleting tunnel-endpoints on an existing ethdev. The rationale for APIs
> separate to rte_flow are the same as that in the rte_security, there is not a
> 1:1 mapping of TEPs to flows. Many flows (VNI's in VxLAN for example) can
> be originate/terminate on the same TEP, therefore managing the TEP
> independently of the flows being transmitted on it is important to allow
> visibility of that endpoint stats for example.

I don't quite understand what you mean by tunnel and flow here. Can you define exactly what you mean? Flow is an overloaded word in our world. I think that defining it will make understanding the RFC a little easier.

Taking VxLAN, I think of the tunnel as including up through the VxLAN header, including the VNI. If you go by this definition, I would consider a flow to be all packets with the same VNI and the same 5-tuple hash of the inner packet. Is this what you mean by tunnel (or TEP) and flow here?

With these definitions, VPP for example might need up to a couple thousand TEPs on an interface and each TEP could have hundreds or thousands of flows. It would be quite possible to have 1 rte_flow rule per TEP (or 2: ingress/decap and egress/encap). The COUNT action could be used to count the number of packets through each TEP. Is this adequate, or are you proposing that we need a mechanism to get stats of flows within each TEP? Is that the main point of the API? Assuming there is no need for stats on a per-TEP/flow basis, is there anything else the API adds?

> I can't see how the existing
> ethdev API could be used for statistics as a single ethdev could be supporting
> may concurrent TEPs, therefore we would either need to use the extended
> stats with many entries, one for each TEP, or if we treat a TEP as an attribute
> of a port in a similar manner to the way rte_security manages an IPsec SA,
> the state of each TEP can be monitored and managed independently of both
> the overall port or the flows being transported on that endpoint.

Assuming we can define one rte_flow rule per TEP, does what you propose give us anything more than just using the COUNT action?
> 
> > As for the capabilities - what specifically you had in mind? The current
> usage you show with tep is with rte_flow rules. There are no capabilities
> currently for rte_flow supported actions/pattern. To check such capabilities
> application uses rte_flow_validate.
> 
> I envisaged that the application should be able to see if an ethdev can
> support TEP in the rx/tx offloads, and then the rte_tep_capabilities would
> allow applications to query what tunnel endpoint protocols are supported
> etc. I would like a simple mechanism to allow users to see if a particular
> tunnel endpoint type is supported without having to build actual flows to
> validate.

I can see the value of that, but in the end wouldn't the API call rte_flow_validate anyway? Maybe we don't add the layer now or maybe it doesn't really belong in DPDK? I'm in favor of deferring the capabilities API until we know it's really needed. I hate to see special capabilities APIs start sneaking in after we decided to go the rte_flow_validate route and users are starting to get used to it.
> 
> > Regarding the creation/destroy of tep. Why not simply use rte_flow API
> and avoid this extra control?
> > For example - with 17.11 APIs, application can put the port in isolate mode,
> and insert a flow_rule to catch only IPv4 VXLAN traffic and direct to some
> queue/do RSS. Such operation, per my understanding, will create a tunnel
> endpoint. What are the down sides of doing it with the current APIs?
> 
> That doesn't enable encapsulation and decapsulation of the outer tunnel
> endpoint in the hw as far as I know. Apart from the inability to monitor the
> endpoint statistics I mentioned above. It would also require that you
> redefine the endpoints parameters ever time to you wish to add a new flow
> to it. I think the having the rte_tep object semantics should also simplify the
> ability to enable a full vswitch offload of TEP where the hw is handling both
> encap/decap and switching to a particular port.

If we have the ingress/decap and egress/encap actions and 1 rte_flow rule per TEP and use the COUNT action, I think we get all but the last bit. For that, perhaps the application could keep ingress and egress rte_flow templates for each tunnel type (VxLAN, GRE, ...). Then copying the template and filling in the outer packet info and tunnel id is all that would be required. We could also define these in rte_flow.h?
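
Something along these lines (a sketch of the template idea only, the struct and its use are assumptions rather than an existing API):

/* hypothetical per-tunnel-type egress template kept by the application */
struct tep_encap_template {
        struct rte_flow_item_ipv4 outer_ipv4;  /* outer addresses, per TEP */
        struct rte_flow_item_udp outer_udp;
        struct rte_flow_item_vxlan vxlan;       /* VNI, per TEP */
        struct rte_flow_action_encap encap_tep;
        struct rte_flow_action_encap encap_eth;
};

/* Per TEP: copy the template, fill in the outer addresses and VNI, build
 * the pattern/actions arrays from it and call rte_flow_create() once. */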

> 
> >
> >>
> >>
> >> To direct traffic flows to hw terminated tunnel endpoint the rte_flow
> >> API is enhanced to add a new flow item type. This contains a pointer
> >> to the TEP context as well as the overlay flow id to which the traffic flow is
> associated.
> >>
> >> struct rte_flow_item_tep {
> >>                 struct rte_tep *tep;
> >>                 uint32_t flow_id;
> >> }
> >
> > Can you provide more detailed definition about the flow id ? to which field
> from the packet headers it refers to?
> > On your below examples it looks like it is to match the VXLAN vni in case of
> VXLAN, what about the other protocols? And also, why not using the already
> exists VXLAN item?
> 
> I have only been looking initially at couple of the tunnel endpoint procotols,
> namely Geneve, NvGRE, and VxLAN, but the idea here is to allow the user to
> define the VNI in the case of Geneve and VxLAN and the VSID in the case of
> NvGRE on a per flow basis, as per my understanding these are used to
> identify the source/destination hosts on the overlay network independently
> from the endpoint there are transported across.
> 
> The VxLAN item is used in the creation of the TEP object, using the TEP
> object just removes the need for the user to constantly redefine all the
> tunnel parameters and also I think dependent on the hw implementation it
> may simplify the drivers work if it know the exact endpoint the actions is for
> instead of having to look it up on each flow addition.
> 
> >
> > Generally I like the idea of separating the encap/decap context from the
> action. However looks like the rte_flow_item has double meaning on this
> RFC, once for the classification and once for the action.
> >  From the top of my head I would think of an API which separate those, and
> re-use the existing flow items. Something like:
> >
> >   struct rte_flow_item pattern[] = {
> >                  { set of already exists pattern  },
> >                  { ... },
> >                  { .type = RTE_FLOW_ITEM_TYPE_END } };
> >
> > encap_ctx = create_enacap_context(pattern)
> >
> > rte_flow_action actions[] = {
> > 	{ .type RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx} }
> 
> I not sure I fully understand what you're asking here, but in general for encap
> you only would define the inner part of the packet in the match pattern
> criteria and the actual outer tunnel headers would be defined in the action.
> 
> I guess there is some replication in the decap side as proposed, as the TEP
> object is used in both the pattern and the action, possibly you could get away
> with having no TEP object defined in the action data, but I prefer keeping the
> API symmetrical for encap/decap actions at the shake of some extra
> verbosity.
> 
> >
> ...
> >


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-01-11 21:45     ` John Daley (johndale)
@ 2018-01-16  8:22       ` Shahaf Shuler
  2018-01-23 15:35         ` Doherty, Declan
  2018-01-23 14:46       ` Doherty, Declan
  1 sibling, 1 reply; 15+ messages in thread
From: Shahaf Shuler @ 2018-01-16  8:22 UTC (permalink / raw)
  To: John Daley (johndale), Doherty, Declan, dev; +Cc: Adrien Mazarguil, Yuanhan Liu

Thursday, January 11, 2018 11:45 PM, John Daley:
> Hi Declan and Shahaf,
> 
> > I can't see how the existing
> > ethdev API could be used for statistics as a single ethdev could be
> > supporting may concurrent TEPs, therefore we would either need to use
> > the extended stats with many entries, one for each TEP, or if we treat
> > a TEP as an attribute of a port in a similar manner to the way
> > rte_security manages an IPsec SA, the state of each TEP can be
> > monitored and managed independently of both the overall port or the
> flows being transported on that endpoint.
> 
> Assuming we can define one rte_flow rule per TEP, does what you propose
> give us anything more than just using the COUNT action?

I agree with John here, and I am also not sure we need such an assumption.

If I get it right, the API proposed here is to have a tunnel endpoint which is a logical port on top of an ethdev port. The TEP is able to receive and monitor some specific tunneled traffic, for example VXLAN, GENEVE and more.
For example, a VXLAN TEP can have multiple flows with different VNIs, all under the same context.

Now, with the current rte_flow APIs, we can do exactly the same and give the application the full flexibility to group the tunnel flows into a logical TEP.
With this suggestion the application will:
1. Create rte_flow rules for the pattern it want to receive.
2. In case it is interested in counting, a COUNT action will be added to the flow.
3. In case header manipulation is required, a DECAP/ENCAP/REWRITE action will be added to the flow. 
4. Grouping of flows into a logical TEP will be done in the application layer simply by keeping the relevant rte_flow rules in some dedicated struct. With it, create/destroy TEP can be translated to create/destroy of the flow rules. Statistics query can be done by querying each flow's count and summing (a sketch of this follows below). Note that some devices can support the same counter for multiple flows; even though this is not yet exposed in rte_flow, it can be an interesting optimization.
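
A minimal sketch of that application-level grouping, assuming the 17.11-era rte_flow_query() signature and that every per-TEP rule was created with a COUNT action (names here are illustrative only):

#include <stdint.h>
#include <rte_flow.h>

#define MAX_RULES_PER_TEP 8

/* application-level TEP: just a container for the rte_flow rules that make
 * up one logical tunnel endpoint */
struct app_tep {
        uint16_t port_id;
        struct rte_flow *rules[MAX_RULES_PER_TEP];
        unsigned int nb_rules;
};

/* sum the hit counters of all rules belonging to the TEP */
static int
app_tep_stats(struct app_tep *tep, uint64_t *pkts)
{
        struct rte_flow_error err;
        unsigned int i;

        *pkts = 0;
        for (i = 0; i < tep->nb_rules; i++) {
                struct rte_flow_query_count cnt = { .reset = 0 };

                if (rte_flow_query(tep->port_id, tep->rules[i],
                                   RTE_FLOW_ACTION_TYPE_COUNT, &cnt, &err) != 0)
                        return -1;
                *pkts += cnt.hits;
        }
        return 0;
}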

> >
> > > As for the capabilities - what specifically you had in mind? The
> > > current
> > usage you show with tep is with rte_flow rules. There are no
> > capabilities currently for rte_flow supported actions/pattern. To
> > check such capabilities application uses rte_flow_validate.
> >
> > I envisaged that the application should be able to see if an ethdev
> > can support TEP in the rx/tx offloads, and then the
> > rte_tep_capabilities would allow applications to query what tunnel
> > endpoint protocols are supported etc. I would like a simple mechanism
> > to allow users to see if a particular tunnel endpoint type is
> > supported without having to build actual flows to validate.
> 
> I can see the value of that, but in the end wouldn't the API call
> rte_flow_validate anyways? Maybe we don't add the layer now or maybe it
> doesn't really belong in DPDK? I'm in favor of deferring the capabilities API
> until we know it's really needed.  I hate to see special capabilities APIs start
> sneaking in after we decided to go the rte_flow_validate route and users are
> starting to get used to it.

I don't see how it is different from any other rte_flow creation.
We don't hold caps for a device's ability to filter packets according to VXLAN or GENEVE items. Why should we start now?

We already have rte_flow_validate. I think part of the reason for it was that the number of different capabilities possible with rte_flow is huge. I think this is also the case with the TEP capabilities (even though it is still not clear to me what exactly they will include). 

> >
> > > Regarding the creation/destroy of tep. Why not simply use rte_flow
> > > API
> > and avoid this extra control?
> > > For example - with 17.11 APIs, application can put the port in
> > > isolate mode,
> > and insert a flow_rule to catch only IPv4 VXLAN traffic and direct to
> > some queue/do RSS. Such operation, per my understanding, will create a
> > tunnel endpoint. What are the down sides of doing it with the current
> APIs?
> >
> > That doesn't enable encapsulation and decapsulation of the outer
> > tunnel endpoint in the hw as far as I know. Apart from the inability
> > to monitor the endpoint statistics I mentioned above. It would also
> > require that you redefine the endpoints parameters ever time to you
> > wish to add a new flow to it. I think the having the rte_tep object
> > semantics should also simplify the ability to enable a full vswitch
> > offload of TEP where the hw is handling both encap/decap and switching to
> a particular port.
> 
> If we have the ingress/decap and egress/encap actions and 1 rte_flow rule
> per TEP and use the COUNT action, I think we get all but the last bit. For that,
> perhaps the application could keep  ingress and egress rte_flow template for
> each tunnel type (VxLAN, GRE, ..). Then copying the template and filling in
> the outer packet info and tunnel Id is all that would be required. We could
> also define these in rte_flow.h?
> 
> >
> > >
> > >>
> > >>
> > >> To direct traffic flows to hw terminated tunnel endpoint the
> > >> rte_flow API is enhanced to add a new flow item type. This contains
> > >> a pointer to the TEP context as well as the overlay flow id to
> > >> which the traffic flow is
> > associated.
> > >>
> > >> struct rte_flow_item_tep {
> > >>                 struct rte_tep *tep;
> > >>                 uint32_t flow_id;
> > >> }
> > >
> > > Can you provide more detailed definition about the flow id ? to
> > > which field
> > from the packet headers it refers to?
> > > On your below examples it looks like it is to match the VXLAN vni in
> > > case of
> > VXLAN, what about the other protocols? And also, why not using the
> > already exists VXLAN item?
> >
> > I have only been looking initially at couple of the tunnel endpoint
> > procotols, namely Geneve, NvGRE, and VxLAN, but the idea here is to
> > allow the user to define the VNI in the case of Geneve and VxLAN and
> > the VSID in the case of NvGRE on a per flow basis, as per my
> > understanding these are used to identify the source/destination hosts
> > on the overlay network independently from the endpoint there are
> transported across.
> >
> > The VxLAN item is used in the creation of the TEP object, using the
> > TEP object just removes the need for the user to constantly redefine
> > all the tunnel parameters and also I think dependent on the hw
> > implementation it may simplify the drivers work if it know the exact
> > endpoint the actions is for instead of having to look it up on each flow
> addition.
> >
> > >
> > > Generally I like the idea of separating the encap/decap context from
> > > the
> > action. However looks like the rte_flow_item has double meaning on
> > this RFC, once for the classification and once for the action.
> > >  From the top of my head I would think of an API which separate
> > > those, and
> > re-use the existing flow items. Something like:
> > >
> > >   struct rte_flow_item pattern[] = {
> > >                  { set of already exists pattern  },
> > >                  { ... },
> > >                  { .type = RTE_FLOW_ITEM_TYPE_END } };
> > >
> > > encap_ctx = create_enacap_context(pattern)
> > >
> > > rte_flow_action actions[] = {
> > > 	{ .type RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx} }
> >
> > I not sure I fully understand what you're asking here, but in general
> > for encap you only would define the inner part of the packet in the
> > match pattern criteria and the actual outer tunnel headers would be
> defined in the action.
> >
> > I guess there is some replication in the decap side as proposed, as
> > the TEP object is used in both the pattern and the action, possibly
> > you could get away with having no TEP object defined in the action
> > data, but I prefer keeping the API symmetrical for encap/decap actions
> > at the shake of some extra verbosity.
> >
> > >
> > ...
> > >


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-01-11 21:45     ` John Daley (johndale)
  2018-01-16  8:22       ` Shahaf Shuler
@ 2018-01-23 14:46       ` Doherty, Declan
  1 sibling, 0 replies; 15+ messages in thread
From: Doherty, Declan @ 2018-01-23 14:46 UTC (permalink / raw)
  To: John Daley (johndale), Shahaf Shuler, dev

On 11/01/2018 9:45 PM, John Daley (johndale) wrote:
> Hi Declan and Shahaf,
> 
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Doherty, Declan
>> Sent: Tuesday, January 09, 2018 9:31 AM
>> To: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
>>
>> On 24/12/2017 5:30 PM, Shahaf Shuler wrote:
>>> Hi Declan,
>>>
>>
>> Hey Shahaf, apologies for the delay in responding, I have been out of office
>> for the last 2 weeks.
>>
>>> Friday, December 22, 2017 12:21 AM, Doherty, Declan:
>>>> This RFC contains a proposal to add a new tunnel endpoint API to DPDK
>>>> that when used in conjunction with rte_flow enables the configuration
>>>> of inline data path encapsulation and decapsulation of tunnel
>>>> endpoint network overlays on accelerated IO devices.
>>>>
>>>> The proposed new API would provide for the creation, destruction, and
>>>> monitoring of a tunnel endpoint in supporting hw, as well as
>>>> capabilities APIs to allow the acceleration features to be discovered by
>> applications.
>>>>
>> ....
>>>
>>>
>>> Am not sure I understand why there is a need for the above control
>> methods.
>>> Are you introducing a new "tep device" ? > As the tunnel endpoint is
>>> sending and receiving Ethernet packets from
>> the network I think it should still be counted as Ethernet device but with
>> more capabilities (for example it supported encap/decap etc..), therefore it
>> should use the Ethdev layer API to query statistics (for example).
>>
>> No, the new APIs are only intended to be a method of creating, monitoring
>> and deleting tunnel-endpoints on an existing ethdev. The rationale for APIs
>> separate to rte_flow are the same as that in the rte_security, there is not a
>> 1:1 mapping of TEPs to flows. Many flows (VNI's in VxLAN for example) can
>> be originate/terminate on the same TEP, therefore managing the TEP
>> independently of the flows being transmitted on it is important to allow
>> visibility of that endpoint stats for example.
> 
> I don't quite understand what you mean by tunnel and flow here. Can you define exactly what you mean? Flow is an overloaded word in our world. I think that defining it will make understanding the RFC a little easier.
> 

Hey John,

I think that's a good idea. For me the tunnel endpoint defines the l3/l4 
parameters of the endpoint, so for VxLAN over IPv4 this would include 
the IPv4, UDP and VxLAN headers excluding the VNI (flow id). I'm not 
sure whether it makes more sense for each TEP to contain the VNI (flow 
id) or not. I believe the model currently used by OvS is similar to the 
RFC in that many VNIs can be terminated in the same TEP port context.

In terms of flow definitions, for encapsulated ingress I would see the 
definition of a flow to include the l2 and l3/l4 headers of the outer 
packet, including the flow id of the tunnel, and optionally any or all 
of the inner headers. For non-encapsulated egress traffic the flow 
defines any combination of the l2, l3 and l4 headers as defined by the 
user.
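
As a rough sketch of those two flow definitions, using the item types 
proposed in this RFC (all variable names and values below are purely 
illustrative):

struct rte_flow_item_eth outer_eth = { .src = s_addr, .dst = d_addr, .type = ether_type };
struct rte_flow_item_tep tep_item = { .tep = tep, .flow_id = vni };
struct rte_flow_item_eth inner_eth = { .src = guest_mac, .dst = peer_mac, .type = ether_type };
struct rte_flow_item_ipv4 inner_ipv4 = { .hdr = { .src_addr = inner_saddr, .dst_addr = inner_daddr } };

/* encapsulated ingress: outer l2 + TEP (including flow id) + optional inner headers */
struct rte_flow_item ingress_pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth },
        { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &inner_ipv4 }, /* optional inner match */
        { .type = RTE_FLOW_ITEM_TYPE_END } };

/* non-encapsulated egress: any combination of l2/l3/l4 headers */
struct rte_flow_item egress_pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &inner_ipv4 },
        { .type = RTE_FLOW_ITEM_TYPE_END } };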

> Taking VxLAN, I think of the tunnel as including up through the VxLAN header, including the VNI. If you go by this definition, I would consider a flow to be all packets with the same VNI and the same 5-tuple hash of the inner packet. Is this what you mean by tunnel (or TEP) and flow here?

Yes, with the exception that I had excluded the VNI (flow id) from the 
TEP definition and made it part of the flow, but otherwise essentially yes.

> 
> With these definitions, VPP for example might need up to a couple thousand TEPs on an interface and each TEP could have hundreds or thousands of flows. It would be quite possible to have 1 rte flow rule per TEP (or 2- ingress/decap and egress/encap). The COUNT action could be used to count the number of packets through each TEP. Is this adequate, or are you proposing that we need a mechanism to get stats of flows within each TEP? Is that the main point of the API? Assuming no need for stats on a per TEP/flow basis is there anything else the API adds?

Yes, the basis of having a TEP as a separate API is to allow flows to be 
tracked independently of the overlay they may be transported on. I 
believe this will be a requirement for acceleration of any vswitch, as 
we could have a case where flows bypass the host vswitch completely and 
are encapsulated/decapsulated and switched in hw directly between the 
guest and the physical port. OvS currently can track both flow and TEP 
statistics and I think we need to support this model.
>> I can't see how the existing
>> ethdev API could be used for statistics as a single ethdev could be supporting
>> may concurrent TEPs, therefore we would either need to use the extended
>> stats with many entries, one for each TEP, or if we treat a TEP as an attribute
>> of a port in a similar manner to the way rte_security manages an IPsec SA,
>> the state of each TEP can be monitored and managed independently of both
>> the overall port or the flows being transported on that endpoint.
> 
> Assuming we can define one rte_flow rule per TEP, does what you propose give us anything more than just using the COUNT action?

This still won't allow individual flow statistics to be tracked in the 
full offload model. As you state above, you could have a couple of 
thousand TEPs terminated on a single or small number of physical ports 
with tens or hundreds of thousands of flows on each TEP. I think for 
management of the system we need to be able to monitor all of these 
statistics independently.
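
To make the two levels concrete, a sketch of how they could be queried 
(note the contents of rte_tep_stats are not yet defined in this RFC, so 
the member names mentioned below are illustrative only):

struct rte_tep_stats tep_stats;

/* endpoint level: totals for the whole TEP, summed in hw across all of
 * its flows (rx_pkts/tx_pkts are illustrative member names only) */
rte_tep_stats_get(port_id, tep, &tep_stats);

/* flow level: each rte_flow rule on the TEP can additionally carry a
 * COUNT action and be queried individually with rte_flow_query() */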

>>
>>> As for the capabilities - what specifically you had in mind? The current
>> usage you show with tep is with rte_flow rules. There are no capabilities
>> currently for rte_flow supported actions/pattern. To check such capabilities
>> application uses rte_flow_validate.
>>
>> I envisaged that the application should be able to see if an ethdev can
>> support TEP in the rx/tx offloads, and then the rte_tep_capabilities would
>> allow applications to query what tunnel endpoint protocols are supported
>> etc. I would like a simple mechanism to allow users to see if a particular
>> tunnel endpoint type is supported without having to build actual flows to
>> validate.
> 
> I can see the value of that, but in the end wouldn't the API call rte_flow_validate anyways? Maybe we don't add the layer now or maybe it doesn't really belong in DPDK? I'm in favor of deferring the capabilities API until we know it's really needed.  I hate to see special capabilities APIs start sneaking in after we decided to go the rte_flow_validate route and users are starting to get used to it.

Flow validation will still always be required, but I think having a rich 
capability API will also be very important to allow an application's 
control plane to figure out what accelerations are available and define 
the application pipeline accordingly. I can envisage scenarios where on 
the same platform you could have two devices which both support TEP in 
hw but only one also supports switching; the way the host application 
would use these two devices may be radically different, and 
rte_flow_validate doesn't allow that sort of capability to be clearly 
discovered. This may be as simple as a new feature bit in the ethdev.
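
As an illustration of the sort of check I have in mind (the contents of 
rte_tep_capabilities are not defined yet, so the tep_types bitmask below 
is only one possible shape for it):

struct rte_tep_capabilities caps;

if (rte_tep_capabilities_get(port_id, &caps) != 0)
        return -1; /* no TEP offload on this port */

/* tep_types as a bitmask of enum rte_tep_type is hypothetical */
if (!(caps.tep_types & (1 << RTE_TEP_TYPE_VXLAN)))
        return -1; /* VxLAN endpoints not supported, use the sw path */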

>>
>>> Regarding the creation/destroy of tep. Why not simply use rte_flow API
>> and avoid this extra control?
>>> For example - with 17.11 APIs, application can put the port in isolate mode,
>> and insert a flow_rule to catch only IPv4 VXLAN traffic and direct to some
>> queue/do RSS. Such operation, per my understanding, will create a tunnel
>> endpoint. What are the down sides of doing it with the current APIs?
>>
>> That doesn't enable encapsulation and decapsulation of the outer tunnel
>> endpoint in the hw as far as I know. Apart from the inability to monitor the
>> endpoint statistics I mentioned above. It would also require that you
>> redefine the endpoints parameters ever time to you wish to add a new flow
>> to it. I think the having the rte_tep object semantics should also simplify the
>> ability to enable a full vswitch offload of TEP where the hw is handling both
>> encap/decap and switching to a particular port.
> 
> If we have the ingress/decap and egress/encap actions and 1 rte_flow rule per TEP and use the COUNT action, I think we get all but the last bit. For that, perhaps the application could keep  ingress and egress rte_flow template for each tunnel type (VxLAN, GRE, ..). Then copying the template and filling in the outer packet info and tunnel Id is all that would be required. We could also define these in rte_flow.h?

Again, the main issue here is that one flow rule per TEP doesn't work 
when the device also supports switching on the inner flow.

> 
>>
>>>
>>>>
>>>>
>>>> To direct traffic flows to hw terminated tunnel endpoint the rte_flow
>>>> API is enhanced to add a new flow item type. This contains a pointer
>>>> to the TEP context as well as the overlay flow id to which the traffic flow is
>> associated.
>>>>
>>>> struct rte_flow_item_tep {
>>>>                  struct rte_tep *tep;
>>>>                  uint32_t flow_id;
>>>> }
>>>
>>> Can you provide more detailed definition about the flow id ? to which field
>> from the packet headers it refers to?
>>> On your below examples it looks like it is to match the VXLAN vni in case of
>> VXLAN, what about the other protocols? And also, why not using the already
>> exists VXLAN item?
>>
>> I have only been looking initially at couple of the tunnel endpoint procotols,
>> namely Geneve, NvGRE, and VxLAN, but the idea here is to allow the user to
>> define the VNI in the case of Geneve and VxLAN and the VSID in the case of
>> NvGRE on a per flow basis, as per my understanding these are used to
>> identify the source/destination hosts on the overlay network independently
>> from the endpoint there are transported across.
>>
>> The VxLAN item is used in the creation of the TEP object, using the TEP
>> object just removes the need for the user to constantly redefine all the
>> tunnel parameters and also I think dependent on the hw implementation it
>> may simplify the drivers work if it know the exact endpoint the actions is for
>> instead of having to look it up on each flow addition.
>>
>>>
>>> Generally I like the idea of separating the encap/decap context from the
>> action. However looks like the rte_flow_item has double meaning on this
>> RFC, once for the classification and once for the action.
>>>   From the top of my head I would think of an API which separate those, and
>> re-use the existing flow items. Something like:
>>>
>>>    struct rte_flow_item pattern[] = {
>>>                   { set of already exists pattern  },
>>>                   { ... },
>>>                   { .type = RTE_FLOW_ITEM_TYPE_END } };
>>>
>>> encap_ctx = create_enacap_context(pattern)
>>>
>>> rte_flow_action actions[] = {
>>> 	{ .type RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx} }
>>
>> I not sure I fully understand what you're asking here, but in general for encap
>> you only would define the inner part of the packet in the match pattern
>> criteria and the actual outer tunnel headers would be defined in the action.
>>
>> I guess there is some replication in the decap side as proposed, as the TEP
>> object is used in both the pattern and the action, possibly you could get away
>> with having no TEP object defined in the action data, but I prefer keeping the
>> API symmetrical for encap/decap actions at the shake of some extra
>> verbosity.
>>
>>>
>> ...
>>>
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-01-16  8:22       ` Shahaf Shuler
@ 2018-01-23 15:35         ` Doherty, Declan
  2018-02-01 19:59           ` Shahaf Shuler
  0 siblings, 1 reply; 15+ messages in thread
From: Doherty, Declan @ 2018-01-23 15:35 UTC (permalink / raw)
  To: Shahaf Shuler, John Daley (johndale), dev; +Cc: Adrien Mazarguil, Yuanhan Liu

On 16/01/2018 8:22 AM, Shahaf Shuler wrote:
> Thursday, January 11, 2018 11:45 PM, John Daley:
>> Hi Declan and Shahaf,
>>
>>> I can't see how the existing
>>> ethdev API could be used for statistics as a single ethdev could be
>>> supporting may concurrent TEPs, therefore we would either need to use
>>> the extended stats with many entries, one for each TEP, or if we treat
>>> a TEP as an attribute of a port in a similar manner to the way
>>> rte_security manages an IPsec SA, the state of each TEP can be
>>> monitored and managed independently of both the overall port or the
>> flows being transported on that endpoint.
>>
>> Assuming we can define one rte_flow rule per TEP, does what you propose
>> give us anything more than just using the COUNT action?
> 
> I agree with John here, and I also not sure we need such assumption.
> 
> If I get it right, the API proposed here is to have a tunnel endpoint which is a logical port on top of ethdev port. the TEP is able to receive and monitor some specific tunneled traffic, for example VXLAN, GENEVE and more.
> For example, VXLAN TEP can have multiple flows with different VNIs all under the same context.
> 
> Now, with the current rte_flow APIs, we can do exactly the same and give the application the full flexibility to group the tunnel flows into logical TEP.
> On this suggestion application will:
> 1. Create rte_flow rules for the pattern it want to receive.
> 2. In case it is interested in counting, a COUNT action will be added to the flow.
> 3. In case header manipulation is required, a DECAP/ENCAP/REWRITE action will be added to the flow.
> 4. Grouping of flows into a logical TEP will be done on the application layer simply by keeping the relevant rte_flow rules in some dedicated struct. With it, create/destroy TEP can be translated to create/destroy the flow rules. Statistics query can be done be querying each flow count and sum. Note that some devices can support the same counter for multiple flows. Even though it is not yet exposed in rte_flow this can be an interesting optimization.

As I responded in John's mail, I think this approach fails on devices 
which also support switching offload. As the flows never hit the host 
application configuring the TEP and flows, there is no easy way to sum 
those statistics; also, flows are transitory at runtime so it would not 
be possible to keep accurate statistics over a period of time.


> 
>>>
>>>> As for the capabilities - what specifically you had in mind? The
>>>> current
>>> usage you show with tep is with rte_flow rules. There are no
>>> capabilities currently for rte_flow supported actions/pattern. To
>>> check such capabilities application uses rte_flow_validate.
>>>
>>> I envisaged that the application should be able to see if an ethdev
>>> can support TEP in the rx/tx offloads, and then the
>>> rte_tep_capabilities would allow applications to query what tunnel
>>> endpoint protocols are supported etc. I would like a simple mechanism
>>> to allow users to see if a particular tunnel endpoint type is
>>> supported without having to build actual flows to validate.
>>
>> I can see the value of that, but in the end wouldn't the API call
>> rte_flow_validate anyways? Maybe we don't add the layer now or maybe it
>> doesn't really belong in DPDK? I'm in favor of deferring the capabilities API
>> until we know it's really needed.  I hate to see special capabilities APIs start
>> sneaking in after we decided to go the rte_flow_validate route and users are
>> starting to get used to it.
> 
> I don't see how it is different from any other rte_flow creation.
> We don't hold caps for device ability to filter packets according to VXLAN or GENEVE items. Why we should start now?

I don't know, possibly if it makes adoption of the features easier for 
the end user.

> 
> We have already the rte_flow_veirfy. I think part of the reasons for it was that the number of different capabilities possible with rte_flow is huge. I think this also the case with the TEP capabilities (even though It is still not clear to me what exactly they will include).

It may be that we only need to advertise that we are capable of 
encap/decap services, but it would be good to have input from downstream 
users on what they would like to see.

> 
>>>
>>>> Regarding the creation/destroy of tep. Why not simply use rte_flow
>>>> API
>>> and avoid this extra control?
>>>> For example - with 17.11 APIs, application can put the port in
>>>> isolate mode,
>>> and insert a flow_rule to catch only IPv4 VXLAN traffic and direct to
>>> some queue/do RSS. Such operation, per my understanding, will create a
>>> tunnel endpoint. What are the down sides of doing it with the current
>> APIs?
>>>
>>> That doesn't enable encapsulation and decapsulation of the outer
>>> tunnel endpoint in the hw as far as I know. Apart from the inability
>>> to monitor the endpoint statistics I mentioned above. It would also
>>> require that you redefine the endpoints parameters ever time to you
>>> wish to add a new flow to it. I think the having the rte_tep object
>>> semantics should also simplify the ability to enable a full vswitch
>>> offload of TEP where the hw is handling both encap/decap and switching to
>> a particular port.
>>
>> If we have the ingress/decap and egress/encap actions and 1 rte_flow rule
>> per TEP and use the COUNT action, I think we get all but the last bit. For that,
>> perhaps the application could keep  ingress and egress rte_flow template for
>> each tunnel type (VxLAN, GRE, ..). Then copying the template and filling in
>> the outer packet info and tunnel Id is all that would be required. We could
>> also define these in rte_flow.h?
>>
>>>
>>>>
>>>>>
>>>>>
>>>>> To direct traffic flows to hw terminated tunnel endpoint the
>>>>> rte_flow API is enhanced to add a new flow item type. This contains
>>>>> a pointer to the TEP context as well as the overlay flow id to
>>>>> which the traffic flow is
>>> associated.
>>>>>
>>>>> struct rte_flow_item_tep {
>>>>>                  struct rte_tep *tep;
>>>>>                  uint32_t flow_id;
>>>>> }
>>>>
>>>> Can you provide more detailed definition about the flow id ? to
>>>> which field
>>> from the packet headers it refers to?
>>>> On your below examples it looks like it is to match the VXLAN vni in
>>>> case of
>>> VXLAN, what about the other protocols? And also, why not using the
>>> already exists VXLAN item?
>>>
>>> I have only been looking initially at couple of the tunnel endpoint
>>> procotols, namely Geneve, NvGRE, and VxLAN, but the idea here is to
>>> allow the user to define the VNI in the case of Geneve and VxLAN and
>>> the VSID in the case of NvGRE on a per flow basis, as per my
>>> understanding these are used to identify the source/destination hosts
>>> on the overlay network independently from the endpoint there are
>> transported across.
>>>
>>> The VxLAN item is used in the creation of the TEP object, using the
>>> TEP object just removes the need for the user to constantly redefine
>>> all the tunnel parameters and also I think dependent on the hw
>>> implementation it may simplify the drivers work if it know the exact
>>> endpoint the actions is for instead of having to look it up on each flow
>> addition.
>>>
>>>>
>>>> Generally I like the idea of separating the encap/decap context from
>>>> the
>>> action. However looks like the rte_flow_item has double meaning on
>>> this RFC, once for the classification and once for the action.
>>>>   From the top of my head I would think of an API which separate
>>>> those, and
>>> re-use the existing flow items. Something like:
>>>>
>>>>    struct rte_flow_item pattern[] = {
>>>>                   { set of already exists pattern  },
>>>>                   { ... },
>>>>                   { .type = RTE_FLOW_ITEM_TYPE_END } };
>>>>
>>>> encap_ctx = create_enacap_context(pattern)
>>>>
>>>> rte_flow_action actions[] = {
>>>> 	{ .type RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx} }
>>>
>>> I not sure I fully understand what you're asking here, but in general
>>> for encap you only would define the inner part of the packet in the
>>> match pattern criteria and the actual outer tunnel headers would be
>> defined in the action.
>>>
>>> I guess there is some replication in the decap side as proposed, as
>>> the TEP object is used in both the pattern and the action, possibly
>>> you could get away with having no TEP object defined in the action
>>> data, but I prefer keeping the API symmetrical for encap/decap actions
>>> at the shake of some extra verbosity.
>>>
>>>>
>>> ...
>>>>
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-01-11 21:44 ` John Daley (johndale)
@ 2018-01-23 15:50   ` Doherty, Declan
  0 siblings, 0 replies; 15+ messages in thread
From: Doherty, Declan @ 2018-01-23 15:50 UTC (permalink / raw)
  To: John Daley (johndale), dev

On 11/01/2018 9:44 PM, John Daley (johndale) wrote:
> Hi,
> One comment on DECAP action and a "feature request".  I'll also reply to the top of thread discussion separately. Thanks for the RFC Declan!
> 
> Feature request associated with ENCAP action:
> 
> VPP (and probably other apps) would like the ability to simply specify an independent tunnel ID as part of egress match criteria in an rte_flow rule. Then egress packets could specify a tunnel ID  and valid flag in the mbuf. If it matched the rte_flow tunnel ID item, a simple lookup in the nic could be done and the associated actions (particularly ENCAP) executed. The application already know the tunnel that the packet is associated with so no need to have the nic do matching on a header pattern. Plus it's possible that packet headers alone are not enough to determine the correct encap action (the bridge where the packet came from might be required).
> 
> This would require a new mbuf field to specify the tunnel ID (maybe in tx_offload) and a valid flag.  It would also require a new rte flow item type for matching the tunnel ID (like RTE_FLOW_ITEM_TYPE_META_TUNNEL_ID).
> 
> Is something like this being considered by others? If not, should it be part of this RFC or a new one? I think this would be the 1st meta-data match criteria in rte_flow, but I could see others following.

This sounds similar to what we needed to do in rte_security to support 
metadata for inline crypto on the ixgbe. I wasn't aware of devices which 
supported this type of function for overlays, but it definitely sounds 
like we need to consider it here.
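
To make sure I follow, a sketch of how I read your suggestion (the meta 
tunnel ID item and its conf structure below are hypothetical, they do 
not exist in rte_flow today):

/* hypothetical meta item matching the tunnel ID the application sets in
 * the mbuf on transmit */
struct rte_flow_item_meta_tunnel_id {
        uint32_t id;
};

struct rte_flow_item_meta_tunnel_id tid = { .id = tunnel_id };

struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_META_TUNNEL_ID, .spec = &tid },
        { .type = RTE_FLOW_ITEM_TYPE_END } };

/* on match the nic executes the associated actions, e.g. ENCAP as in the
 * egress example of the RFC quoted below */
struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
        { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
        { .type = RTE_FLOW_ACTION_TYPE_END } };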

> 
> -johnd
> 
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Doherty, Declan
>> Sent: Thursday, December 21, 2017 2:21 PM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
>>
>> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that
>> when used in conjunction with rte_flow enables the configuration of inline
>> data path encapsulation and decapsulation of tunnel endpoint network
>> overlays on accelerated IO devices.
>>
>> The proposed new API would provide for the creation, destruction, and
>> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs
>> to allow the acceleration features to be discovered by applications.
>>
>> /** Tunnel Endpoint context, opaque structure */ struct rte_tep;
>>
>> enum rte_tep_type {
>>                 RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>>                 RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>>                 ...
>> };
>>
>> /** Tunnel Endpoint Attributes */
>> struct rte_tep_attr {
>>                 enum rte_type_type type;
>>
>>                 /* other endpoint attributes here */ }
>>
>> /**
>> * Create a tunnel end-point context as specified by the flow attribute and
>> pattern
>> *
>> * @param   port_id     Port identifier of Ethernet device.
>> * @param   attr        Flow rule attributes.
>> * @param   pattern     Pattern specification by list of rte_flow_items.
>> * @return
>> *  - On success returns pointer to TEP context
>> *  - On failure returns NULL
>> */
>> struct rte_tep *rte_tep_create(uint16_t port_id,
>>                                struct rte_tep_attr *attr, struct rte_flow_item pattern[])
>>
>> /**
>> * Destroy an existing tunnel end-point context. All the end-points context
>> * will be destroyed, so all active flows using tep should be freed before
>> * destroying context.
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   tep        Tunnel endpoint context
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
>>
>> /**
>> * Get tunnel endpoint statistics
>> *
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   tep        Tunnel endpoint context
>> * @param   stats      Tunnel endpoint statistics
>> *
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> Int
>> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>>                                struct rte_tep_stats *stats)
>>
>> /**
>> * Get ports tunnel endpoint capabilities
>> *
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   capabilities        Tunnel endpoint capabilities
>> *
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> int
>> rte_tep_capabilities_get(uint16_t port_id,
>>                                struct rte_tep_capabilities *capabilities)
>>
>>
>> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
>> enhanced to add a new flow item type. This contains a pointer to the TEP
>> context as well as the overlay flow id to which the traffic flow is associated.
>>
>> struct rte_flow_item_tep {
>>                 struct rte_tep *tep;
>>                 uint32_t flow_id;
>> }
>>
>> Also 2 new generic actions types are added encapsulation and decapsulation.
>>
>> RTE_FLOW_ACTION_TYPE_ENCAP
>> RTE_FLOW_ACTION_TYPE_DECAP
>>
>> struct rte_flow_action_encap {
>>                 struct rte_flow_item *item; }
>>
>> struct rte_flow_action_decap {
>>                 struct rte_flow_item *item; }
>>
>> The following section outlines the intended usage of the new APIs and then
>> how they are combined with the existing rte_flow APIs.
>>
>> Tunnel endpoints are created on logical ports which support the capability
>> using rte_tep_create() using a combination of TEP attributes and
>> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is being
>> defined.
>> The attrs parameter sets the TEP type, and could be used for other possible
>> attributes.
>>
>> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
>>
>> The values for the headers which make up the tunnel endpointr are then
>> defined using spec parameter in the rte flow items (IPv4, UDP and VxLAN in
>> this case)
>>
>> struct rte_flow_item_ipv4 ipv4_item = {
>>                 .hdr = { .src_addr = saddr, .dst_addr = daddr } };
>>
>> struct rte_flow_item_udp udp_item = {
>>                 .hdr = { .src_port = sport, .dst_port = dport } };
>>
>> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END } };
>>
>> The tunnel endpoint can then be create on the port. Whether or not any hw
>> configuration is required at this point would be hw dependent, but if not the
>> context for the TEP is available for use in programming flow, so the
>> application is not forced to redefine the TEP parameters on each flow
>> addition.
>>
>> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
>>
>> Once the tep context is created flows can then be directed to that endpoint
>> for processing. The following sections will outline how the author envisage
>> flow programming will work and also how TEP acceleration can be combined
>> with other accelerations.
>>
>>
>> Ingress TEP decapsulation, mark and forward to queue:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> The flows definition for TEP decapsulation actions should specify the full
>> outer packet to be matched at a minimum. The outer packet definition
>> should match the tunnel definition in the tep context and the tep flow id.
>> This example shows describes matching on the outer, marking the packet
>> with the VXLAN VNI and directing to a specified queue of the port.
>>
>> Source Packet
>>
>>         Decapsulate Outer Hdr
>>       /                       \                                    decap outer crc
>>      /                         \                                    /          \
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>      | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>
>> /* Flow Attributes/Items Definitions */
>>
>> struct rte_flow_attr attr = { .ingress = 1 };
>>
>> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type =
>> ether_type }; struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END } };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_decap decap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
>>
>> struct rte_flow_action_decap decap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item };
>>
>> struct rte_flow_action_queue queue_action = { .index = qid };
>>
>> struct rte_flow_action_port mark_action = { .index = vni };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END } };
>>
> Does the conf for  RTE_FLOW_ACTION_TYPE_DECAP action specify the first pattern to decap up to? In the above, is the 1st decap action needed? Wouldn't the 2nd action decap up to the matching vni?

I hadn't looked at it like that, only as an explicit ordered list of 
headers to decap, but viewing it as a pattern to decap up to also makes 
sense.
> On our nic, we would have to translate the decap actions into a (level, offset) pair which requires a lot of effort. Since the packet is already matched perhaps 'struct rte_flow_item' is not the right thing to pass to the decap action and a simple (layer, offset) could be used instead? E.g to decap up to the inner Ethernet header of a VxLAN packet:
> struct rte_flow_action_decap {
>                 uint32_t level;
> 	uint8_t offset;
> }
> struct rte_flow_action_decap_tep {
> 	.level = RTE_PTYPE_L4_UDP,
> 	.offset = sizeof(struct vxlan_hdr)
> }
> 
> Using RTE_PTYPE... is just for illustration- we might to define our own layers in rte_flow.h.  You could specify inner packet layers, and the offset need not be restricted to the size of the header so that  decap to an absolute offset could be allowed, e.g:
> struct rte_flow_action_decap_42 {
> 	.level = RTE_PTYPE_L2_ETHER,
> 	.offset = 42
> }
>

This sounds like an interesting approach, I hadn't considered this sort 
of decap action.
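
For example, with your (level, offset) definition the ingress 
decapsulation rule could carry a single decap action (structure and 
field names as in your sketch above, RTE_PTYPE_* used purely for 
illustration):

struct rte_flow_action_decap decap_to_inner_eth = {
        .level = RTE_PTYPE_L4_UDP,
        .offset = sizeof(struct vxlan_hdr) };

struct rte_flow_action_queue queue_action = { .index = qid };

struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_to_inner_eth },
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
        { .type = RTE_FLOW_ACTION_TYPE_END } };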

>> /** VERY IMPORTANT NOTE **/
>> One of the core concepts of this proposal is that actions which modify the
>> packet are defined in the order which they are to be processed. So first
>> decap outer ethernet header, then the outer TEP headers.
>> I think this is not only logical from a usability point of view, it should also
>> simplify the logic required in PMDs to parse the desired actions.
>>
>> struct rte_flow *flow =
>>                                rte_flow_create(port_id, &attr, pattern, actions, &err);
>>
>> The processed packets are delivered to specifed queue with mbuf metadata
>> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
>>
>>      +-----+------+-----+---------+-----+
>>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>      +-----+------+-----+---------+-----+
>>
>>
>> Ingress TEP decapsulation switch to port:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> This is intended to represent how a TEP decapsulation could be configured in
>> a switching offload case, it makes an assumption that there is a logical port
>> representation for all ports on the hw switch in the DPDK application, but
>> similar functionality could be achieved by specifying something like a VF ID of
>> the device.
>>
>> Like the previous scenario the flows definition for TEP decapsulation actions
>> should specify the full outer packet to be matched at a minimum but also
>> define the elements of the inner match to match against including masks if
>> required.
>>
>> struct rte_flow_attr attr = { .ingress = 1 };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item,
>> .mask = &tep_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item,
>> .mask = &eth_mask }
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = &inner_ipv4_item,
>> .mask = &ipv4_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item,
>> .mask = &tcp_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END } };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_decap decap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
>>
>> struct rte_flow_action_decap decap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>>                 .item = &outer_tep_item
>> };
>>
>> struct rte_flow_action_port port_action = { .index = port_id };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END } };
>>
>> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions,
>> &err);
>>
>> This action will forward the decapsulated packets to another port of the
>> switch fabric but no information will on the tunnel or the fact that the packet
>> was decapsulated will be passed with it, thereby enable segregation of the
>> infrastructure and
>>
>>
>> Egress TEP encapsulation:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Encapulsation TEP actions require the flow definitions for the source packet
>> and then the actions to do on that, this example shows a ipv4/tcp packet
>> action.
>>
>> Source Packet
>>
>>      +-----+------+-----+---------+-----+
>>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>      +-----+------+-----+---------+-----+
>>
>> struct rte_flow_attr attr = { .egress = 1 };
>>
>> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type =
>> ether_type }; struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr =
>> src_addr, .dst_addr = dst_addr } }; struct rte_flow_item_udp tcp_item = {
>> .hdr = { .src_port = src_port, .dst_port = dst_port } };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END } };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_encap encap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type } };
>>
>> struct rte_flow_action_encap encap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>>                 .item = { .tep = tep, .id = vni } }; struct rte_flow_action_mark
>> port_action = { .index = port_id };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END } } struct rte_flow *flow =
>> rte_flow_create(port_id, &attr, pattern, actions, &err);
>>
>>
>>        encapsulating Outer Hdr
>>       /                       \                                      outer crc
>>      /                         \                                   /          \
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>      | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>
>>
>>
>> Chaining multiple modification actions eg IPsec and TEP
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> For example the definition for full hw acceleration for an IPsec ESP/Transport
>> SA encapsulated in a vxlan tunnel would look something like:
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session
>> },
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END } }
>>
>> 1. Source Packet
>>                             +-----+------+-----+---------+-----+
>>                             | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>                             +-----+------+-----+---------+-----+
>>
>> 2. First Action - Tunnel Endpoint Encapsulation
>>
>>        +------+-----+-------+-----+------+-----+---------+-----+
>>        | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>        +------+-----+-------+-----+------+-----+---------+-----+
>>
>> 3. Second Action - IPsec ESP/Transport Security Processing
>>
>>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>>        | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>>
>> 4. Third Action - Outer Ethernet Encapsulation
>>
>> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
>> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER | OUTER CRC |
>> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
>>
>> This example demonstrates the importance of making the interoperation of
>> actions to be ordered, as in the above example, a security action can be
>> defined on both the inner and outer packet by simply placing another
>> security action at the beginning of the action list.
>>
>> It also demonstrates the rationale for not collapsing the Ethernet into the TEP
>> definition as when you have multiple encapsulating actions, all could
>> potentially be the place where the Ethernet header needs to be defined.
>>
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-01-23 15:35         ` Doherty, Declan
@ 2018-02-01 19:59           ` Shahaf Shuler
  0 siblings, 0 replies; 15+ messages in thread
From: Shahaf Shuler @ 2018-02-01 19:59 UTC (permalink / raw)
  To: Doherty, Declan, John Daley (johndale), dev; +Cc: Adrien Mazarguil, Yuanhan Liu

Hi Declan, sorry for the late response. 

Tuesday, January 23, 2018 5:36 PM, Doherty, Declan:
> > If I get it right, the API proposed here is to have a tunnel endpoint which is
> a logical port on top of ethdev port. the TEP is able to receive and monitor
> some specific tunneled traffic, for example VXLAN, GENEVE and more.
> > For example, VXLAN TEP can have multiple flows with different VNIs all
> under the same context.
> >
> > Now, with the current rte_flow APIs, we can do exactly the same and give
> the application the full flexibility to group the tunnel flows into logical TEP.
> > On this suggestion application will:
> > 1. Create rte_flow rules for the pattern it want to receive.
> > 2. In case it is interested in counting, a COUNT action will be added to the
> flow.
> > 3. In case header manipulation is required, a DECAP/ENCAP/REWRITE
> action will be added to the flow.
> > 4. Grouping of flows into a logical TEP will be done on the application layer
> simply by keeping the relevant rte_flow rules in some dedicated struct. With
> it, create/destroy TEP can be translated to create/destroy the flow rules.
> Statistics query can be done be querying each flow count and sum. Note that
> some devices can support the same counter for multiple flows. Even though
> it is not yet exposed in rte_flow this can be an interesting optimization.
> 
> As I responsed in John's mail I think this approach fails in devices which
> support switching offload also. As the flows never hit the host application
> configuring the TEP and flows there is no easy way to sum those statistics,

Devices which support switching offloads must use NIC support to count the flows. It can be done either by associating a COUNT action with a flow or by using a TEP as in your proposal.
The TEP counting could be introduced in another way - instead of having a 1:1 relation between flow counter and rte_flow, we could introduce a counter element which can be attached to multiple flows (a sketch follows the list below). 
So this counter element, along with the rte_flows it is associated with, is basically the TEP:
1. it holds the sum of statistics from all the TEP flows it is associated with.
2. it holds the receive pattern 
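
A rough sketch of that counter element idea (everything below is 
hypothetical, none of the names exist in rte_flow today):

/* hypothetical shared counter object, created once per logical TEP */
struct rte_flow_shared_counter *tep_cnt = rte_flow_shared_counter_create(port_id);

/* hypothetical conf attaching the shared counter to a flow's COUNT action */
struct rte_flow_action_count count_conf = { .shared_counter = tep_cnt };

struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_COUNT, .conf = &count_conf },
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_conf },
        { .type = RTE_FLOW_ACTION_TYPE_END } };

/* the same count_conf would be reused on every rule (VNI) belonging to
 * the TEP, so querying tep_cnt returns the sum over all of them */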

My point is, I don't think it is correct to bind the TEP to the switching offload actions (encap/decap/rewrite in this context). 
The TEP can be presented as an auxiliary library/API to help with grouping the flows, however the application still needs the ability to control the switch offloads as it wishes. 

> also flows are transitory in terms of runtime so it would not be possible to
> keep accurate statistics over a period of time.

I am not sure I understand what you mean here. 
In order to receive traffic you need flows. Even the default RSS configuration of the PMD can be described by rte_flows (see the sketch below). 
So as long as one receives traffic, there are one or more flows configured on the device. 
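
For example, even a plain "accept everything and spread it over the rx 
queues" configuration can be written as a flow rule (the RSS action conf 
is omitted here for brevity):

struct rte_flow_error err;
struct rte_flow_attr attr = { .ingress = 1 };

struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },  /* match all traffic */
        { .type = RTE_FLOW_ITEM_TYPE_END } };

struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_RSS },  /* .conf omitted */
        { .type = RTE_FLOW_ACTION_TYPE_END } };

struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);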

> 
> 
> >
> >>>
> >>>> As for the capabilities - what specifically you had in mind? The
> >>>> current
> >>> usage you show with tep is with rte_flow rules. There are no
> >>> capabilities currently for rte_flow supported actions/pattern. To
> >>> check such capabilities application uses rte_flow_validate.
> >>>
> >>> I envisaged that the application should be able to see if an ethdev
> >>> can support TEP in the rx/tx offloads, and then the
> >>> rte_tep_capabilities would allow applications to query what tunnel
> >>> endpoint protocols are supported etc. I would like a simple
> >>> mechanism to allow users to see if a particular tunnel endpoint type
> >>> is supported without having to build actual flows to validate.
> >>
> >> I can see the value of that, but in the end wouldn't the API call
> >> rte_flow_validate anyways? Maybe we don't add the layer now or maybe
> >> it doesn't really belong in DPDK? I'm in favor of deferring the
> >> capabilities API until we know it's really needed.  I hate to see
> >> special capabilities APIs start sneaking in after we decided to go
> >> the rte_flow_validate route and users are starting to get used to it.
> >
> > I don't see how it is different from any other rte_flow creation.
> > We don't hold caps for device ability to filter packets according to VXLAN or
> GENEVE items. Why we should start now?
> 
> I don't know, possibly if it makes adoption of the features easier for the end
> user.
> 
> >
> > We have already the rte_flow_veirfy. I think part of the reasons for it was
> that the number of different capabilities possible with rte_flow is huge. I
> think this also the case with the TEP capabilities (even though It is still not
> clear to me what exactly they will include).
> 
> It may be that only need advertise that we are capable of encap/decap
> services, but it would be good to have input from downstream users what
> they would like to see.
> 
> >
> >>>
> >>>> Regarding the creation/destroy of tep. Why not simply use rte_flow
> >>>> API
> >>> and avoid this extra control?
> >>>> For example - with 17.11 APIs, application can put the port in
> >>>> isolate mode,
> >>> and insert a flow_rule to catch only IPv4 VXLAN traffic and direct
> >>> to some queue/do RSS. Such operation, per my understanding, will
> >>> create a tunnel endpoint. What are the down sides of doing it with
> >>> the current
> >> APIs?
> >>>
> >>> That doesn't enable encapsulation and decapsulation of the outer
> >>> tunnel endpoint in the hw as far as I know. Apart from the inability
> >>> to monitor the endpoint statistics I mentioned above. It would also
> >>> require that you redefine the endpoints parameters ever time to you
> >>> wish to add a new flow to it. I think the having the rte_tep object
> >>> semantics should also simplify the ability to enable a full vswitch
> >>> offload of TEP where the hw is handling both encap/decap and
> >>> switching to
> >> a particular port.
> >>
> >> If we have the ingress/decap and egress/encap actions and 1 rte_flow
> >> rule per TEP and use the COUNT action, I think we get all but the
> >> last bit. For that, perhaps the application could keep  ingress and
> >> egress rte_flow template for each tunnel type (VxLAN, GRE, ..). Then
> >> copying the template and filling in the outer packet info and tunnel
> >> Id is all that would be required. We could also define these in rte_flow.h?
> >>
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>> To direct traffic flows to hw terminated tunnel endpoint the
> >>>>> rte_flow API is enhanced to add a new flow item type. This
> >>>>> contains a pointer to the TEP context as well as the overlay flow
> >>>>> id to which the traffic flow is
> >>> associated.
> >>>>>
> >>>>> struct rte_flow_item_tep {
> >>>>>                  struct rte_tep *tep;
> >>>>>                  uint32_t flow_id; }
> >>>>
> >>>> Can you provide more detailed definition about the flow id ? to
> >>>> which field
> >>> from the packet headers it refers to?
> >>>> On your below examples it looks like it is to match the VXLAN vni
> >>>> in case of
> >>> VXLAN, what about the other protocols? And also, why not using the
> >>> already exists VXLAN item?
> >>>
> >>> I have only been looking initially at couple of the tunnel endpoint
> >>> procotols, namely Geneve, NvGRE, and VxLAN, but the idea here is to
> >>> allow the user to define the VNI in the case of Geneve and VxLAN and
> >>> the VSID in the case of NvGRE on a per flow basis, as per my
> >>> understanding these are used to identify the source/destination
> >>> hosts on the overlay network independently from the endpoint there
> >>> are
> >> transported across.
> >>>
> >>> The VxLAN item is used in the creation of the TEP object, using the
> >>> TEP object just removes the need for the user to constantly redefine
> >>> all the tunnel parameters and also I think dependent on the hw
> >>> implementation it may simplify the drivers work if it know the exact
> >>> endpoint the actions is for instead of having to look it up on each
> >>> flow
> >> addition.
> >>>
> >>>>
> >>>> Generally I like the idea of separating the encap/decap context
> >>>> from the
> >>> action. However looks like the rte_flow_item has double meaning on
> >>> this RFC, once for the classification and once for the action.
> >>>>   From the top of my head I would think of an API which separate
> >>>> those, and
> >>> re-use the existing flow items. Something like:
> >>>>
> >>>>    struct rte_flow_item pattern[] = {
> >>>>                   { set of already exists pattern  },
> >>>>                   { ... },
> >>>>                   { .type = RTE_FLOW_ITEM_TYPE_END } };
> >>>>
> >>>> encap_ctx = create_enacap_context(pattern)
> >>>>
> >>>> rte_flow_action actions[] = {
> >>>> 	{ .type RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx} }
> >>>
> >>> I not sure I fully understand what you're asking here, but in
> >>> general for encap you only would define the inner part of the packet
> >>> in the match pattern criteria and the actual outer tunnel headers
> >>> would be
> >> defined in the action.
> >>>
> >>> I guess there is some replication in the decap side as proposed, as
> >>> the TEP object is used in both the pattern and the action, possibly
> >>> you could get away with having no TEP object defined in the action
> >>> data, but I prefer keeping the API symmetrical for encap/decap
> >>> actions at the shake of some extra verbosity.
> >>>
> >>>>
> >>> ...
> >>>>
> >


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2017-12-21 22:21 [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement Doherty, Declan
                   ` (2 preceding siblings ...)
  2018-01-11 21:44 ` John Daley (johndale)
@ 2018-02-13 17:05 ` Adrien Mazarguil
  2018-02-26 17:44   ` Doherty, Declan
  3 siblings, 1 reply; 15+ messages in thread
From: Adrien Mazarguil @ 2018-02-13 17:05 UTC (permalink / raw)
  To: Doherty, Declan
  Cc: dev, Shahaf Shuler, John Daley (johndale),
	Boris Pismenny, Nelio Laranjeiro

Hi,

Apologies for being late to this thread, I've read the ensuing discussion
(hope I didn't miss any) and also think rte_flow could be improved in
several ways to enable TEP support, in particular regarding the ordering of
actions.

On the other hand I'm not sure a dedicated API for TEP is needed at all. I'm
not convinced rte_security chose the right path and would like to avoid
repeating the same mistakes if possible, more below.

On Thu, Dec 21, 2017 at 10:21:13PM +0000, Doherty, Declan wrote:
> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that when used
> in conjunction with rte_flow enables the configuration of inline data path encapsulation
> and decapsulation of tunnel endpoint network overlays on accelerated IO devices.
> 
> The proposed new API would provide for the creation, destruction, and
> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs to allow the
> acceleration features to be discovered by applications.
> 
> /** Tunnel Endpoint context, opaque structure */
> struct rte_tep;
> 
> enum rte_tep_type {
>                RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>                RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>                ...
> };
> 
> /** Tunnel Endpoint Attributes */
> struct rte_tep_attr {
>                enum rte_type_type type;
> 
>                /* other endpoint attributes here */
> }
> 
> /**
> * Create a tunnel end-point context as specified by the flow attribute and pattern
> *
> * @param   port_id     Port identifier of Ethernet device.
> * @param   attr        Flow rule attributes.
> * @param   pattern     Pattern specification by list of rte_flow_items.
> * @return
> *  - On success returns pointer to TEP context
> *  - On failure returns NULL
> */
> struct rte_tep *rte_tep_create(uint16_t port_id,
>                               struct rte_tep_attr *attr, struct rte_flow_item pattern[])
> 
> /**
> * Destroy an existing tunnel end-point context. All the end-points context
> * will be destroyed, so all active flows using tep should be freed before
> * destroying context.
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
> 
> /**
> * Get tunnel endpoint statistics
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   tep        Tunnel endpoint context
> * @param   stats      Tunnel endpoint statistics
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> Int
> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>                               struct rte_tep_stats *stats)
> 
> /**
> * Get ports tunnel endpoint capabilities
> *
> * @param   port_id    Port identifier of Ethernet device.
> * @param   capabilities        Tunnel endpoint capabilities
> *
> * @return
> *  - On success returns 0
> *  - On failure returns 1
> */
> int
> rte_tep_capabilities_get(uint16_t port_id,
>                               struct rte_tep_capabilities *capabilities)
> 
> 
> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
> enhanced to add a new flow item type. This contains a pointer to the
> TEP context as well as the overlay flow id to which the traffic flow is
> associated.
> 
> struct rte_flow_item_tep {
>                struct rte_tep *tep;
>                uint32_t flow_id;
> }

What I dislike is rte_flow item/actions relying on externally-generated
opaque objects when these can be avoided, as it means yet another API
applications have to deal with and PMDs need to implement; this adds a layer
of inefficiency in my opinion.

I believe TEP can be fully implemented through a combination of new rte_flow
pattern items/actions without involving external API calls. More on that
later.

> Also 2 new generic actions types are added encapsulation and decapsulation.
> 
> RTE_FLOW_ACTION_TYPE_ENCAP
> RTE_FLOW_ACTION_TYPE_DECAP
> 
> struct rte_flow_action_encap {
>                struct rte_flow_item *item;
> }
> 
> struct rte_flow_action_decap {
>                struct rte_flow_item *item;
> }

Encap/decap actions are definitely needed and useful, no question about
that. I'm unsure about doing so through a generic action with the described
structures instead of dedicated ones though.

These can't work with anything other than rte_flow_item_tep; a special
pattern item using some kind of opaque object is needed (e.g. using
rte_flow_item_tcp makes no sense with them).

Also struct rte_flow_item is tailored for flow rule patterns, using it with
actions is not only confusing, it makes its "mask" and "last" members
useless and inconsistent with their documentation.

Although I'm not convinced an opaque object is the right approach, if we
choose this route I suggest the much simpler:

 struct rte_flow_action_tep_(encap|decap) {
     struct rte_tep *tep;
     uint32_t flow_id;
 };
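
For example (reusing the structure names suggested above, everything 
here is still only a sketch):

 struct rte_flow_action_tep_encap encap_conf = { .tep = tep, .flow_id = vni };

 struct rte_flow_action egress_actions[] = {
     { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_conf },
     { .type = RTE_FLOW_ACTION_TYPE_END }
 };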

> The following section outlines the intended usage of the new APIs and then how
> they are combined with the existing rte_flow APIs.
> 
> Tunnel endpoints are created on logical ports which support the capability
> using rte_tep_create() using a combination of TEP attributes and
> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is being defined.
> The attrs parameter sets the TEP type, and could be used for other possible
> attributes.
> 
> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
> 
> The values for the headers which make up the tunnel endpointr are then
> defined using spec parameter in the rte flow items (IPv4, UDP and
> VxLAN in this case)
> 
> struct rte_flow_item_ipv4 ipv4_item = {
>                .hdr = { .src_addr = saddr, .dst_addr = daddr }
> };
> 
> struct rte_flow_item_udp udp_item = {
>                .hdr = { .src_port = sport, .dst_port = dport }
> };
> 
> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>                { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
>                { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> The tunnel endpoint can then be create on the port. Whether or not any hw
> configuration is required at this point would be hw dependent, but if not
> the context for the TEP is available for use in programming flow, so the
> application is not forced to redefine the TEP parameters on each flow
> addition.
> 
> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
> 
> Once the tep context is created flows can then be directed to that endpoint for
> processing. The following sections will outline how the author envisage flow
> programming will work and also how TEP acceleration can be combined with other
> accelerations.

In order to allow a single TEP context object to be shared by multiple flow
rules, a whole new API must be implemented and applications still have to
additionally create one rte_flow rule per TEP flow_id to manage. While this
probably results in shorter flow rule patterns and action lists, is it
really worth it?

While I understand the reasons for this approach, I'd like to push for an
rte_flow-only API as much as possible; I'll provide suggestions below.

> Ingress TEP decapsulation, mark and forward to queue:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The flows definition for TEP decapsulation actions should specify the full
> outer packet to be matched at a minimum. The outer packet definition should
> match the tunnel definition in the tep context and the tep flow id. This
> example shows describes matching on the outer, marking the packet with the
> VXLAN VNI and directing to a specified queue of the port.
> 
> Source Packet
> 
>        Decapsulate Outer Hdr
>      /                       \                                    decap outer crc
>     /                         \                                    /          \
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> /* Flow Attributes/Items Definitions */
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
> struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>                { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
> };
> 
> struct rte_flow_action_decap decap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
> .spec = &tep_item
> };
> 
> struct rte_flow_action_queue queue_action = { .index = qid };
> 
> struct rte_flow_action_port mark_action = { .index = vni };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
>                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END }
> };

Assuming there is no dedicated TEP API, how about something like the
following pseudo-code for a VXLAN-based TEP instead:

 attr = ingress;
 pattern = eth / ipv6 / udp / vxlan vni is 42 / end;
 actions = vxlan_decap / mark id 92 / queue index 8 / end;
 
 flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
 ...

The VXLAN_DECAP action and its parameters (if any) remain to be defined;
however, VXLAN decapsulation implies removing all layers up to and including
the first VXLAN header encountered. Also, if supported/accepted by a PMD:

 attr = ingress;
 pattern = eth / any / udp / vxlan vni is 42 / end;
 actions = vxlan_decap / mark id 92 / queue index 8 / end;

=> Both outer IPv4 and IPv6 traffic taken into account at once.

 attr = ingress;
 pattern = end;
 actions = vxlan_decap / mark id 92 / queue index 8 / end;

=> All recognized VXLAN traffic regardless of VNI is acted upon. The rest
   simply passes through.
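
In C terms, purely as a sketch (the VXLAN_DECAP action does not exist yet
and is assumed here to take no configuration; port_id is defined by the
application and rte_flow.h is included), the first of the rules above could
look like:

 struct rte_flow_attr attr = { .ingress = 1 };

 struct rte_flow_item_vxlan vxlan_spec = { .vni = { 0, 0, 42 } };
 struct rte_flow_item_vxlan vxlan_mask = { .vni = { 0xff, 0xff, 0xff } };

 struct rte_flow_item pattern[] = {
     { .type = RTE_FLOW_ITEM_TYPE_ETH },
     { .type = RTE_FLOW_ITEM_TYPE_IPV6 },
     { .type = RTE_FLOW_ITEM_TYPE_UDP },
     { .type = RTE_FLOW_ITEM_TYPE_VXLAN,
       .spec = &vxlan_spec, .mask = &vxlan_mask },
     { .type = RTE_FLOW_ITEM_TYPE_END },
 };

 struct rte_flow_action_mark mark = { .id = 92 };
 struct rte_flow_action_queue queue = { .index = 8 };

 struct rte_flow_action actions[] = {
     /* RTE_FLOW_ACTION_TYPE_VXLAN_DECAP is an assumed name, not yet defined */
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
     { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
     { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };

 struct rte_flow_error err;
 struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);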

> /** VERY IMPORTANT NOTE **/
> One of the core concepts of this proposal is that actions which modify the
> packet are defined in the order which they are to be processed. So first decap
> outer ethernet header, then the outer TEP headers.
> I think this is not only logical from a usability point of view, it should also
> simplify the logic required in PMDs to parse the desired actions.

This. I've been thinking about it for a very long time but never got around
to submitting a patch. Handling rte_flow actions in order, allowing repeated
identical actions and therefore getting rid of DUP.

The current approach was a bad design decision on my part, I'm convinced
it must be redefined before combinations become commonplace (right now no
PMD implements any action whose order matters, as far as I know).

> struct rte_flow *flow =
>                               rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> The processed packets are delivered to specifed queue with mbuf metadata
> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
> 
>     +-----+------+-----+---------+-----+
>     | ETH | IPv4 | TCP | PAYLOAD | CRC |
>     +-----+------+-----+---------+-----+

Yes, except for the CRC part which would be optional depending on PMD/HW
capabilities. Not a big deal.

> Ingress TEP decapsulation switch to port:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> This is intended to represent how a TEP decapsulation could be configured
> in a switching offload case, it makes an assumption that there is a logical
> port representation for all ports on the hw switch in the DPDK application,
> but similar functionality could be achieved by specifying something like a
> VF ID of the device.
> 
> Like the previous scenario the flows definition for TEP decapsulation actions
> should specify the full outer packet to be matched at a minimum but also
> define the elements of the inner match to match against including masks if
> required.
> 
> struct rte_flow_attr attr = { .ingress = 1 };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item, .mask = &tep_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item, .mask = &eth_mask }
>                { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = &inner_ipv4_item, .mask = &ipv4_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item, .mask = &tcp_mask },
>                { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_decap decap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
> };
> 
> struct rte_flow_action_decap decap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
>                .item = &outer_tep_item
> };
> 
> struct rte_flow_action_port port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END }
> };
> 
> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> This action will forward the decapsulated packets to another port of the switch
> fabric but no information will on the tunnel or the fact that the packet was
> decapsulated will be passed with it, thereby enable segregation of the
> infrastructure and

Again a suggestion without a dedicated TEP API, matching outer and some
inner as well:

 attr = ingress;
 pattern = eth / ipv6 / udp / vxlan vni is 42 / eth / ipv4 / tcp / end;
 actions = vxlan_decap / port index 3 / end;
 /* or */
 actions = vxlan_decap / vf id 5 / end;

The PORT action should be defined as well as the converse of the existing
PORT pattern item (matching an arbitrary physical port). Specifying a PORT
action would steer traffic to a nondefault physical port.

The VF action is already correctly defined.
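
The action side of the VF variant might then be sketched as follows (again
with the hypothetical VXLAN_DECAP action, while the VF action already
exists):

 struct rte_flow_action_vf vf = { .id = 5 };

 struct rte_flow_action actions[] = {
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },    /* hypothetical */
     { .type = RTE_FLOW_ACTION_TYPE_VF, .conf = &vf },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };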

> Egress TEP encapsulation:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Encapulsation TEP actions require the flow definitions for the source packet
> and then the actions to do on that, this example shows a ipv4/tcp packet
> action.
> 
> Source Packet
> 
>     +-----+------+-----+---------+-----+
>     | ETH | IPv4 | TCP | PAYLOAD | CRC |
>     +-----+------+-----+---------+-----+
> 
> struct rte_flow_attr attr = { .egress = 1 };
> 
> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
> struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr = src_addr, .dst_addr = dst_addr } };
> struct rte_flow_item_udp tcp_item = { .hdr = { .src_port = src_port, .dst_port = dst_port } };
> 
> struct rte_flow_item pattern[] = {
>                { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>                { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>                { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>                { .type = RTE_FLOW_ITEM_TYPE_END }
> };
> 
> /* Flow Actions Definitions */
> 
> struct rte_flow_action_encap encap_eth = {
>                .type = RTE_FLOW_ITEM_TYPE_ETH,
>                .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
> };
> 
> struct rte_flow_action_encap encap_tep = {
>                .type = RTE_FLOW_ITEM_TYPE_TEP,
>                .item = { .tep = tep, .id = vni }
> };
> struct rte_flow_action_mark port_action = { .index = port_id };
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>                { .type = RTE_FLOW_ACTION_TYPE_END }
> }
> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
> 
> 
>       encapsulating Outer Hdr
>      /                       \                                      outer crc
>     /                         \                                   /          \
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>     | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>     +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+

I see three main use cases for egress since we do not want a PMD to parse
traffic in software to determine whether it's a candidate for TEP encapsulation:

1. Traffic generated/forwarded by an application.
2. Same as 1. assuming an application is aware hardware can match egress
   traffic in addition to encapsulate it.
3. Traffic fully processed internally in hardware.

To handle 1., in my opinion the most common use case, PMDs should rely on an
application-provided mark pattern item (the converse of the MARK action):

 attr = egress;
 pattern = mark is 42 / end;
 actions = vxlan_encap {many parameters} / end;

To handle 2, hardware with the ability to recognize and encapsulate outgoing
traffic is required (applications can rely on rte_flow_validate()):

 attr = egress;
 pattern = eth / ipv4 / tcp / end;
 actions = vxlan_encap {many parameters} / end;

For 3, a combination of ingress and egress can be used as needed on a given
rule. For clarity, one should assert where traffic comes from and where it's
supposed to go:

 attr = ingress egress;
 pattern = eth / ipv4 / tcp / port id 0 / end;
 actions = vxlan_encap {many parameters} / vf id 5 / end; 

The {many parameters} for VXLAN_ENCAP obviously remain to be defined;
they have to either include everything needed to construct L2, L3, L4 and
VXLAN headers, or be split into separate actions for each layer specified in
innermost-to-outermost order.

No need for dedicated mbuf TEP flags.
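
A rough sketch of use case 1, assuming a hypothetical MARK pattern item
(RTE_FLOW_ITEM_TYPE_MARK with a struct rte_flow_item_mark) and a VXLAN_ENCAP
action whose configuration structure (vxlan_encap_conf below) remains to be
defined:

 struct rte_flow_attr attr = { .egress = 1 };

 struct rte_flow_item_mark mark_spec = { .id = 42 };   /* hypothetical item */

 struct rte_flow_item pattern[] = {
     { .type = RTE_FLOW_ITEM_TYPE_MARK, .spec = &mark_spec },
     { .type = RTE_FLOW_ITEM_TYPE_END },
 };

 struct rte_flow_action actions[] = {
     /* vxlan_encap_conf would carry the outer L2/L3/L4 + VXLAN definition */
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, .conf = &vxlan_encap_conf },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };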

> Chaining multiple modification actions eg IPsec and TEP
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> For example the definition for full hw acceleration for an IPsec ESP/Transport
> SA encapsulated in a vxlan tunnel would look something like:
> 
> struct rte_flow_action actions[] = {
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>                { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session },
>                { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>                { .type = RTE_FLOW_ACTION_TYPE_END }
> }
> 
> 1. Source Packet
>                            +-----+------+-----+---------+-----+
>                            | ETH | IPv4 | TCP | PAYLOAD | CRC |
>                            +-----+------+-----+---------+-----+
> 
> 2. First Action - Tunnel Endpoint Encapsulation
> 
>       +------+-----+-------+-----+------+-----+---------+-----+
>       | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>       +------+-----+-------+-----+------+-----+---------+-----+
> 
> 3. Second Action - IPsec ESP/Transport Security Processing
> 
>       +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>       | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>       +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
> 
> 4. Third Action - Outer Ethernet Encapsulation
> 
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER | OUTER CRC |
> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
> 
> This example demonstrates the importance of making the interoperation of
> actions to be ordered, as in the above example, a security
> action can be defined on both the inner and outer packet by simply placing
> another security action at the beginning of the action list.
> 
> It also demonstrates the rationale for not collapsing the Ethernet into
> the TEP definition as when you have multiple encapsulating actions, all
> could potentially be the place where the Ethernet header needs to be
> defined.

For completeness, here's a suggested alternative with neither dedicated TEP
nor security APIs:

 attr = egress;
 pattern = mark is 42 / end;
 actions = vxlan_encap {many parameters} / esp_encap {many parameters} / eth_encap {many parameters} / end;

Note ESP_ENCAP is not so easy given some data must be provided by the
application with each transmitted packet. The current security API does not
provide means to perform ESP encapsulation, it instead focuses on encryption
and relies on the application to prepare headers and allocate room for the
trailer. It's an unrealistic use case at the moment but shows the potential
of such an API.

- First question is what's your opinion regarding focusing on rte_flow
  instead of a TEP API? (Note for counters: one could add COUNT actions as
  well, what's currently missing is a way to share counters among several
  flow rules, which is planned as well)

- Regarding dedicated encap/decap actions instead of generic ones, given all
  protocols have different requirements (e.g. ESP encap is on a whole
  different level of complexity and likely needs callbacks)?

- Regarding the reliance on a MARK meta pattern item as a standard means for
  applications to tag egress traffic so a PMD knows what to do?

- I'd like to send a deprecation notice for rte_flow regarding handling of
  actions (documentation and change in some PMDs to reject currently valid
  but seldom used flow rules accordingly) instead of a new flow
  attribute. Would you ack such a change for 18.05?

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-02-13 17:05 ` Adrien Mazarguil
@ 2018-02-26 17:44   ` Doherty, Declan
  2018-03-05 16:23     ` Adrien Mazarguil
  0 siblings, 1 reply; 15+ messages in thread
From: Doherty, Declan @ 2018-02-26 17:44 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Shahaf Shuler, John Daley (johndale),
	Boris Pismenny, Nelio Laranjeiro

On 13/02/2018 5:05 PM, Adrien Mazarguil wrote:
> Hi,
> 
> Apologies for being late to this thread, I've read the ensuing discussion
> (hope I didn't miss any) and also think rte_flow could be improved in
> several ways to enable TEP support, in particular regarding the ordering of
> actions.
> 
> On the other hand I'm not sure a dedicated API for TEP is needed at all. I'm
> not convinced rte_security chose the right path and would like to avoid
> repeating the same mistakes if possible, more below.
> 
> On Thu, Dec 21, 2017 at 10:21:13PM +0000, Doherty, Declan wrote:
>> This RFC contains a proposal to add a new tunnel endpoint API to DPDK that when used
>> in conjunction with rte_flow enables the configuration of inline data path encapsulation
>> and decapsulation of tunnel endpoint network overlays on accelerated IO devices.
>>
>> The proposed new API would provide for the creation, destruction, and
>> monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs to allow the
>> acceleration features to be discovered by applications.
>>
>> /** Tunnel Endpoint context, opaque structure */
>> struct rte_tep;
>>
>> enum rte_tep_type {
>>                 RTE_TEP_TYPE_VXLAN = 1, /**< VXLAN Protocol */
>>                 RTE_TEP_TYPE_NVGRE,     /**< NVGRE Protocol */
>>                 ...
>> };
>>
>> /** Tunnel Endpoint Attributes */
>> struct rte_tep_attr {
>>                 enum rte_type_type type;
>>
>>                 /* other endpoint attributes here */
>> }
>>
>> /**
>> * Create a tunnel end-point context as specified by the flow attribute and pattern
>> *
>> * @param   port_id     Port identifier of Ethernet device.
>> * @param   attr        Flow rule attributes.
>> * @param   pattern     Pattern specification by list of rte_flow_items.
>> * @return
>> *  - On success returns pointer to TEP context
>> *  - On failure returns NULL
>> */
>> struct rte_tep *rte_tep_create(uint16_t port_id,
>>                                struct rte_tep_attr *attr, struct rte_flow_item pattern[])
>>
>> /**
>> * Destroy an existing tunnel end-point context. All the end-points context
>> * will be destroyed, so all active flows using tep should be freed before
>> * destroying context.
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   tep        Tunnel endpoint context
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> int rte_tep_destroy(uint16_t port_id, struct rte_tep *tep)
>>
>> /**
>> * Get tunnel endpoint statistics
>> *
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   tep        Tunnel endpoint context
>> * @param   stats      Tunnel endpoint statistics
>> *
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> Int
>> rte_tep_stats_get(uint16_t port_id, struct rte_tep *tep,
>>                                struct rte_tep_stats *stats)
>>
>> /**
>> * Get ports tunnel endpoint capabilities
>> *
>> * @param   port_id    Port identifier of Ethernet device.
>> * @param   capabilities        Tunnel endpoint capabilities
>> *
>> * @return
>> *  - On success returns 0
>> *  - On failure returns 1
>> */
>> int
>> rte_tep_capabilities_get(uint16_t port_id,
>>                                struct rte_tep_capabilities *capabilities)
>>
>>
>> To direct traffic flows to hw terminated tunnel endpoint the rte_flow API is
>> enhanced to add a new flow item type. This contains a pointer to the
>> TEP context as well as the overlay flow id to which the traffic flow is
>> associated.
>>
>> struct rte_flow_item_tep {
>>                 struct rte_tep *tep;
>>                 uint32_t flow_id;
>> }
> 
> What I dislike is rte_flow item/actions relying on externally-generated
> opaque objects when these can be avoided, as it means yet another API
> applications have to deal with and PMDs need to implement; this adds a layer
> of inefficiency in my opinion.
> 
> I believe TEP can be fully implemented through a combination of new rte_flow
> pattern items/actions without involving external API calls. More on that
> later.
> 
>> Also 2 new generic actions types are added encapsulation and decapsulation.
>>
>> RTE_FLOW_ACTION_TYPE_ENCAP
>> RTE_FLOW_ACTION_TYPE_DECAP
>>
>> struct rte_flow_action_encap {
>>                 struct rte_flow_item *item;
>> }
>>
>> struct rte_flow_action_decap {
>>                 struct rte_flow_item *item;
>> }
> 
> Encap/decap actions are definitely needed and useful, no question about
> that. I'm unsure about doing so through a generic action with the described
> structures instead of dedicated ones though.
> 
> These can't work with anything other than rte_flow_item_tep; a special
> pattern item using some kind of opaque object is needed (e.g. using
> rte_flow_item_tcp makes no sense with them).
> 
> Also struct rte_flow_item is tailored for flow rule patterns, using it with
> actions is not only confusing, it makes its "mask" and "last" members
> useless and inconsistent with their documentation.
> 
> Although I'm not convinced an opaque object is the right approach, if we
> choose this route I suggest the much simpler:
> 
>   struct rte_flow_action_tep_(encap|decap) {
>       struct rte_tep *tep;
>       uint32_t flow_id;
>   };
> 

That's a fair point. The only other item we currently had the encap/decap
actions supporting was the Ethernet item, and going back to a comment from
Boris, keeping the Ethernet header separate from the tunnel is probably not
ideal anyway. One of our reasons for using an opaque TEP item was to allow
modification of the TEP independently of all the flows being carried on it.
So, for instance, if the src or dst MAC needs to be modified or the output
port needs to be changed, the TEP itself could be modified without touching
each individual flow.


>> The following section outlines the intended usage of the new APIs and then how
>> they are combined with the existing rte_flow APIs.
>>
>> Tunnel endpoints are created on logical ports which support the capability
>> using rte_tep_create() using a combination of TEP attributes and
>> rte_flow_items. In the example below a new IPv4 VxLAN endpoint is being defined.
>> The attrs parameter sets the TEP type, and could be used for other possible
>> attributes.
>>
>> struct rte_tep_attr attrs = { .type = RTE_TEP_TYPE_VXLAN };
>>
>> The values for the headers which make up the tunnel endpointr are then
>> defined using spec parameter in the rte flow items (IPv4, UDP and
>> VxLAN in this case)
>>
>> struct rte_flow_item_ipv4 ipv4_item = {
>>                 .hdr = { .src_addr = saddr, .dst_addr = daddr }
>> };
>>
>> struct rte_flow_item_udp udp_item = {
>>                 .hdr = { .src_port = sport, .dst_port = dport }
>> };
>>
>> struct rte_flow_item_vxlan vxlan_item = { .flags = vxlan_flags };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> The tunnel endpoint can then be create on the port. Whether or not any hw
>> configuration is required at this point would be hw dependent, but if not
>> the context for the TEP is available for use in programming flow, so the
>> application is not forced to redefine the TEP parameters on each flow
>> addition.
>>
>> struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
>>
>> Once the tep context is created flows can then be directed to that endpoint for
>> processing. The following sections will outline how the author envisage flow
>> programming will work and also how TEP acceleration can be combined with other
>> accelerations.
> 
> In order to allow a single TEP context object to be shared by multiple flow
> rules, a whole new API must be implemented and applications still have to
> additionally create one rte_flow rule per TEP flow_id to manage. While this
> probably results in shorter flow rule patterns and action lists, is it
> really worth it?
> 
> While I understand the reasons for this approach, I'd like to push for a
> rte_flow-only API as much as possible, I'll provide suggestions below.
> 

Not only are the rules shorter to implement, it could also greatly reduce
the number of cycles required to add flows, both in terms of the application
marshaling the data into rte_flow patterns and the PMD parsing those
patterns every time a flow is added. In a case where tens of thousands of
flows are being added per second, this could add significant overhead to
the system.


>> Ingress TEP decapsulation, mark and forward to queue:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> The flows definition for TEP decapsulation actions should specify the full
>> outer packet to be matched at a minimum. The outer packet definition should
>> match the tunnel definition in the tep context and the tep flow id. This
>> example shows describes matching on the outer, marking the packet with the
>> VXLAN VNI and directing to a specified queue of the port.
>>
>> Source Packet
>>
>>         Decapsulate Outer Hdr
>>       /                       \                                    decap outer crc
>>      /                         \                                    /          \
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>      | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>
>> /* Flow Attributes/Items Definitions */
>>
>> struct rte_flow_attr attr = { .ingress = 1 };
>>
>> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
>> struct rte_flow_item_tep tep_item = { .tep = tep, .id = vni };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item  },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_decap decap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
>> };
>>
>> struct rte_flow_action_decap decap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>> .spec = &tep_item
>> };
>>
>> struct rte_flow_action_queue queue_action = { .index = qid };
>>
>> struct rte_flow_action_port mark_action = { .index = vni };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> };
> 
> Assuming there is no dedicated TEP API, how about something like the
> following pseudo-code for a VXLAN-based TEP instead:
> 
>   attr = ingress;
>   pattern = eth / ipv6 / udp / vxlan vni is 42 / end;
>   actions = vxlan_decap / mark id 92 / queue index 8 / end;
>   
>   flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
>   ...
> 
> The VXLAN_DECAP action and its parameters (if any) remain to be defined,
> however VXLAN implies all layers up to and including the first VXLAN header
> encountered. Also, if supported/accepted by a PMD:

I think the idea of parsing up to the VxLAN header makes sense; it would
also make sense if we go with the opaque TEP object as well.

> 
>   attr = ingress;
>   pattern = eth / any / udp / vxlan vni is 42 / end;
>   actions = vxlan_decap / mark id 92 / queue index 8 / end;
> 
> => Both outer IPv4 and IPv6 traffic taken into account at once.
> 
>   attr = ingress;
>   pattern = end;
>   actions = vxlan_decap / mark id 92 / queue index 8 / end;
> 
> => All recognized VXLAN traffic regardless of VNI is acted upon. The rest
>     simply passes through.
> 
>> /** VERY IMPORTANT NOTE **/
>> One of the core concepts of this proposal is that actions which modify the
>> packet are defined in the order which they are to be processed. So first decap
>> outer ethernet header, then the outer TEP headers.
>> I think this is not only logical from a usability point of view, it should also
>> simplify the logic required in PMDs to parse the desired actions.
> 
> This. I've been thinking about it for a very long time but never got around
> submit a patch. Handling rte_flow actions in order, allowing repeated
> identical actions and therefore getting rid of DUP. >
> The current approach was a bad design decision from my part, I'm convinced
> it must be redefined before combinations become commonplace (right now no
> PMD implements any action whose order matters as far as I know).
> 

I don't think it was an issue with the original implementation, as it
doesn't really become an issue until we start working with packet
modifications. To that point, I think we only need to impose action ordering
on actions which modify the packet itself. Actions like counting, marking,
and selecting output, be it port/pf/vf/queue/rss, are all independent of the
actions which modify the packet.

>> struct rte_flow *flow =
>>                                rte_flow_create(port_id, &attr, pattern, actions, &err);
>>
>> The processed packets are delivered to specifed queue with mbuf metadata
>> denoting marked flow id and with mbuf ol_flags PKT_RX_TEP_OFFLOAD set.
>>
>>      +-----+------+-----+---------+-----+
>>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>      +-----+------+-----+---------+-----+
> 
> Yes, except for the CRC part which would be optional depending on PMD/HW
> capabilities. Not a big deal.

Sure.

>> Ingress TEP decapsulation switch to port:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> This is intended to represent how a TEP decapsulation could be configured
>> in a switching offload case, it makes an assumption that there is a logical
>> port representation for all ports on the hw switch in the DPDK application,
>> but similar functionality could be achieved by specifying something like a
>> VF ID of the device.
>>
>> Like the previous scenario the flows definition for TEP decapsulation actions
>> should specify the full outer packet to be matched at a minimum but also
>> define the elements of the inner match to match against including masks if
>> required.
>>
>> struct rte_flow_attr attr = { .ingress = 1 };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &outer_eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &outer_tep_item, .mask = &tep_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &inner_eth_item, .mask = &eth_mask }
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPv4, .spec = &inner_ipv4_item, .mask = &ipv4_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &inner_tcp_item, .mask = &tcp_mask },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_decap decap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
>> };
>>
>> struct rte_flow_action_decap decap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>>                 .item = &outer_tep_item
>> };
>>
>> struct rte_flow_action_port port_action = { .index = port_id };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_DECAP, .conf = &decap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> };
>>
>> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
>>
>> This action will forward the decapsulated packets to another port of the switch
>> fabric but no information will on the tunnel or the fact that the packet was
>> decapsulated will be passed with it, thereby enable segregation of the
>> infrastructure and
> 
> Again a suggestion without a dedicated TEP API, matching outer and some
> inner as well:
> 
>   attr = ingress;
>   pattern = eth / ipv6 / udp / vxlan vni is 42 / eth / ipv4 / tcp / end;
>   actions = vxlan_decap / port index 3 / end;
>   /* or */
>   actions = vxlan_decap / vf id 5 / end;
> 
> The PORT action should be defined as well as the converse of the existing
> PORT pattern item (matching an arbitrary physical port). Specifying a PORT
> action would steer traffic to a nondefault physical port.
> 
> The VF action is already correctly defined.
> 
>> Egress TEP encapsulation:
>> ~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Encapulsation TEP actions require the flow definitions for the source packet
>> and then the actions to do on that, this example shows a ipv4/tcp packet
>> action.
>>
>> Source Packet
>>
>>      +-----+------+-----+---------+-----+
>>      | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>      +-----+------+-----+---------+-----+
>>
>> struct rte_flow_attr attr = { .egress = 1 };
>>
>> struct rte_flow_item_eth eth_item = { .src = s_addr, .dst = d_addr, .type = ether_type };
>> struct rte_flow_item_ipv4 ipv4_item = { .hdr = { .src_addr = src_addr, .dst_addr = dst_addr } };
>> struct rte_flow_item_udp tcp_item = { .hdr = { .src_port = src_port, .dst_port = dst_port } };
>>
>> struct rte_flow_item pattern[] = {
>>                 { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_item },
>>                 { .type = RTE_FLOW_ITEM_TYPE_END }
>> };
>>
>> /* Flow Actions Definitions */
>>
>> struct rte_flow_action_encap encap_eth = {
>>                 .type = RTE_FLOW_ITEM_TYPE_ETH,
>>                 .item = { .src = s_addr, .dst = d_addr, .type = ether_type }
>> };
>>
>> struct rte_flow_action_encap encap_tep = {
>>                 .type = RTE_FLOW_ITEM_TYPE_TEP,
>>                 .item = { .tep = tep, .id = vni }
>> };
>> struct rte_flow_action_mark port_action = { .index = port_id };
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_PORT, .conf = &port_action },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> }
>> struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
>>
>>
>>        encapsulating Outer Hdr
>>       /                       \                                      outer crc
>>      /                         \                                   /          \
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
>>      | ETH | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC | OUTER CRC |
>>      +-----+------+-----+-------+-----+------+-----+---------+-----+-----------+
> 
> I see three main use cases for egress since we do not want a PMD to parse
> traffic in software to determine if it's candidate for TEP encapsulation:
> 
> 1. Traffic generated/forwarded by an application.
> 2. Same as 1. assuming an application is aware hardware can match egress
>     traffic in addition to encapsulate it.
> 3. Traffic fully processed internally in hardware.
> 
> To handle 1., in my opinion the most common use case, PMDs should rely on an
> application-provided mark pattern item (the converse of the MARK action):
> 
>   attr = egress;
>   pattern = mark is 42 / end;
>   actions = vxlan_encap {many parameters} / end;
> 
> To handle 2, hardware with the ability to recognize and encapsulate outgoing
> traffic is required (applications can rely on rte_flow_validate()):
> 
>   attr = egress;
>   pattern = eth / ipv4 / tcp / end;
>   actions = vxlan_encap {many parameters} / end;
> 
> For 3, a combination of ingress and egress can be used needed on a given
> rule. For clarity, one should assert where traffic comes from and where it's
> supposed to go:
> 
>   attr = ingress egress;
>   pattern = eth / ipv4 / tcp / port id 0 / end;
>   actions = vxlan_encap {many parameters} / vf id 5 / end;
> 
> The {many parameters} for VXLAN_ENCAP obviously remain to be defined,
> they have to either include everything needed to construct L2, L3, L4 and
> VXLAN headers, or separate actions for each layer specified in
> innermost-to-outermost order.
> 
> No need for dedicated mbuf TEP flags.

These all make sense to me if we really want to avoid the TEP API. Just a
point on 3: when using port representors, the ingress port can be implied by
the port on which the rule is created.

> 
>> Chaining multiple modification actions eg IPsec and TEP
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> For example the definition for full hw acceleration for an IPsec ESP/Transport
>> SA encapsulated in a vxlan tunnel would look something like:
>>
>> struct rte_flow_action actions[] = {
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_tep },
>>                 { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec_session },
>>                 { .type = RTE_FLOW_ACTION_TYPE_ENCAP, .conf = &encap_eth },
>>                 { .type = RTE_FLOW_ACTION_TYPE_END }
>> }
>>
>> 1. Source Packet
>>                             +-----+------+-----+---------+-----+
>>                             | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>                             +-----+------+-----+---------+-----+
>>
>> 2. First Action - Tunnel Endpoint Encapsulation
>>
>>        +------+-----+-------+-----+------+-----+---------+-----+
>>        | IPv4 | UDP | VxLAN | ETH | IPv4 | TCP | PAYLOAD | CRC |
>>        +------+-----+-------+-----+------+-----+---------+-----+
>>
>> 3. Second Action - IPsec ESP/Transport Security Processing
>>
>>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>>        | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER |
>>        +------+-----+-----+-------+-----+------+-----+---------+-----+-------------+
>>
>> 4. Third Action - Outer Ethernet Encapsulation
>>
>> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
>> | ETH | IPv4 | ESP |              ENCRYPTED PAYLOAD                 | ESP TRAILER | OUTER CRC |
>> +-----+------+-----+-----+-------+-----+------+-----+---------+-----+-------------+-----------+
>>
>> This example demonstrates the importance of making the interoperation of
>> actions to be ordered, as in the above example, a security
>> action can be defined on both the inner and outer packet by simply placing
>> another security action at the beginning of the action list.
>>
>> It also demonstrates the rationale for not collapsing the Ethernet into
>> the TEP definition as when you have multiple encapsulating actions, all
>> could potentially be the place where the Ethernet header needs to be
>> defined.
> 
> For completeness, here's a suggested alternative with neither dedicated TEP
> nor security APIs:
> 
>   attr = egress;
>   pattern = mark is 42 / end;
>   actions = vxlan_encap {many parameters} / esp_encap {many parameters} / eth_encap {many parameters} / end;
> 
> Note ESP_ENCAP is not so easy given some data must be provided by the
> application with each transmitted packet. The current security API does not
> provide means to perform ESP encapsulation, it instead focuses on encryption
> and relies on the application to prepare headers and allocate room for the
> trailer. It's an unrealistic use case at the moment but shows the potential
> of such an API.
> 
Full IPsec offload is currently being enabled, and it was always developed
with the intention of allowing full encap/decap offload.

> - First question is what's your opinion regarding focusing on rte_flow
>    instead of a TEP API? (Note for counters: one could add COUNT actions as
>    well, what's currently missing is a way to share counters among several
>    flow rules, which is planned as well)
>
Technically I see no issue with either approach being workable, but I think
the flow-based approach has issues in terms of usability and performance. In
my mind, thinking of a TEP as a logical object which flows get mapped into
maps very closely to how they are used functionally in network deployments,
and it is the way I've seen them supported in every TOR switch API/CLI I've
ever used. I also think it should enable a more performant control path when
you don't need to specify all the TEP parameters for every flow; this is not
an inconsiderable overhead. Saying all that, I do see the value in the
cleanness at the API level of using purely rte_flow, although I do wonder
whether that just ends up moving the complexity into the application domain.

> - Regarding dedicated encap/decap actions instead of generic ones, given all
>    protocols have different requirements (e.g. ESP encap is on a whole
>    different level of complexity and likely needs callbacks)?
> 
Agreed on the need for dedicated encap/decap TEP actions.

> - Regarding the reliance on a MARK meta pattern item as a standard means for
>    applications to tag egress traffic so a PMD knows what to do?

I do like that as an approach, but how would it work for combined actions,
e.g. TEP + IPsec SA?

> 
> - I'd like to send a deprecation notice for rte_flow regarding handling of
>    actions (documentation and change in some PMDs to reject currently valid
>    but seldom used flow rules accordingly) instead of a new flow
>    attribute. Would you ack such a change for 18.05?
> 

Apologies, I completely missed the ack-for-18.05 part of the question when I
first read this mail; the answer would have been yes. I was out of office
due to illness for part of that week, which was part of the reason for the
delay in responding to this mail. But if we only restrict the action
ordering requirement to chained modification actions, do we still need the
deprecation notice? It won't break any existing implementations since, as
you note, nobody supports that yet.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
  2018-02-26 17:44   ` Doherty, Declan
@ 2018-03-05 16:23     ` Adrien Mazarguil
  0 siblings, 0 replies; 15+ messages in thread
From: Adrien Mazarguil @ 2018-03-05 16:23 UTC (permalink / raw)
  To: Doherty, Declan
  Cc: dev, Shahaf Shuler, John Daley (johndale),
	Boris Pismenny, Nelio Laranjeiro

On Mon, Feb 26, 2018 at 05:44:01PM +0000, Doherty, Declan wrote:
> On 13/02/2018 5:05 PM, Adrien Mazarguil wrote:
> > Hi,
> > 
> > Apologies for being late to this thread, I've read the ensuing discussion
> > (hope I didn't miss any) and also think rte_flow could be improved in
> > several ways to enable TEP support, in particular regarding the ordering of
> > actions.
> > 
> > On the other hand I'm not sure a dedicated API for TEP is needed at all. I'm
> > not convinced rte_security chose the right path and would like to avoid
> > repeating the same mistakes if possible, more below.
> > 
> > On Thu, Dec 21, 2017 at 10:21:13PM +0000, Doherty, Declan wrote:
> > > This RFC contains a proposal to add a new tunnel endpoint API to DPDK that when used
> > > in conjunction with rte_flow enables the configuration of inline data path encapsulation
> > > and decapsulation of tunnel endpoint network overlays on accelerated IO devices.
> > > 
> > > The proposed new API would provide for the creation, destruction, and
> > > monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs to allow the
> > > acceleration features to be discovered by applications.
<snip>
> > Although I'm not convinced an opaque object is the right approach, if we
> > choose this route I suggest the much simpler:
> > 
> >   struct rte_flow_action_tep_(encap|decap) {
> >       struct rte_tep *tep;
> >       uint32_t flow_id;
> >   };
> > 
> 
> That's a fair point, the only other action that we currently had the
> encap/decap actions supporting was the Ethernet item, and going back to a
> comment from Boris having the Ethernet header separate from the tunnel is
> probably not ideal anyway. As one of our reasons for using an opaque tep
> item was to allow modification of the TEP independently of all the flows
> being carried on it. So for instance if the src or dst MAC needs to be
> modified or the output port needs to changed, the TEP itself could be
> modified.

Makes sense. I think there's now consensus that without a dedicated API, it
can be done through multiple rte_flow groups and "jump" actions targeting
them. Such actions remain to be formally defined though.

In the meantime there is an alternative approach when opaque pattern
items/actions are unavoidable: by using negative values [1].

In addition to an opaque object to use with rte_flow, a PMD could return a
PMD-specific negative value cast as enum rte_flow_{item,action}_type and
usable with the associated port ID only.

An API could even initialize a pattern item or an action object directly:

 struct rte_flow_action tep_action;
 
 if (rte_tep_create(port_id, &tep_action, ...) != 0)
      rte_panic("nooooo!");
 /*
  * tep_action is now initialized with an opaque type and conf pointer, it
  * can be used with rte_flow_create() as part of an action list.
  */
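
Continuing this sketch, the returned action could then be dropped into an
ordinary action list like any other (attr, pattern, queue_conf and port_id
assumed to be defined by the application):

 struct rte_flow_action actions[] = {
     tep_action,   /* PMD-specific negative type and conf set by rte_tep_create() */
     { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue_conf },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };

 struct rte_flow_error err;
 struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);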

[1] http://dpdk.org/doc/guides/prog_guide/rte_flow.html#negative-types

<snip>
> > > struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
> > > 
> > > Once the tep context is created flows can then be directed to that endpoint for
> > > processing. The following sections will outline how the author envisage flow
> > > programming will work and also how TEP acceleration can be combined with other
> > > accelerations.
> > 
> > In order to allow a single TEP context object to be shared by multiple flow
> > rules, a whole new API must be implemented and applications still have to
> > additionally create one rte_flow rule per TEP flow_id to manage. While this
> > probably results in shorter flow rule patterns and action lists, is it
> > really worth it?
> > 
> > While I understand the reasons for this approach, I'd like to push for a
> > rte_flow-only API as much as possible, I'll provide suggestions below.
> > 
> 
> Not only are the rules shorter to implement, it could help to greatly
> reduces the amount of cycles required to add flows, both in terms of the
> application marshaling the data in rte_flow patterns and the PMD parsing
> that those patterns every time a flow is added, in the case where 10k's of
> flow are getting added per second this could add a significant overhead on
> the system.

True, although only if the underlying hardware supports it; some PMDs may
still have to update each flow rule independently in order to expose such an
API. Applications can't be certain an update operation will be quick and
atomic.

<snip>
> > > /** VERY IMPORTANT NOTE **/
> > > One of the core concepts of this proposal is that actions which modify the
> > > packet are defined in the order which they are to be processed. So first decap
> > > outer ethernet header, then the outer TEP headers.
> > > I think this is not only logical from a usability point of view, it should also
> > > simplify the logic required in PMDs to parse the desired actions.
> > 
> > This. I've been thinking about it for a very long time but never got around
> > submit a patch. Handling rte_flow actions in order, allowing repeated
> > identical actions and therefore getting rid of DUP. >
> > The current approach was a bad design decision from my part, I'm convinced
> > it must be redefined before combinations become commonplace (right now no
> > PMD implements any action whose order matters as far as I know).
> > 
> 
> I don't think it was an issue with the original implementation as I don't
> think it really becomes an issue until we start working with packet
> modifications, to that note I think that we only need to limit action
> ordering to actions which modify the packet itself. Actions like counting,
> marking, selecting output, be it port/pf/vf/queue/rss are all independent to
> the actions which modify the packet.

I think a behavior that differs depending on the action type would make the
API more difficult to implement and document. Limiting it now could also
result in the need for breaking it again later.

For instance an application may want to send an unencapsulated packet to
some queue, then encapsulate its copy twice before sending the result to
some VF:

 actions queue index 5 / vxlan_encap / vlan_encap / vf id 42 / end

If the application wanted two encapsulated copies instead:

 actions vxlan_encap / vlan_encap / queue index 5 / vf id 42 / end

Defining actions as always performed left-to-right is simpler and more
versatile in my opinion.
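
As a sketch with hypothetical VXLAN_ENCAP/VLAN_ENCAP actions (their
configuration objects, vxlan_encap_conf and vlan_encap_conf, are assumed
here), the first example would translate to:

 struct rte_flow_action_queue queue = { .index = 5 };
 struct rte_flow_action_vf vf = { .id = 42 };

 struct rte_flow_action actions[] = {
     /* the unencapsulated packet is delivered to queue 5 first... */
     { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
     /* ...then a copy is encapsulated twice and forwarded to VF 42 */
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, .conf = &vxlan_encap_conf },
     { .type = RTE_FLOW_ACTION_TYPE_VLAN_ENCAP, .conf = &vlan_encap_conf },
     { .type = RTE_FLOW_ACTION_TYPE_VF, .conf = &vf },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };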

<snip>
> > I see three main use cases for egress since we do not want a PMD to parse
> > traffic in software to determine if it's candidate for TEP encapsulation:
> > 
> > 1. Traffic generated/forwarded by an application.
> > 2. Same as 1. assuming an application is aware hardware can match egress
> >     traffic in addition to encapsulate it.
> > 3. Traffic fully processed internally in hardware.
> > 
> > To handle 1., in my opinion the most common use case, PMDs should rely on an
> > application-provided mark pattern item (the converse of the MARK action):
> > 
> >   attr = egress;
> >   pattern = mark is 42 / end;
> >   actions = vxlan_encap {many parameters} / end;
> > 
> > To handle 2, hardware with the ability to recognize and encapsulate outgoing
> > traffic is required (applications can rely on rte_flow_validate()):
> > 
> >   attr = egress;
> >   pattern = eth / ipv4 / tcp / end;
> >   actions = vxlan_encap {many parameters} / end;
> > 
> > For 3, a combination of ingress and egress can be used needed on a given
> > rule. For clarity, one should assert where traffic comes from and where it's
> > supposed to go:
> > 
> >   attr = ingress egress;
> >   pattern = eth / ipv4 / tcp / port id 0 / end;
> >   actions = vxlan_encap {many parameters} / vf id 5 / end;

I take "ingress" back from this example, it doesn't make sense in its
context.

> > The {many parameters} for VXLAN_ENCAP obviously remain to be defined,
> > they have to either include everything needed to construct L2, L3, L4 and
> > VXLAN headers, or separate actions for each layer specified in
> > innermost-to-outermost order.
> > 
> > No need for dedicated mbuf TEP flags.
> 
> These all look make sense to me, if we really want to avoid the TEP API,
> just a point on 3, if using port representors then the ingress port can be
> implied by the rule on which the tunnel is created on.

By the way, I will soon submit yet another RFC on this topic. It describes
how one could configure device switching through rte_flow and its
interaction with representor ports (assuming they exist) as a follow-up to
all the recent discussion.

<snip>
> > - First question is what's your opinion regarding focusing on rte_flow
> >    instead of a TEP API? (Note for counters: one could add COUNT actions as
> >    well, what's currently missing is a way to share counters among several
> >    flow rules, which is planned as well)
> > 
> Technically I see no issue with both approaches being workable, but I think
> the flow based approach has issues in terms of usability and performance. In
> my mind, thinking of a TEP as a logical object which flows get mapped into
> maps very closely to the how they are used functionally in networks
> deployments, and is the way I've seen them supported in ever TOR switch
> API/CLI I've ever used. I also think it add should enable a more preformant
> control path when you don't need to specify all the TEP parameters for every
> flow, this is not an inconsiderable overhead. I saying all that I do see the
> value in the cleanness at an API level of using purely rte_flow, although I
> do wonder will that just end up moving that into the application domain.

I see, well that's a valid use case. If TEP is really supported as a kind of
action target and not as an opaque collection of multiple flow rules, it
makes sense to expose a dedicated action for it.

As previously described, I would suggest using a negative type generated by
an experimental API while work is being performed on rte_flow to add simple
low-level encaps (VLAN, VXLAN, etc.), support for action ordering, and the
more-or-less related switching configuration.

We'll then determine if an opaque TEP API still makes sense and if an
official rte_flow action type should be assigned.

> > - Regarding dedicated encap/decap actions instead of generic ones, given all
> >    protocols have different requirements (e.g. ESP encap is on a whole
> >    different level of complexity and likely needs callbacks)?
> > 
> Agreed on the need for dedicated encap/decap TEP actions.
> 
> > - Regarding the reliance on a MARK meta pattern item as a standard means for
> >    applications to tag egress traffic so a PMD knows what to do?
> 
> I do like that it as an approach but how would it work for combined actions,
> TEP + IPsec SA

A given MARK ID would correspond to a given list of actions that would
include both TEP + IPsec SA in whichever order was requested, not to a
specific action.
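
As a sketch, keeping the same hypothetical MARK pattern item and VXLAN_ENCAP
action as earlier, the action list associated with mark 42 could then combine
both in the requested order (sec_session being an rte_security session
created beforehand):

 struct rte_flow_action actions[] = {
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, .conf = &vxlan_encap_conf },
     { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = sec_session },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };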

> > - I'd like to send a deprecation notice for rte_flow regarding handling of
> >    actions (documentation and change in some PMDs to reject currently valid
> >    but seldom used flow rules accordingly) instead of a new flow
> >    attribute. Would you ack such a change for 18.05?
> > 
> 
> Apologies, I complete missed the ack for 18.05 part of the question when I
> read it first this mail, the answer would have been yes, I was out of office
> due to illness for part of that week, which was part of the reason for the
> delay in response to this mail. But I think if we only restrict the action
> ordering requirement to chained modification actions do we still need the
> deprecation notice, as it won't break any existing implementations, as as
> you note there isn't anyone supporting that yet?

Not in DPDK itself AFAIK, but you never know; the change of behavior may
result in previously unseen bugs in applications.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-03-05 16:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-21 22:21 [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement Doherty, Declan
2017-12-24 17:30 ` Shahaf Shuler
2018-01-09 17:30   ` Doherty, Declan
2018-01-11 21:45     ` John Daley (johndale)
2018-01-16  8:22       ` Shahaf Shuler
2018-01-23 15:35         ` Doherty, Declan
2018-02-01 19:59           ` Shahaf Shuler
2018-01-23 14:46       ` Doherty, Declan
     [not found] ` <3560e76a-c99b-4dc3-9678-d7975acf67c9@mellanox.com>
2018-01-02 10:50   ` Boris Pismenny
2018-01-10 16:04   ` Doherty, Declan
2018-01-11 21:44 ` John Daley (johndale)
2018-01-23 15:50   ` Doherty, Declan
2018-02-13 17:05 ` Adrien Mazarguil
2018-02-26 17:44   ` Doherty, Declan
2018-03-05 16:23     ` Adrien Mazarguil
