DPDK patches and discussions
* graph: make the in-built nodes better reusable
From: Robin Jarry @ 2024-10-21 12:53 UTC
  To: dev, grout; +Cc: Jerin Jacob, Nitin Saxena, David Marchand, Christophe Fontaine

Hi all,

I am starting this discussion to see what can be done in order to make 
the in-built nodes (i.e. https://git.dpdk.org/dpdk/tree/lib/node) easier 
to reuse in external applications.

So far here are the limitations I have found when working on grout. Some 
of these limitations are trivial, some others are more tricky. I hope we 
can get clean solutions.

ethdev_rx and ethdev_tx require cloning
---------------------------------------

These nodes have been written to receive from or transmit to a single
queue. When changing the number of ports and/or rx/tx queues, the graph
needs to be recreated.

* Node names must all be unique (hence, node clones need to have 
  different names than their original).

  => There is a routine that automatically adds a "unique" suffix to the
  cloned ethdev_rx and ethdev_tx names. The "ethdev_rx-<port>-<queue>"
  and "ethdev_tx-<port>" name patterns are used.

  => It is not possible to prepare the new nodes in advance without
  destroying the active graph. For example, if one port+queue isn't
  changed, the "ethdev_rx-<port>-<queue>" name will already exist and be
  active in the graph. Reusing the same name could lead to data races.

* Node context data cannot be passed during rte_graph_create() or 
  rte_graph_clone().

  Instead, each node init() callback must determine its own context data 
  based on the graph and node pointers it has. In most cases, it is 
  trivial, but not for nodes that have multiple copies per graph.

* Once created/cloned, nodes cannot be destroyed.

  => Removing ports and/or reducing the number of queues results in node
  clones remaining. It isn't the end of the world, but it would be nice
  to allow destroying them for clarity.

* Graph statistics are per-node.

  => When cloning nodes, we end up with obscure node names such as
  "ethdev_rx-4-7" in graph statistics. It would be clearer if the clones
  were collapsed in the statistics. Having clones is an implementation
  detail which shouldn't be reflected in the results.

  Same with the DOT graph dump: it makes the graph images bloated and
  also gives a different image per worker. It would be clearer if only
  the original node name was used.
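
As a minimal illustration of the naming scheme above (a sketch only, it
assumes rte_node_clone() builds the clone name as "<parent>-<suffix>",
which matches the patterns described in this section):

    #include <stdio.h>
    #include <stdint.h>
    #include <rte_graph.h>

    /* Hypothetical helper: clone the in-built ethdev_rx node for one
     * (port, queue) pair. Cloning "ethdev_rx" with suffix "3-0" yields
     * a node named "ethdev_rx-3-0". */
    static rte_node_t
    clone_rx_node(uint16_t port_id, uint16_t queue_id)
    {
        char suffix[RTE_NODE_NAMESIZE];
        rte_node_t parent = rte_node_from_name("ethdev_rx");

        snprintf(suffix, sizeof(suffix), "%u-%u", port_id, queue_id);
        /* If a previous graph already holds "ethdev_rx-3-0", this clone
         * cannot be prepared in advance: the name is taken and the
         * existing node may still be polled by a worker. */
        return rte_node_clone(parent, suffix);
    }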

ip* nodes assume the mbuf data offset is at 0
---------------------------------------------

L3 and L4 nodes assume that the mbufs they process have their data
offsets pointing to an Ethernet header.

This prevents implementing IP tunnels or control-to-data-plane
communication, where the data pointer may need to be at the end of the
L3 header, for example.

If we change that to adjust the data pointer to the correct OSI layer, 
it would also mandate that each individual node only deals with a single 
OSI layer.

This means that the current ip*_rewrite nodes would need to be split in 
two: ip*_output and eth_output. This may have big implications on the 
optimizations that were made in these nodes.
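
To make the idea more concrete, here is a rough sketch (not the actual
node code) of the kind of data offset adjustment an L2/L3 split would
imply, using the existing mbuf helpers:

    #include <rte_ether.h>
    #include <rte_mbuf.h>

    /* Illustrative only: strip the L2 header before handing the mbuf to
     * a node that expects its data offset at the start of the IP header.
     * A real implementation would account for VLAN tags, etc. */
    static inline void
    strip_l2(struct rte_mbuf *m)
    {
        rte_pktmbuf_adj(m, sizeof(struct rte_ether_hdr));
    }

    /* The reverse operation, as an eth_output-style node would need. */
    static inline struct rte_ether_hdr *
    prepend_l2(struct rte_mbuf *m)
    {
        return (struct rte_ether_hdr *)
            rte_pktmbuf_prepend(m, sizeof(struct rte_ether_hdr));
    }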

No explicit API to pass application specific data around
--------------------------------------------------------

This one is more of a documentation issue. It would help if there was
a clear description of how the in-built nodes work together and what
kind of mbuf private data they require in order to function properly.

Next nodes are hard coded
-------------------------

All next-nodes are set in the in-built nodes. This prevents reusing
only a subset of them.

No support for logical interfaces
---------------------------------

All interfaces are supposed to be DPDK ports (e.g. IP next hops contain 
destination Ethernet addresses and DPDK port IDs). This prevents support 
of logical interfaces such as IP tunnels.

No support for multiple VRF
---------------------------

There is a single lpm/lpm6 instance for all ports. This is sort of
linked to the previous limitation about not having support for logical
interfaces. Ideally, the lpm/lpm6 instance should be determined from the
VRF identifier of the input interface (nb: NOT the same thing as the
receiving DPDK port).
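
A hedged sketch of what this could look like (the vrf_lpm[] table and
the way the VRF id is obtained from the input interface are hypothetical,
not existing lib/node code):

    #include <rte_lpm.h>

    #define MAX_VRF 256

    /* Hypothetical per-VRF IPv4 tables, indexed by the VRF id of the
     * input interface (which is not necessarily the receiving port). */
    static struct rte_lpm *vrf_lpm[MAX_VRF];

    static inline int
    route_lookup(uint16_t vrf_id, uint32_t dst_ip, uint32_t *next_hop)
    {
        struct rte_lpm *lpm = vrf_lpm[vrf_id];

        if (lpm == NULL)
            return -1; /* VRF not provisioned */
        /* rte_lpm_lookup() returns 0 on hit, negative on miss. */
        return rte_lpm_lookup(lpm, dst_ip, next_hop);
    }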

Cheers!

-- 
Robin



* RE: [EXTERNAL] graph: make the in-built nodes better reusable
From: Nitin Saxena @ 2024-10-22  3:52 UTC
  To: Robin Jarry, dev, grout
  Cc: Jerin Jacob, David Marchand, Christophe Fontaine, Nitin Saxena

Hi Robin,

Thanks for initiating this thread. I agree with the majority of the points you highlighted.
We (Marvell) have also started integrating *rte_graph* in various use cases and found a need for the following items:

1. Hot-plugging of ports at runtime (e.g. link status changing from DOWN to UP and vice versa, or addition of a new port at runtime)
Instead of creating a source node (ethdev-rx-<port>-<queue>) for each port/queue per worker, we can have a single "ethdev-rx" node for every worker. Each instance of "ethdev-rx" would poll the list of all (port, queue) pairs mapped to it by the control plane at runtime. With RCU primitives, the control plane can update the list of (port, queue) pairs for each "ethdev-rx" instance at runtime. This would enable the hot-plugging feature and should avoid the following issues that you highlighted (a rough sketch follows this list):
- Recreation of the graph instance due to port hot-plugging
- Statistics issue where obscure "ethdev-rx-X" names show up due to port deletion in the control plane
- With a single instance of the node, the additional cloning per (port, queue) per worker that exists today can be avoided
- Any naming concerns regarding "ethdev-rx-<port>-<queue>", since each worker would have a single "ethdev-rx" instance
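
To make this concrete, here is a rough sketch of such a per-worker
"ethdev-rx" node (the rxq_list structure and the way it is stored in the
node context are made up for illustration; RCU protection of the list is
omitted):

    #include <rte_ethdev.h>
    #include <rte_graph_worker.h>
    #include <rte_mbuf.h>

    /* Hypothetical per-worker list of (port, queue) pairs, maintained by
     * the control plane at runtime. */
    struct rxq_binding {
        uint16_t port_id;
        uint16_t queue_id;
    };

    struct rxq_list {
        uint16_t count;
        struct rxq_binding rxq[RTE_MAX_ETHPORTS];
    };

    /* Illustrative process() callback for a single "ethdev-rx" instance:
     * poll every (port, queue) currently mapped to this worker instead
     * of cloning one node per pair. */
    static uint16_t
    ethdev_rx_poll_all(struct rte_graph *graph, struct rte_node *node,
                       void **objs, uint16_t cnt)
    {
        /* pointer to the worker's list, stored in node->ctx at init */
        struct rxq_list *list = *(struct rxq_list **)node->ctx;
        uint16_t i, total = 0;

        RTE_SET_USED(objs);
        RTE_SET_USED(cnt);
        for (i = 0; i < list->count; i++) {
            struct rte_mbuf **mbufs = (struct rte_mbuf **)node->objs;
            uint16_t n = rte_eth_rx_burst(list->rxq[i].port_id,
                                          list->rxq[i].queue_id,
                                          mbufs, RTE_GRAPH_BURST_SIZE);
            if (n == 0)
                continue;
            node->idx = n;
            /* hand packets to the first next node (e.g. a classifier) */
            rte_node_next_stream_move(graph, node, 0);
            total += n;
        }
        return total;
    }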

2. Node mbuf dynamic field (node_mbuf_priv1) refactor
It would be better to have the following fields in the node_mbuf_priv1 structure (a possible layout is sketched after this list):
- uint16_t rx_interface for any per-interface incoming decision. This also covers virtual interfaces, like an IPIP interface
- uint16_t tx_interface for any per-interface outgoing decision. Applies to virtual interfaces as well
- Fields related to feature_arc, if it gets accepted
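
For illustration, a possible shape for such a refactor (field names are
hypothetical; note that the node_mbuf_priv1 dynamic field is 8 bytes
today, so growing it implies registering a larger dynamic field):

    #include <stdint.h>

    /* Sketch only: extend the per-packet node metadata with software
     * interface ids that also cover virtual interfaces (e.g. IP-IP). */
    struct node_mbuf_priv1_proposed {
        union {
            struct {            /* existing rewrite metadata */
                uint16_t nh;
                uint16_t ttl;
                uint32_t cksum;
            };
            uint64_t u;
        };
        uint16_t rx_interface; /* ingress sw interface, physical or virtual */
        uint16_t tx_interface; /* egress sw interface, physical or virtual */
        /* feature_arc related fields would go here if the series is merged */
    };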

3. Runtime enable/disable of sub-graphs per interface
- The feature arc patch series[0] solves this issue.

4. Reusability of in-built nodes by out-of-tree applications
- The feature arc patch series[0] solves this issue.

5. Hard-coded static next_nodes
- The feature arc patch series solves this by adding static edges (but not hard-coded ones). Feature arc manages next_nodes across the various sub-graphs.

6. IP route lookup
a) Changing the IP lookup implementation from the current LPM to an RCU-based rte_fib, for performance and functional reasons (a minimal sketch follows this list)
b) Any control plane route update should also update the fast path FIB
- When the user updates the MAC address of a local port, the rewrite data that resides in the next_hop structure must change
- The rewrite data should also change when a connected next hop changes its MAC
- When any port's link status goes down, the FIB entry should be removed from the fast path
c) Polling for partial routes in the control plane. Partial (or incomplete) routes are those whose ARP has not yet been resolved. Only complete routes must be added to the FIB
d) A notion of sw_tx_port instead of ethdev_tx_port in ipX-lookup, to enable tunnel protocols like IP-IP
e) Adding a punt feature to send packets to the Linux control plane
f) VRF and multi-path routing (nice-to-have features)
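
As a minimal sketch of (a), creating and using an rte_fib based IPv4
table (the sizes and the drop next hop id are arbitrary; RCU integration
and the control plane update hooks are left out):

    #include <rte_fib.h>

    #define DROP_NH_ID 0 /* hypothetical "no route" next hop id */

    static struct rte_fib *
    ip4_fib_create(int socket_id)
    {
        struct rte_fib_conf conf = {
            .type = RTE_FIB_DIR24_8,
            .default_nh = DROP_NH_ID,
            .max_routes = 1 << 16,
            .dir24_8 = {
                .nh_sz = RTE_FIB_DIR24_8_4B,
                .num_tbl8 = 1 << 12,
            },
        };
        return rte_fib_create("ip4-fib-vrf0", socket_id, &conf);
    }

    /* Control plane: only complete routes (ARP resolved) are pushed
     * here; partial routes stay out of the fast path. */
    static int
    ip4_route_add(struct rte_fib *fib, uint32_t ip, uint8_t depth,
                  uint64_t nh_id)
    {
        return rte_fib_add(fib, ip, depth, nh_id);
    }

    /* Fast path: bulk lookup, e.g. from an ip4_lookup-style node. */
    static int
    ip4_route_lookup(struct rte_fib *fib, uint32_t *ips, uint64_t *nh,
                     int n)
    {
        return rte_fib_lookup_bulk(fib, ips, nh, n);
    }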

7. Optimized UDP integration with L7 protocols

8. Nice-to-have feature: a minimalistic control plane in app/graph with a single epoll fd interface for handling external asynchronous events (a bare-bones loop is sketched after this list):
- CLI socket
- libevent socket, like in grout
- Any other socket polling, like Netlink or an application socket
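
A bare-bones sketch of such a loop (the event_source structure and the
handlers are made up; socket setup and epoll_ctl() registration are
omitted):

    #include <stdio.h>
    #include <sys/epoll.h>

    /* Hypothetical per-fd descriptor: one handler per registered socket
     * (CLI, Netlink, application socket, ...). */
    struct event_source {
        int fd;
        void (*handler)(struct event_source *src);
    };

    /* Single epoll fd driving the whole control plane: each source is
     * registered with epoll_ctl(EPOLL_CTL_ADD) and its event_source
     * pointer stored in the event's data.ptr. */
    static int
    control_plane_loop(int epfd)
    {
        struct epoll_event events[64];

        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            if (n < 0) {
                perror("epoll_wait");
                return -1;
            }
            for (int i = 0; i < n; i++) {
                struct event_source *src = events[i].data.ptr;
                src->handler(src);
            }
        }
    }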

9. Linux control plane integration with standard daemons like strongSwan, FRR, etc.

10. VLAN nodes (not an immediate item for us but on the roadmap)

Also see inline comments below.

[0]: https://patches.dpdk.org/project/dpdk/cover/20241014143401.3135897-1-nsaxena@marvell.com/

> -----Original Message-----
> From: Robin Jarry <rjarry@redhat.com>
> Sent: Monday, October 21, 2024 6:23 PM
> To: dev@dpdk.org; grout@dpdk.org
> Cc: Jerin Jacob <jerinj@marvell.com>; Nitin Saxena
> <nitin.saxena@caviumnetworks.com>; David Marchand
> <david.marchand@redhat.com>; Christophe Fontaine
> <cfontain@redhat.com>
> Subject: [EXTERNAL] graph: make the in-built nodes better reusable
>
> Hi all,
>
> I am starting this discussion to see what can be done in order to make
> the in-built nodes (i.e. https://git.dpdk.org/dpdk/tree/lib/node) easier
> to reuse in external applications.
>
> So far here are the limitations I have found when working on grout. Some of
> these limitations are trivial, some others are more tricky. I hope we can get
> clean solutions.
>
> ethdev_rx and ethdev_tx require cloning
> ---------------------------------------
>
> These nodes have been written to receive from or transmit to a single
> queue. When changing the number of ports and/or rx/tx queues, the graph
> needs to be recreated.
>
> * Node names must all be unique (hence, node clones need to have
>   different names than their original).
>
>   => There is a routine that automatically adds a "unique" suffix to the
>   cloned ethdev_rx and ethdev_tx names. The "ethdev_rx-<port>-<queue>"
>   and "ethdev_tx-<port>" name patterns are used.
>
>   => It is not possible to prepare the new nodes in advance without
>   destroying the active graph. For example, if one port+queue isn't
>   changed, the "ethdev_rx-<port>-<queue>" name will already exist and be
>   active in the graph. Reusing the same name could lead to data races.

Answered above

>
> * Node context data cannot be passed during rte_graph_create() or
>   rte_graph_clone().
>
>   Instead, each node init() callback must determine its own context data
>   based on the graph and node pointers it has. In most cases, it is
>   trivial, but not for nodes that have multiple copies per graph.

Node context data means it is unique per node. What would be the use case where you would like to pass data from rte_graph_create()/clone() to each node's ctx?

>
> * Once created/cloned, nodes cannot be destroyed.
>
>   => Removing ports and/or reducing the number of queues results in node
>   clones remaining. It isn't the end of the world, but it would be nice
>   to allow destroying them for clarity.

Agreed.

>
> * Graph statistics are per-node.
>
>   => When cloning nodes, we end up with obscure node names such as
>   "ethdev_rx-4-7" in graph statistics. It would be clearer if the clones
>   were collapsed in the statistics. Having clones is an implementation
>   detail which shouldn't be reflected in the results.

Answered above
>
>   Same with the DOT graph dump: it makes the graph images bloated and
>   also gives a different image per worker. It would be clearer if only
>   the original node name was used.
>
> ip* nodes assume the mbuf data offset is at 0
> ---------------------------------------------
>
> L3 and L4 nodes assume that the mbufs they process have their data offsets
> pointing to an Ethernet header.
>
> This prevents implementing IP tunnels or control-to-data-plane
> communication, where the data pointer may need to be at the end of the
> L3 header, for example.
>
> If we change that to adjust the data pointer to the correct OSI layer, it would
> also mandate that each individual node only deals with a single OSI layer.
>
> This means that the current ip*_rewrite nodes would need to be split in
> two: ip*_output and eth_output. This may have big implications on the
> optimizations that were made in these nodes.

Agreed. In-built nodes should obey layer data offsets. I haven't given it much thought, but I was thinking of using the rte_pktmbuf_xxx() APIs which already exist.
Performance impact is another issue which we have to care about.

>
> No explicit API to pass application specific data around
> --------------------------------------------------------
>
> This one is more of a documentation issue. It would help if there was
> a clear description of how the in-built nodes work together and what kind
> of mbuf private data they require in order to function properly.

The node_mbuf_priv1 dynamic structure must hold those fields which are per packet. If the data is not unique per packet, global variables can be used.
If it is a matter of documentation, we should add it.

>
> Next nodes are hard coded
> -------------------------
>
> All next-nodes are set in the in-built nodes. This prevents reusing only a
> subset of them.

Feature arc solves this. There is no restriction in the rte_node_edge_update() API, which feature arc exploits, i.e. the next_nodes array can grow beyond the "nb_edges" provided in "struct rte_node_register".
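
For reference, a minimal sketch of appending an extra next node at
runtime with that API (the node names are made up; this would be done
before rte_graph_create()):

    #include <rte_graph.h>

    /* Illustrative: register a hypothetical application node as an
     * extra next node of the in-built ip4_lookup node. */
    static rte_edge_t
    attach_feature_node(void)
    {
        static const char *next_nodes[] = { "my_feature_node" };
        rte_node_t id = rte_node_from_name("ip4_lookup");
        /* append after the currently registered edges */
        rte_edge_t from = rte_node_edge_count(id);

        /* grows next_nodes beyond the nb_edges given at registration */
        return rte_node_edge_update(id, from, next_nodes, 1);
    }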

>
> No support for logical interfaces
> ---------------------------------
>
> All interfaces are supposed to be DPDK ports (e.g. IP next hops contain
> destination Ethernet addresses and DPDK port IDs). This prevents support of
> logical interfaces such as IP tunnels.

Agree

>
> No support for multiple VRF
> ---------------------------
>
> There is a single lpm/lpm6 instance for all ports. This is sort of linked
> to the previous limitation about not having support for logical interfaces.
> Ideally, the lpm/lpm6 instance should be determined from the VRF identifier
> of the input interface (nb: NOT the same thing as the receiving DPDK port).

Agree. The VRF table is a function of the received interface, which is not just an ethdev but can also be a virtual interface like IP-IP.

>
> Cheers!
>
> --
> Robin

