DPDK patches and discussions
From: Nitin Saxena <nsaxena@marvell.com>
To: Robin Jarry <rjarry@redhat.com>, "dev@dpdk.org" <dev@dpdk.org>,
	"grout@dpdk.org" <grout@dpdk.org>
Cc: Jerin Jacob <jerinj@marvell.com>,
	David Marchand <david.marchand@redhat.com>,
	Christophe Fontaine <cfontain@redhat.com>,
	Nitin Saxena <nsaxena16@gmail.com>
Subject: RE: [EXTERNAL] graph: make the in-built nodes better reusable
Date: Tue, 22 Oct 2024 03:52:11 +0000	[thread overview]
Message-ID: <SJ0PR18MB511145E94903B2AC2F31E930B64C2@SJ0PR18MB5111.namprd18.prod.outlook.com> (raw)
In-Reply-To: <D51HYX7S2ORA.3DY06SGFOU94N@redhat.com>

Hi Robin,

Thanks for initiating this thread. I agree with the majority of the points you highlighted.
We (Marvell) have also started integrating *rte_graph* into various use cases and found a need for the following items:

1. Hot-plugging of ports at runtime (for example, a link status change from DOWN -> UP and vice versa, or addition of a new port at runtime)
Instead of creating a source node (ethdev-rx-<port>-<queue>) for each port/queue per worker, we can have a single "ethdev-rx" node for every worker. Each instance of "ethdev-rx" would poll the list of all (port, queue) pairs mapped to it by the control plane at runtime. With RCU primitives, the control plane can update the {port, queue} list of each "ethdev-rx" instance at runtime. This would enable the hot-plugging feature and should avoid the following issues that you highlighted:
- Recreation of the graph instance due to port hot-plugging
- The statistics issue where obscure variants of "ethdev-rx-X" show up after a port is deleted in the control plane
- With a single node instance, the additional cloning per (port, queue) per worker that exists today can be avoided
- Any naming concerns regarding "ethdev-rx-<port>-<queue>", since each worker would have a single "ethdev-rx" instance

2. Node mbuf dynamic field (node_mbuf_priv1) refactor
It would be better to have the following fields in the node_mbuf_priv1 structure:
- uint16_t rx_interface for any per-interface incoming decision. This also covers virtual interfaces, such as IP-IP interfaces
- uint16_t tx_interface for any per-interface outgoing decision. This also applies to virtual interfaces
- Fields related to feature arc, if it gets accepted

3. Runtime enable/disable of sub-graphs per interface
- The feature arc patch series [0] solves this issue.

4. Reusability of in-built nodes by out-of-tree applications
- The feature arc patch series [0] solves this issue.

5. Hard-coded static next_nodes
- The feature arc patch series solves this by adding static (but not hard-coded) edges. Feature arc manages next_nodes across the various sub-graphs.

6. IP route lookup
a) Changing the IP lookup implementation from the current LPM to the RCU-based rte_fib, for both performance and functional reasons
b) Any control plane route update should also update the fast-path FIB:
- When the user updates the MAC address of a local port, the rewrite data residing in the next_hop structure must change
- The rewrite data should also change when a connected next hop changes its MAC
- When any port's link status goes down, the FIB entry should be removed from the fast path
c) Polling for partial routes in the control plane. Partial (or incomplete) routes are those whose ARP has not yet been resolved. Only complete routes must be added to the FIB
d) A notion of sw_tx_port instead of ethdev_tx_port in ipX-lookup, to enable tunnel protocols like IP-IP
e) Adding a punt feature to send packets to the Linux control plane
f) VRF and multi-path routing (nice-to-have features)

7. Optimized UDP integration with L7 protocols

8. Nice-to-have feature: a minimalistic control plane in app/graph with a single epoll fd interface for handling external asynchronous events:
- CLI socket
- libevent socket, as in grout
- Any other socket polling, such as Netlink or an application socket

9. Linux control plane integration with standard daemons like strongSwan, FRR, etc.

10. VLAN nodes (not an immediate item for us, but on the roadmap)

Also see inline comments below.

[0]: https://patches.dpdk.org/project/dpdk/cover/20241014143401.3135897-1-nsaxena@marvell.com/

> -----Original Message-----
> From: Robin Jarry <rjarry@redhat.com>
> Sent: Monday, October 21, 2024 6:23 PM
> To: dev@dpdk.org; grout@dpdk.org
> Cc: Jerin Jacob <jerinj@marvell.com>; Nitin Saxena
> <nitin.saxena@caviumnetworks.com>; David Marchand
> <david.marchand@redhat.com>; Christophe Fontaine
> <cfontain@redhat.com>
> Subject: [EXTERNAL] graph: make the in-built nodes better reusable
>
> Hi all,
>
> I am starting this discussion to see what can be done in order to make
> the in-built nodes (i.e. https://git.dpdk.org/dpdk/tree/lib/node) easier
> to reuse in external applications.
>
> So far here are the limitations I have found when working on grout. Some of
> these limitations are trivial, some others are more tricky. I hope we can get
> clean solutions.
>
> ethdev_rx and ethdev_tx require cloning
> ---------------------------------------
>
> These nodes have been written to receive from or transmit to a single queue.
> When changing the number of ports and/or rx/tx queues, the graph needs to
> be recreated.
>
> * Node names must all be unique (hence, node clones need to have
>   different names than their original).
>
>   => There is a routine that automatically adds a "unique" suffix to the
>   cloned ethdev_rx and ethdev_tx names. The "ethdev_rx-<port>-<queue>"
>   and "ethdev_rx-<port>" name patterns are used.
>
>   => it is not possible to prepare the new nodes in advance without
>   destroying the active graph. For example, if one port+queue isn't
>   changed, the "ethdev_rx-<port>-<queue>" name will already exist and be
>   active in the graph. Reusing the same name could lead to data races.

Answered above

>
> * Node context data cannot be passed during rte_graph_create() or
>   rte_graph_clone().
>
>   Instead, each node init() callback must determine its own context data
>   based on the graph and node pointers it has. In most cases, it is
>   trivial, but not for nodes that have multiple copies per graph.

Node context data is unique per node. What would be the use case where you would like to pass data from rte_graph_create()/clone() to each node's ctx?

>
> * Once created/cloned, nodes cannot be destroyed.
>
>   => Removing ports and/or reducing the number queues, results in node
>   clones remaining. It isn't the end of the world but it would be nice
>   to allow destroying them for clarity.

Agreed.

>
> * Graph statistics are per-node.
>
>   => When cloning nodes, we end up with obscure node names such as
>   "ethdev_rx-4-7" in graph statistics. It would be clearer if the clones
>   would be collapsed in the statistics. Having clones is an
>   implementation detail which shouldn't reflect in the results.

Answered above
>
>   Same with the DOT graph dump, it makes the graph images bloated and
>   also gives a different image per worker. It would be clearer if the
>   original node name was used only.
>
> ip* nodes assume the mbuf data offset is at 0
> ---------------------------------------------
>
> L3 and L4 nodes assume that the mbuf they process have their data offsets
> pointing to an ethernet header.
>
> This prevents implementing IP tunnels or control to data plane
> communication where the data pointer may need to be at the end of the
> L3 header, for example.
>
> If we change that to adjust the data pointer to the correct OSI layer, it would
> also mandate that each individual node only deals with a single OSI layer.
>
> This means that the current ip*_rewrite nodes would need to be split in
> two: ip*_output and eth_output. This may have big implications on the
> optimizations that were made in these nodes.

Agreed. In-built nodes should obey layer data offsets. I haven't given it much thought, but I was thinking of using the rte_pktmbuf_xxx() APIs which already exist.
Performance impact is another issue we have to take care of.

>
> No explicit API to pass application specific data around
> --------------------------------------------------------
>
> This one is more a documentation issue. It would help if there was a clear
> description of how the in-built nodes work together and what kind of mbuf
> private data they require in order to function properly.

The node_mbuf_priv1 dynamic structure must hold only those fields which are per packet. If data is not unique per packet, global variables can be used instead.
If it is a matter of documentation, we should add it.

>
> Next nodes are hard coded
> -------------------------
>
> All next-nodes are set in the in-built nodes. This prevents from reusing only a
> subset of them.

Feature arc solves this. There is no restriction in the rte_node_edge_update() API, which feature arc exploits; i.e., the next_nodes array can grow beyond the "nb_edges" provided in "struct rte_node_register".

>
> No support for logical interfaces
> ---------------------------------
>
> All interfaces are supposed to be DPDK ports (e.g. IP next hops contain
> destination Ethernet addresses and DPDK port IDs). This prevents support of
> logical interfaces such as IP tunnels.

Agreed.

>
> No support for multiple VRF
> ---------------------------
>
> There is a single lpm/lpm6 instance for all ports. This is sort of linked to the
> previous limitation about not having support for logical interfaces. Ideally, the
> lpm/lpm6 instance should be determined from the VRF identifier of the input
> interface (nb: NOT the same thing as receiving DPDK port).

Agreed. The VRF table is a function of the receiving interface, which includes not just ethdevs but also virtual interfaces like IP-IP.

>
> Cheers!
>
> --
> Robin

