To: jerinj@marvell.com, dev@dpdk.org
Cc: pkapoor@marvell.com, ndabilpuram@marvell.com, kirankumark@marvell.com,
 pbhagavatula@marvell.com, pathreya@marvell.com, nsaxena@marvell.com,
 sshankarnara@marvell.com, honnappa.nagarahalli@arm.com, thomas@monjalon.net,
 david.marchand@redhat.com, ferruh.yigit@intel.com, arybchenko@solarflare.com,
 ajit.khaparde@broadcom.com, xiaolong.ye@intel.com, rasland@mellanox.com,
 maxime.coquelin@redhat.com, akhil.goyal@nxp.com, cristian.dumitrescu@intel.com,
 john.mcnamara@intel.com, bruce.richardson@intel.com, anatoly.burakov@intel.com,
 gavin.hu@arm.com, drc@linux.vnet.ibm.com, konstantin.ananyev@intel.com,
 pallavi.kadam@intel.com, olivier.matz@6wind.com, gage.eads@intel.com,
 nikhil.rao@intel.com, erik.g.carrillo@intel.com, hemant.agrawal@nxp.com,
 artem.andreev@oktetlabs.ru, sthemmin@microsoft.com, shahafs@mellanox.com,
 keith.wiles@intel.com, mattias.ronnblom@ericsson.com, jasvinder.singh@intel.com,
 vladimir.medvedkin@intel.com, techboard@dpdk.org
References: <20200131170201.3236153-1-jerinj@marvell.com>
From: Ray Kinsella <mdr@ashroe.eu>
Message-ID: <76e12ab0-c467-986f-dac3-6e896f3fb193@ashroe.eu>
Date: Fri, 31 Jan 2020 18:34:38 +0000
In-Reply-To: <20200131170201.3236153-1-jerinj@marvell.com>
Subject: Re: [dpdk-dev] [RFC PATCH 0/5] graph: introduce graph subsystem

Hi Jerin,

Much kudos on a huge contribution to the community. I look forward to
spending more time looking at it in the next few days.

I'll bite and ask the obvious question - why would I use rte_graph over
FD.io VPP?

Ray K

On 31/01/2020 17:01, jerinj@marvell.com wrote:
> From: Jerin Jacob
>
> This RFC is targeted for the v20.05 release.
>
> This RFC patch includes an implementation of a graph architecture for
> packet processing using DPDK primitives.
>
> Using graph traversal for packet processing is a proven architecture
> that has been implemented in various open source libraries.
>
> A graph architecture for packet processing abstracts the data processing
> functions as “nodes” and “links” them together to create a complex
> “graph” of reusable/modular data processing functions.
>
> The RFC patch further includes performance enhancements and modularity
> for DPDK, as discussed in more detail below.
>
> What this RFC patch contains:
> -----------------------------
> 1) The API definition to "create" nodes and "link" them together to
> create a "graph" for packet processing. See lib/librte_graph/rte_graph.h
>
> 2) The fast path API definition for the graph walker and the enqueue
> function used by the workers. See lib/librte_graph/rte_graph_worker.h
>
> 3) An optimized SW implementation for (1) and (2). See lib/librte_graph/
>
> 4) A test case to verify the graph infrastructure functionality.
> See app/test/test_graph.c
>
> 5) Performance test cases to evaluate the cost of the graph walker and
> the node enqueue fast-path functions for various combinations.
> See app/test/test_graph_perf.c
>
> 6) Packet processing nodes (null, Rx, Tx, pkt drop, IPv4 rewrite, IPv4
> lookup) using the graph infrastructure. See lib/librte_node/*
>
> 7) An example application to showcase l3fwd (same functionality as the
> existing examples/l3fwd) using the graph infrastructure and the packet
> processing nodes from item (6). See examples/l3fwd-graph/.
>
> Performance
> -----------
> 1) Graph walk and node enqueue overhead can be measured with the
> performance test case application [1].
> # If all packets go from one node to the next (we call this a "homerun"),
> it is just a pointer swap for a burst of packets.
> # In the worst case, a handful of cycles to move an object from one node
> to another.
>
> 2) Performance comparison of the existing l3fwd (completely static code
> without any nodes) vs the modular l3fwd-graph with 5 nodes
> (ip4_lookup, ip4_rewrite, ethdev_tx, ethdev_rx, pkt_drop).
> Here is a graphical representation of the l3fwd-graph as a Graphviz dot
> file: http://bit.ly/39UPPGm
>
> # l3fwd-graph performance is -2.5% wrt static l3fwd.
>
> # We have simulated a similar test with the existing librte_pipeline
> application [4]. The ip_pipeline application is -48.62% wrt static l3fwd.
>
> The above results are on octeontx2. They may vary on other platforms.
> Platforms with larger L1 and L2 caches will have even better performance.
>
> Tested architectures:
> ---------------------
> 1) AArch64
> 2) x86
>
> Graph library features
> ----------------------
> 1) Nodes as plugins
> 2) Support for out-of-tree nodes
> 3) Multi-process support
> 4) Low overhead graph walk and node enqueue
> 5) Low overhead statistics collection infrastructure
> 6) Support to export the graph as a Graphviz dot file.
> See rte_graph_export(). Example of an exported graph: http://bit.ly/2PqbqOy
> 7) Allows another graph walk implementation in the future by segregating
> the fast path and slow path code.
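To illustrate feature (6): a minimal sketch of exporting a graph to a dot
file. The rte_graph_export() prototype (graph name plus a FILE pointer) is
an assumption from the text above and may differ in the final patch:

    #include <stdio.h>
    #include <rte_graph.h>

    static int
    dump_graph_dot(const char *graph_name, const char *path)
    {
        FILE *f = fopen(path, "w");
        int rc;

        if (f == NULL)
            return -1;
        /* Write the Graphviz dot representation of the named graph. */
        rc = rte_graph_export(graph_name, f);
        fclose(f);
        return rc;
    }

The resulting file can then be rendered with, for example, dot -Tsvg.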
>
> Advantages of the graph architecture:
> -------------------------------------
> 1) Memory latency is the enemy of high-speed packet processing; moving
> similar packet processing code into a node reduces the icache and dcache
> misses.
> 2) Exploits the probability that most packets will follow the same nodes
> in the graph.
> 3) Allows SIMD instructions for the packet processing of a node.
> 4) The modular scheme allows reusable nodes for consumers.
> 5) The modular scheme allows us to abstract vendor HW-specific
> optimizations as a node.
>
> What is different from the existing libpipeline library
> --------------------------------------------------------
> At a very high level, libpipeline was created to allow a modular plugin
> interface. Based on our analysis, performance is better in the graph
> model; check the details under the Performance section, item (2).
>
> This rte_graph implementation has taken care of fixing some of the
> architecture/implementation limitations of libpipeline.
>
> 1) Use cases like IP fragmentation and TCP ACK processing (with new TCP
> data sent out in the same context) have a problem, as rte_pipeline_run()
> passes just a 64-bit pkt_mask to the different tables, and the packet
> pointers are stored in a single array in struct rte_pipeline.
> In the graph architecture, the node has complete control over how many
> packets are output to the next node, seamlessly.
>
> 2) Since the pkt_mask is passed to different tables, it takes multiple
> for loops to extract the packets out of a fragmented pkts_mask. This
> makes it difficult to prefetch a set of packets ahead. This issue does
> not exist in the graph architecture.
>
> 3) Every table has two/three function pointers, unlike the graph
> architecture, which has a single function pointer per node.
>
> 4) The current libpipeline main fast-path function doesn't support a
> tree-like topology where 64 packets can be redirected to 64 different
> tables. It is currently limited to a table-based next table id instead
> of a per-packet, action-based next table id. So in a typical case, we
> need to cascade tables and sequentially go through all the tables to
> reach the last table.
>
> 5) The pkt_mask limit is 64 bits, which is the maximum burst size
> possible. The graph library supports up to 256.
>
> In short, both are significantly different architectures.
> Allowing the end user to choose the model, by keeping both in DPDK,
> would be the more appropriate decision.
>
> Why this RFC
> ------------
> 1) We believe the graph architecture provides the best performance for a
> reusable/modular packet processing framework.
> Since DPDK does not have one, it is good to have it in DPDK.
>
> 2) Based on our experience, NPU HW accelerators are very different from
> one vendor to another. Going forward, we believe API abstraction may not
> be enough to abstract the differences in HW. Vendor-specific nodes can
> abstract the HW differences and reuse the generic nodes as needed.
> This would help both the silicon vendors and DPDK end users.
>
> 3) The framework enables the protocol stack to use the native mbuf for
> graph processing, avoiding any conversion between formats, for better
> performance.
>
> 4) DPDK becomes the "goto library" for userspace HW acceleration.
> It is good to have a native graph packet processing library in DPDK.
>
> 5) Obviously, our customers are interested in a graph library in DPDK :-)
>
> Identified tweaks for better performance on different targets
> --------------------------------------------------------------
> 1) Test with various burst size values (256, 128, 64, 32) using the
> CONFIG_RTE_GRAPH_BURST_SIZE config option.
> Based on our testing, on x86 and arm64 servers, the sweet spot is a
> burst size of 256, while on arm64 embedded SoCs it is either 64 or 128.
>
> 2) Disable node statistics (using the CONFIG_RTE_LIBRTE_GRAPH_STATS
> config option) if not needed.
>
> 3) Use the arm64-optimized memory copy for the arm64 architecture by
> selecting CONFIG_RTE_ARCH_ARM64_MEMCPY.
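For reference, a sketch of what that build-time tuning might look like,
using the option names from this RFC (the values shown are illustrative,
not the patch defaults):

    # config/common_base (assumed fragment)
    CONFIG_RTE_GRAPH_BURST_SIZE=256
    CONFIG_RTE_LIBRTE_GRAPH_STATS=y
    CONFIG_RTE_ARCH_ARM64_MEMCPY=y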
>
> Commands to run tests
> ---------------------
>
> [1]
> perf test:
> echo "graph_perf_autotest" | sudo ./build/app/test/dpdk-test -c 0x30
>
> [2]
> functionality test:
> echo "graph_autotest" | sudo ./build/app/test/dpdk-test -c 0x30
>
> [3]
> l3fwd-graph:
> ./l3fwd-graph -c 0x100 -- -p 0x3 --config="(0, 0, 8)" -P
>
> [4]
> # ./ip_pipeline -c 0xff0000 -- -s route.cli
>
> route.cli (copy-paste into the shell to avoid DOS format issues):
> https://pastebin.com/raw/B4Ktx7TT
>
> Next steps
> ----------
> 1) Feedback from the community on the library.
> 2) Collect the API requirements from the community.
> 3) Send the next version, addressing the community's initial feedback
> and fixing the following identified "pending items".
>
> Pending items (will be addressed in the next revision)
> ------------------------------------------------------
> 1) Add documentation as a patch
> 2) Add Doxygen API documentation
> 3) Split the patches at a more logical level for a better review
> 4) Code cleanup
> 5) More optimizations in the nodes and graph infrastructure
>
> Programming guide and API walk-through
> --------------------------------------
> # Anatomy of a node:
> ~~~~~~~~~~~~~~~~~~~~
> See https://github.com/jerinjacobk/share/blob/master/Anatomy_of_a_node.svg
>
> The above diagram depicts the anatomy of a node.
> The node is the basic building block of the graph framework.
>
> A node consists of:
>
> a) process():
>
> The callback function that will be invoked by the worker thread using
> rte_graph_walk() when there is data to be processed by the node.
> A graph node processes the data in process() and enqueues it to the next
> downstream node using the rte_node_enqueue*() functions.
>
> b) Context memory:
>
> Memory allocated by the library to store node-specific context
> information, which is used by the process(), init() and fini() callbacks.
>
> c) init():
>
> The callback function invoked by rte_graph_create() when a node gets
> attached to a graph.
>
> d) fini():
>
> The callback function invoked by rte_graph_destroy() when a node gets
> detached from a graph.
>
> e) Node name:
>
> The name of the node. When a node registers with the graph library, the
> library returns an ID of rte_node_t type. Either the ID or the name can
> be used to look up the node; rte_node_from_name() and
> rte_node_id_to_name() are the node lookup functions.
>
> f) nb_edges:
>
> The number of downstream nodes connected to this node. next_nodes[]
> stores the downstream node objects. The rte_node_edge_update() and
> rte_node_edge_shrink() functions shall be used to update the next_node[]
> objects. Consumers of the node APIs are free to update the next_node[]
> objects until rte_graph_create() is invoked.
>
> g) next_node[]:
>
> The dynamic array that stores the downstream nodes connected to this
> node.
>
> # Node creation and registration
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> a) The node implementer creates the node by implementing the ops and
> attributes of 'struct rte_node_register'.
> b) The library registers the node by invoking RTE_NODE_REGISTER on
> library load, using the constructor scheme. The constructor scheme is
> used here to support multi-process.
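To make the registration scheme concrete: a minimal sketch of a static
node, assuming the process() callback signature and the
'struct rte_node_register' fields implied by the text above (the exact
names and fields may differ in the final patch):

    #include <rte_graph.h>
    #include <rte_graph_worker.h>

    /* process(): forward the whole burst to edge 0 ("pkt_drop"). */
    static uint16_t
    my_node_process(struct rte_graph *graph, struct rte_node *node,
                    void **objs, uint16_t nb_objs)
    {
        rte_node_enqueue(graph, node, 0 /* edge index */, objs, nb_objs);
        return nb_objs;
    }

    static struct rte_node_register my_node = {
        .name = "my_node",
        .process = my_node_process,
        .nb_edges = 1,
        .next_nodes = { "pkt_drop" },
    };

    /* Constructor-based registration, so it also works multi-process. */
    RTE_NODE_REGISTER(my_node);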
>
> # Link the nodes to create the graph topology
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> See https://github.com/jerinjacobk/share/blob/master/Link_the_nodes.svg
>
> The above diagram shows a graph topology after linking the N nodes.
>
> Once nodes are available to the program, the application or the node
> public API functions can link them together to create a complex packet
> processing graph.
>
> There are multiple strategies for linking the nodes:
>
> Method a) Provide the next_nodes[] at node registration time.
> See 'struct rte_node_register::nb_edges'. This addresses the static node
> scheme, where the next_nodes[] of the node are known upfront.
>
> Method b) Use rte_node_edge_get(), rte_node_edge_update() and
> rte_node_edge_shrink() to update the next_nodes[] links for the node
> dynamically.
>
> Method c) Use rte_node_clone() to clone an already existing node.
> When rte_node_clone() is invoked, the library clones all the attributes
> of the node and creates a new one. The name of the cloned node shall be
> "parent_node_name-user_provided_name". This method enables the use case
> of Rx and Tx nodes, where multiple such nodes need to be cloned based on
> the number of CPUs available in the system. The cloned nodes will be
> identical except for the "context memory". The context memory holds the
> port and queue pair information in the case of the Rx and Tx ethdev
> nodes.
>
> # Create the graph object
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> Now that the nodes are linked, it's time to create a graph by including
> the required nodes. The application can provide a set of node patterns
> to form a graph object. The fnmatch() API is used underneath for the
> pattern matching to include the required nodes.
>
> The rte_graph_create() API shall be used to create the graph.
>
> Example of a graph object creation:
>
> {"ethdev_rx_0_0", "ipv4-*", "ethdev_tx_0_*"}
>
> In the above example, a graph object will be created with the ethdev Rx
> node of port 0 and queue 0, all ipv4* nodes in the system, and the
> ethdev Tx node of port 0 with all queues.
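Putting the last two sections together: a sketch of graph creation from
node patterns. The 'struct rte_graph_param' fields here are assumptions
based on the description above:

    #include <rte_common.h>
    #include <rte_graph.h>
    #include <rte_memory.h>

    static const char *node_patterns[] = {
        "ethdev_rx_0_0", "ipv4-*", "ethdev_tx_0_*",
    };

    static rte_graph_t
    create_worker_graph(void)
    {
        struct rte_graph_param prm = {
            .socket_id = SOCKET_ID_ANY,
            .nb_node_patterns = RTE_DIM(node_patterns),
            .node_patterns = node_patterns,
        };

        /* Each pattern is fnmatch()-expanded to select nodes by name. */
        return rte_graph_create("worker0", &prm);
    }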
>
> # Multi-core graph processing
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> In the current graph library implementation, the rte_graph_walk() and
> rte_node_enqueue* fast path API functions are designed to work on a
> single core, for better performance. The fast path APIs work on a graph
> object, so the multi-core graph processing strategy is to create a graph
> object PER WORKER.
>
> # In the fast path:
> ~~~~~~~~~~~~~~~~~~~
> Typical fast-path code looks like below, where the application gets the
> fast-path graph object through rte_graph_lookup() on the worker thread
> and runs rte_graph_walk() in a tight loop.
>
> struct rte_graph *graph = rte_graph_lookup("worker0");
>
> while (!done) {
>     rte_graph_walk(graph);
> }
>
> # Context update when the graph walk is in action
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> The fast-path object for the node is 'struct rte_node'.
>
> It may be possible that, in the slow path or while the graph walk is in
> action, the user needs to update the context of a node, and hence needs
> access to the struct rte_node * memory.
>
> The rte_graph_foreach_node(), rte_graph_node_get() and
> rte_graph_node_get_by_name() APIs can be used to get the
> struct rte_node *. The rte_graph_foreach_node() iterator works on the
> struct rte_graph * fast-path graph object, while the others work on the
> graph ID or name.
>
> # Get the node statistics using a graph cluster
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> The user may need to know the aggregate stats of a node across multiple
> graph objects, especially in the situation where each graph object is
> bound to a worker thread.
>
> A graph cluster object is introduced for statistics.
> rte_graph_cluster_stats_create() shall be used to create a graph cluster
> with multiple graph objects, and rte_graph_cluster_stats_get() to get
> the aggregate node statistics.
>
> An example statistics output from rte_graph_cluster_stats_get():
>
> +---------+----------+------------+--------------+-----------+----------------+------------+
> |Node     |calls     |objs        |realloc_count |objs/call  |objs/sec(10E6)  |cycles/call |
> +---------+----------+------------+--------------+-----------+----------------+------------+
> |node0    |12977424  |3322220544  |5             |256.000    |3047.151872     |20.0000     |
> |node1    |12977653  |3322279168  |0             |256.000    |3047.210496     |17.0000     |
> |node2    |12977696  |3322290176  |0             |256.000    |3047.221504     |17.0000     |
> |node3    |12977734  |3322299904  |0             |256.000    |3047.231232     |17.0000     |
> |node4    |12977784  |3322312704  |1             |256.000    |3047.243776     |17.0000     |
> |node5    |12977825  |3322323200  |0             |256.000    |3047.254528     |17.0000     |
> +---------+----------+------------+--------------+-----------+----------------+------------+
>
> # Node writing guidelines
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> The process() function of a node is a fast-path function that needs to
> be written carefully to achieve maximum performance.
>
> Broadly speaking, there are two different types of nodes.
>
> 1) The first kind of node has a fixed next_nodes[] for the complete
> burst (like ethdev_rx, ethdev_tx) and is simple to write. The process()
> function can move the object burst to the next node either using
> rte_node_next_stream_move(), or using rte_node_next_stream_get() and
> rte_node_next_stream_put().
>
> 2) The second kind is the 'intermediate node', which decides the
> next_node[] to send to on a per-packet basis. In these nodes,
>
> a) firstly, there has to be the best possible packet processing logic;
> b) secondly, each packet needs to be queued to its next node.
>
> At least on some architectures, we get around ~10% more performance if
> we can avoid copying packet pointers from one node to the next, as that
> is ~= memcpy(BURST_SIZE x sizeof(void *)) x NODE_COUNT.
>
> This can be avoided only in the case where all the packets are destined
> to the same next node. We call this the home run case, and we use
> rte_node_next_stream_move() to move the burst of the object array by
> just swapping pointers, i.e. to move the stream from one node to the
> next with the fewest number of cycles.
>
> Example of an intermediate node implementation with home run (see the
> sketch after this list):
> a) Start with the speculation that next_node = ctx->next_node.
> This could be the next_node used in the previous invocation of this
> node.
> b) Get the next_node stream array and space using
> rte_node_next_stream_get(next_node, &space).
> c) While space != 0 and n_pkts_left != 0, prefetch the next pkt_set and
> process the current pkt_set to find its next node.
> d) If all the next nodes of the current pkt_set match the speculated
> next node, just count them as successfully speculated (last_spec) so far
> and continue the loop without actually moving them to the next node.
> Otherwise, on a mismatch, copy all the pkt_set pointers that were
> last_spec, move the current pkt_set to their respective next nodes using
> rte_enqueue_next_x1(), and set last_spec = 0. One of the next nodes can
> also be updated as the speculated next_node if it is more probable.
> e) If n_pkts_left != 0 and space != 0, goto c), as there is space in the
> speculated next_node.
> f) If last_spec == n_pkts_left, then we successfully speculated all the
> packets to the right next node. Just call
> rte_node_next_stream_move(node, next_node) to move the stream/obj array
> to the next node. This is the home run, where we avoided the memcpy of
> buffer pointers to the next node.
> g) If space == 0 and n_pkts_left != 0, goto b).
> h) Update ctx->next_node with the more probable next node.
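A condensed sketch of steps a) to h). Here classify() is a hypothetical
stand-in for the packet processing logic, the helper prototypes are
assumed from the text above (with rte_node_enqueue_x1() standing in for
the rte_enqueue_next_x1() mentioned in step d), the context layout is
illustrative, and for brevity the speculated stream is assumed to have
space for the whole burst (steps e and g elided):

    #include <rte_branch_prediction.h>
    #include <rte_graph.h>
    #include <rte_graph_worker.h>
    #include <rte_memcpy.h>

    /* Hypothetical node-private context holding the speculated edge. */
    struct classify_ctx {
        rte_edge_t next_node;
    };

    /* Hypothetical per-object classifier returning the next edge. */
    static rte_edge_t classify(void *obj);

    static uint16_t
    classify_node_process(struct rte_graph *graph, struct rte_node *node,
                          void **objs, uint16_t nb_objs)
    {
        struct classify_ctx *ctx = (struct classify_ctx *)node->ctx;
        rte_edge_t next_index = ctx->next_node; /* a) speculation */
        uint16_t held = 0, last_spec = 0, i;
        void **from = objs;
        void **to_next;

        /* b) Get stream space in the speculated next node. */
        to_next = rte_node_next_stream_get(graph, node, next_index,
                                           nb_objs);

        for (i = 0; i < nb_objs; i++) { /* c), d) */
            rte_edge_t next = classify(objs[i]);

            if (likely(next == next_index)) {
                last_spec++; /* speculation holds, leave obj in place */
                continue;
            }
            /* Mismatch: flush the run speculated so far, then enqueue
             * the odd object to its own next node.
             */
            rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
            from += last_spec;
            to_next += last_spec;
            held += last_spec;
            last_spec = 0;
            rte_node_enqueue_x1(graph, node, next, from[0]);
            from += 1;
        }

        /* f) Everything matched the speculation: home run. */
        if (likely(last_spec == nb_objs)) {
            rte_node_next_stream_move(graph, node, next_index);
            return nb_objs;
        }

        /* Flush the trailing speculated run. */
        held += last_spec;
        rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
        rte_node_next_stream_put(graph, node, next_index, held);
        ctx->next_node = next_index; /* h) remember the speculation */

        return nb_objs;
    }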
>
> # In-tree node documentation
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> a) librte_node/ethdev_rx.c:
> This node does rte_eth_rx_burst() into a stream buffer acquired using
> rte_node_next_stream_get(), and does rte_node_next_stream_put(count)
> only when packets are received. Each rte_node works on only one Rx port
> and queue, which it gets from node->context.
> For each (port X, rx_queue Y), an rte_node is cloned from
> ethdev_rx_base_node as "ethdev_rx-X-Y" in rte_node_eth_config(), along
> with updating node->context. Each graph needs to be associated with a
> unique rte_node for a (port, rx_queue).
>
> b) librte_node/ethdev_tx.c:
> This node does rte_eth_tx_burst() for a burst of objs received by it.
> It sends the burst to a fixed Tx port and queue taken from
> node->context. For each port X, this rte_node is cloned from
> ethdev_tx_node_base as "ethdev_tx-X" in rte_node_eth_config(), along
> with updating node->context.
> Since each graph doesn't need more than one Txq per port, a Txq is
> assigned to each rte_node instance based on the graph id.
> Each graph needs to be associated with an rte_node for each port.
>
> c) librte_node/pkt_drop.c:
> This node frees all the objects passed to it.
>
> d) librte_node/ip4_lookup.c:
> This node is an intermediate node that does an LPM lookup for the
> received IPv4 packets; the result determines each packet's next node.
> a) On a successful LPM lookup, the result contains the next_node id and
> the next-hop id with which the packet needs to be further processed.
> b) On an LPM lookup failure, objects are redirected to the pkt_drop
> node.
> rte_node_ip4_route_add() is the control path API to add IPv4 routes.
> To achieve the home run, we use rte_node_next_stream_move() as mentioned
> in the sections above.
>
> e) librte_node/ip4_rewrite.c:
> This node gets packets from the ip4_lookup node, with the next-hop id
> for each packet embedded in rte_node_mbuf_priv1(mbuf)->nh. This id is
> used to determine the L2 header to be written to the packet before
> sending it out to a particular ethdev_tx node.
> rte_node_ip4_rewrite_add() is the control path API to add next-hop info.
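To make d) and e) concrete: a sketch of the control-path setup. The
rte_node_ip4_route_add() and rte_node_ip4_rewrite_add() prototypes, and
the RTE_NODE_IP4_LOOKUP_NEXT_REWRITE enum value, are assumptions based on
the API names described above:

    #include <rte_ether.h>
    #include <rte_ip.h>
    #include <rte_node_ip4_api.h>

    static int
    setup_route(void)
    {
        /* Hypothetical prebuilt L2 header: dst MAC, src MAC, ethertype. */
        uint8_t l2_hdr[2 * RTE_ETHER_ADDR_LEN + 2] = { 0 };
        int rc;

        /* 10.0.2.0/24 -> next-hop 2, processed via the ip4_rewrite node. */
        rc = rte_node_ip4_route_add(RTE_IPV4(10, 0, 2, 0), 24, 2,
                                    RTE_NODE_IP4_LOOKUP_NEXT_REWRITE);
        if (rc < 0)
            return rc;

        /* Attach the L2 rewrite data for next-hop 2; Tx on port 0. */
        return rte_node_ip4_rewrite_add(2, l2_hdr, sizeof(l2_hdr), 0);
    }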
>
> Jerin Jacob (1):
>   graph: introduce graph subsystem
>
> Kiran Kumar K (1):
>   test: add graph functional tests
>
> Nithin Dabilpuram (2):
>   node: add packet processing nodes
>   example/l3fwd_graph: l3fwd using graph architecture
>
> Pavan Nikhilesh (1):
>   test: add graph performance test cases.
>
>  app/test/Makefile                      |    5 +
>  app/test/meson.build                   |   10 +-
>  app/test/test_graph.c                  |  820 +++++++++++++++++
>  app/test/test_graph_perf.c             |  888 +++++++++++++++++++
>  config/common_base                     |   13 +
>  config/rte_config.h                    |    4 +
>  examples/Makefile                      |    3 +
>  examples/l3fwd-graph/Makefile          |   58 ++
>  examples/l3fwd-graph/main.c            | 1131 ++++++++++++++++++++++++
>  examples/l3fwd-graph/meson.build       |   13 +
>  examples/meson.build                   |    6 +-
>  lib/Makefile                           |    6 +
>  lib/librte_graph/Makefile              |   28 +
>  lib/librte_graph/graph.c               |  578 ++++++++++++
>  lib/librte_graph/graph_debug.c         |   81 ++
>  lib/librte_graph/graph_ops.c           |  163 ++++
>  lib/librte_graph/graph_populate.c      |  224 +++++
>  lib/librte_graph/graph_private.h       |  113 +++
>  lib/librte_graph/graph_stats.c         |  396 +++++++++
>  lib/librte_graph/meson.build           |   11 +
>  lib/librte_graph/node.c                |  419 +++++++++
>  lib/librte_graph/rte_graph.h           |  277 ++++++
>  lib/librte_graph/rte_graph_version.map |   46 +
>  lib/librte_graph/rte_graph_worker.h    |  280 ++++++
>  lib/librte_node/Makefile               |   30 +
>  lib/librte_node/ethdev_ctrl.c          |  106 +++
>  lib/librte_node/ethdev_rx.c            |  218 +++++
>  lib/librte_node/ethdev_rx.h            |   17 +
>  lib/librte_node/ethdev_rx_priv.h       |   45 +
>  lib/librte_node/ethdev_tx.c            |   74 ++
>  lib/librte_node/ethdev_tx_priv.h       |   33 +
>  lib/librte_node/ip4_lookup.c           |  657 ++++++++++++++
>  lib/librte_node/ip4_lookup_priv.h      |   17 +
>  lib/librte_node/ip4_rewrite.c          |  340 +++++++
>  lib/librte_node/ip4_rewrite_priv.h     |   44 +
>  lib/librte_node/log.c                  |   14 +
>  lib/librte_node/meson.build            |    8 +
>  lib/librte_node/node_private.h         |   61 ++
>  lib/librte_node/null.c                 |   23 +
>  lib/librte_node/pkt_drop.c             |   26 +
>  lib/librte_node/rte_node_eth_api.h     |   31 +
>  lib/librte_node/rte_node_ip4_api.h     |   33 +
>  lib/librte_node/rte_node_version.map   |    9 +
>  lib/meson.build                        |    5 +-
>  meson.build                            |    1 +
>  mk/rte.app.mk                          |    2 +
>  46 files changed, 7362 insertions(+), 5 deletions(-)
>  create mode 100644 app/test/test_graph.c
>  create mode 100644 app/test/test_graph_perf.c
>  create mode 100644 examples/l3fwd-graph/Makefile
>  create mode 100644 examples/l3fwd-graph/main.c
>  create mode 100644 examples/l3fwd-graph/meson.build
>  create mode 100644 lib/librte_graph/Makefile
>  create mode 100644 lib/librte_graph/graph.c
>  create mode 100644 lib/librte_graph/graph_debug.c
>  create mode 100644 lib/librte_graph/graph_ops.c
>  create mode 100644 lib/librte_graph/graph_populate.c
>  create mode 100644 lib/librte_graph/graph_private.h
>  create mode 100644 lib/librte_graph/graph_stats.c
>  create mode 100644 lib/librte_graph/meson.build
>  create mode 100644 lib/librte_graph/node.c
>  create mode 100644 lib/librte_graph/rte_graph.h
>  create mode 100644 lib/librte_graph/rte_graph_version.map
>  create mode 100644 lib/librte_graph/rte_graph_worker.h
>  create mode 100644 lib/librte_node/Makefile
>  create mode 100644 lib/librte_node/ethdev_ctrl.c
>  create mode 100644 lib/librte_node/ethdev_rx.c
>  create mode 100644 lib/librte_node/ethdev_rx.h
>  create mode 100644 lib/librte_node/ethdev_rx_priv.h
>  create mode 100644 lib/librte_node/ethdev_tx.c
>  create mode 100644 lib/librte_node/ethdev_tx_priv.h
>  create mode 100644 lib/librte_node/ip4_lookup.c
>  create mode 100644 lib/librte_node/ip4_lookup_priv.h
>  create mode 100644 lib/librte_node/ip4_rewrite.c
>  create mode 100644 lib/librte_node/ip4_rewrite_priv.h
>  create mode 100644 lib/librte_node/log.c
>  create mode 100644 lib/librte_node/meson.build
>  create mode 100644 lib/librte_node/node_private.h
>  create mode 100644 lib/librte_node/null.c
>  create mode 100644 lib/librte_node/pkt_drop.c
>  create mode 100644 lib/librte_node/rte_node_eth_api.h
>  create mode 100644 lib/librte_node/rte_node_ip4_api.h
>  create mode 100644 lib/librte_node/rte_node_version.map
>