From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vipin Varghese
To: john.mcnamara@intel.com, marko.kovacevic@intel.com, thomas@monjalon.net,
 ferruh.yigit@intel.com, shreyansh.jain@nxp.com, dev@dpdk.org
Cc: sanjay.padubidri@intel.com, amol.patel@intel.com, Vipin Varghese
Date: Tue, 19 Feb 2019 21:27:19 +0530
Message-Id: <20190219155719.89443-3-vipin.varghese@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190219155719.89443-1-vipin.varghese@intel.com>
References: <20190121104144.67365-2-vipin.varghese@intel.com>
 <20190219155719.89443-1-vipin.varghese@intel.com>
Subject: [dpdk-dev] [PATCH v6 2/2] doc: add guide for debug and troubleshoot

Add a user guide on debugging and troubleshooting common issues and
bottlenecks found in the sample application model.

Signed-off-by: Vipin Varghese
Acked-by: Marko Kovacevic
---
 doc/guides/howto/debug_troubleshoot_guide.rst | 451 ++++++++++++++++++
 doc/guides/howto/index.rst                    |   1 +
 2 files changed, 452 insertions(+)
 create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst

diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
new file mode 100644
index 000000000..89310121f
--- /dev/null
+++ b/doc/guides/howto/debug_troubleshoot_guide.rst
@@ -0,0 +1,451 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+   Copyright(c) 2018 Intel Corporation.
+
+.. _debug_troubleshoot_guide:
+
+Debug & Troubleshoot guide
+==========================
+
+DPDK applications can be designed to have simple or complex pipeline
+processing stages making use of single or multiple threads. Applications can
+also use poll mode hardware devices which help offload CPU cycles. It is
+common to find solutions designed with
+
+* single or multiple primary processes
+
+* single primary and single secondary
+
+* single primary and multiple secondaries
+
+In all the above cases, it is tedious to isolate, debug, and understand
+various behaviors which occur randomly or periodically. The goal of this
+guide is to consolidate a few commonly seen issues for reference, and then
+to identify the root cause through step by step debugging at various stages.
+
+.. note::
+
+   It is difficult to cover all possible issues in a single attempt. With
+   feedback and suggestions from the community, more cases can be covered.
+
+
+Application Overview
+--------------------
+
+Using the application model below as a reference, we can discuss multiple
+causes of issues in this guide. Let us assume the sample makes use of a
+single primary process, with various processing stages running on multiple
+cores. The application may also make use of Poll Mode Drivers and libraries
+such as service cores, mempool, mbuf, eventdev, cryptodev, QoS, and ethdev.
+
+The overview of an application modeled using PMD is shown in
+:numref:`dtg_sample_app_model`.
+
+.. _dtg_sample_app_model:
+
+.. figure:: img/dtg_sample_app_model.*
+
+   Overview of pipeline stages of an application
+
+
+Bottleneck Analysis
+-------------------
+
+Factors that drive the design decision could be the platform, the scale
+factor, and the target. These distinct preferences lead to multiple
+combinations that are built using PMDs and libraries of DPDK. While the
+compiler, library build mode, and optimization flags are usually held
+constant, they affect the application too.
+
+
+Is there a mismatch in the packet (received < desired) rate?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX port and associated core :numref:`dtg_rx_rate`.
+
+.. _dtg_rx_rate:
+
+.. figure:: img/dtg_rx_rate.*
+
+   RX packet rate compared against received rate.
+
+#. Is the RX configuration set up correctly?
+
+   - Identify whether the port speed and duplex match the desired values
+     with rte_eth_link_get.
+
+   - Check that the MTU is set to the expected receive packet size with
+     rte_eth_dev_get_mtu if there are large packet drops.
+
+   - Check promiscuous mode with rte_eth_promiscuous_get if the drops do
+     not occur for the unique MAC address.
+
+#. Are the drops isolated to certain NICs only?
+
+   - Make use of rte_eth_stats_get to identify the cause of the drops (a
+     minimal sketch follows this section).
+
+   - If there are mbuf drops, check whether nb_desc for the RX descriptors
+     is sufficient for the application.
+
+   - If rte_eth_stats_get shows drops on specific RX queues, ensure the RX
+     lcore threads have enough cycles for rte_eth_rx_burst on the port
+     queue pair.
+
+   - If traffic is redirected to a specific port queue pair, ensure the RX
+     lcore thread handling it gets enough cycles.
+
+   - Check the RSS configuration with rte_eth_dev_rss_hash_conf_get if the
+     spread across queues is uneven and causing drops.
+
+   - If PMD stats are not updating, then there might be an offload or
+     configuration which is dropping the incoming traffic.
+
+#. Are drops still seen?
+
+   - If there are multiple port queue pairs, it might be the RX thread, RX
+     distributor, or event RX adapter not having enough cycles.
+
+   - If drops are seen for the RX adapter or RX distributor, try using
+     rte_prefetch_non_temporal(), which informs the core that the mbuf in
+     the cache is needed only temporarily.
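+
+The sketch below is one possible way to gather these RX checks in a single
+place; it is not part of any DPDK sample. The helper name dump_rx_diagnostics,
+the use of stdout, and the choice of counters printed are illustrative
+assumptions.
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <inttypes.h>
+
+   #include <rte_ethdev.h>
+
+   /* Hypothetical helper: print basic RX diagnostics for one port. */
+   static void
+   dump_rx_diagnostics(uint16_t port_id)
+   {
+       struct rte_eth_link link;
+       struct rte_eth_stats stats;
+       uint16_t mtu;
+
+       /* Speed, duplex and link state as negotiated on the wire. */
+       rte_eth_link_get(port_id, &link);
+       printf("port %u: %u Mbps, %s duplex, link %s\n", port_id,
+              link.link_speed,
+              link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half",
+              link.link_status ? "up" : "down");
+
+       /* MTU versus the largest frame the application expects. */
+       if (rte_eth_dev_get_mtu(port_id, &mtu) == 0)
+           printf("port %u: mtu %u\n", port_id, mtu);
+
+       /* imissed points at NIC level drops (RX queues full), rx_nombuf at
+        * mbuf allocation failures inside the application. */
+       if (rte_eth_stats_get(port_id, &stats) == 0)
+           printf("port %u: ipackets %" PRIu64 " imissed %" PRIu64
+                  " ierrors %" PRIu64 " rx_nombuf %" PRIu64 "\n", port_id,
+                  stats.ipackets, stats.imissed, stats.ierrors,
+                  stats.rx_nombuf);
+   }
+
+Calling such a helper periodically from a control thread makes it easier to
+see whether drops grow at the NIC (imissed) or inside the application
+(rx_nombuf).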
+
+Are there packet drops at receive or transmit?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.
+
+.. _dtg_rx_tx_drop:
+
+.. figure:: img/dtg_rx_tx_drop.*
+
+   RX-TX drops
+
+#. At RX
+
+   - Identify whether multiple RX queues are configured for the port by
+     checking nb_rx_queues using rte_eth_dev_info_get.
+
+   - If rte_eth_stats_get shows drops in q_errors, check whether the RX
+     thread is configured to fetch packets from that port queue pair.
+
+   - If rte_eth_stats_get shows drops in rx_nombuf, check whether the RX
+     thread has enough cycles to consume the packets from the queue.
+
+#. At TX
+
+   - If the TX rate is falling behind the application fill rate, identify
+     whether there are enough descriptors with rte_eth_dev_info_get for TX.
+
+   - Check that nb_pkts in rte_eth_tx_burst is set to transmit multiple
+     packets.
+
+   - Check whether rte_eth_tx_burst invokes the vector function call for
+     the PMD.
+
+   - If oerrors are getting incremented, TX packet validations are failing.
+     Check whether there are queue specific offload failures.
+
+   - If the drops occur for large size packets, check the MTU and
+     multi-segment support configured for the NIC.
+
+Are there object drops at the producer point for the ring library?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Producer point for the ring :numref:`dtg_producer_ring`.
+
+.. _dtg_producer_ring:
+
+.. figure:: img/dtg_producer_ring.*
+
+   Producer point for Rings
+
+#. Performance issue isolation at the producer
+
+   - Use rte_ring_dump to validate whether the single producer flag
+     RING_F_SP_ENQ is set.
+
+   - There should be a sufficient rte_ring_free_count at any point in time
+     (a minimal sketch for checking this follows the consumer case below).
+
+   - Extreme stalls in the dequeue stage of the pipeline will cause
+     rte_ring_full to be true.
+
+Are there object drops at the consumer point for the ring library?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Consumer point for the ring :numref:`dtg_consumer_ring`.
+
+.. _dtg_consumer_ring:
+
+.. figure:: img/dtg_consumer_ring.*
+
+   Consumer point for Rings
+
+#. Performance issue isolation at the consumer
+
+   - Use rte_ring_dump to validate whether the single consumer flag
+     RING_F_SC_DEQ is set.
+
+   - If the desired burst dequeue falls behind the actual dequeue, the
+     enqueue stage is not filling up the ring as required.
+
+   - Extreme stalls in the enqueue stage will lead to rte_ring_empty being
+     true.
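+
+The following minimal sketch combines the ring checks above; the helper name
+check_ring_occupancy is an illustrative assumption and the ring passed in is
+whichever ring the application created.
+
+.. code-block:: c
+
+   #include <stdio.h>
+
+   #include <rte_ring.h>
+
+   /* Hypothetical helper: report how full a ring is and dump its flags. */
+   static void
+   check_ring_occupancy(struct rte_ring *r)
+   {
+       unsigned int used = rte_ring_count(r);
+       unsigned int free_slots = rte_ring_free_count(r);
+
+       /* A ring that is almost always full points at a slow consumer;
+        * one that is almost always empty points at a slow producer. */
+       printf("ring %s: %u used, %u free, full=%d empty=%d\n",
+              r->name, used, free_slots, rte_ring_full(r),
+              rte_ring_empty(r));
+
+       /* Dumps size, mask, producer/consumer positions and flags such as
+        * RING_F_SP_ENQ and RING_F_SC_DEQ. */
+       rte_ring_dump(stdout, r);
+   }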
+
+Is there a variance in packet or object processing rate in the pipeline?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Memory objects close to NUMA :numref:`dtg_mempool`.
+
+.. _dtg_mempool:
+
+.. figure:: img/dtg_mempool.*
+
+   Memory objects have to be close to the device per NUMA.
+
+#. Stalls in the processing pipeline can be attributed to mbuf release
+   delays. These can be narrowed down to
+
+   - Heavy processing cycles at a single or multiple processing stages.
+
+   - The cache working set is spread due to the increased number of stages
+     in the pipeline.
+
+   - The CPU thread responsible for TX is not able to keep up with the
+     burst of traffic.
+
+   - Extra cycles to linearize multi-segment buffers and software offloads
+     like checksum, TSO, and VLAN strip.
+
+   - Packet buffer copies in the fast path also result in stalls in mbuf
+     release if not done selectively.
+
+   - Application logic sets rte_pktmbuf_refcnt_set higher than the desired
+     value, frequently uses rte_pktmbuf_prefree_seg, and does not release
+     the mbuf back to the mempool.
+
+#. Lower performance between the pipeline processing stages can be due to
+
+   - The NUMA instance for packets or objects from the NIC, mempool, and
+     ring not being the same.
+
+   - Drops on a specific socket caused by insufficient objects in the pool.
+     Use rte_mempool_get_count or rte_mempool_avail_count to monitor when
+     drops occur (a minimal sketch follows this section).
+
+   - Missing prefetches; try prefetching the content in the processing
+     pipeline logic to minimize the stalls.
+
+#. Performance issues can be due to special cases
+
+   - Check whether the mbuf is contiguous with rte_pktmbuf_is_contiguous,
+     as certain offloads require it.
+
+   - Use rte_mempool_cache_create for user threads that require access to
+     mempool objects.
+
+   - If the variance is absent for larger huge pages, then try
+     rte_mem_lock_page on the objects, packets, and lookup tables to
+     isolate the issue.
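+
+A minimal sketch for such monitoring is shown below; the helper name
+check_mempool_pressure and the one-tenth threshold are illustrative
+assumptions.
+
+.. code-block:: c
+
+   #include <stdio.h>
+
+   #include <rte_mempool.h>
+
+   /* Hypothetical watchdog: warn when a mempool is close to exhaustion,
+    * which usually shows up elsewhere as rx_nombuf or enqueue failures. */
+   static void
+   check_mempool_pressure(const struct rte_mempool *mp)
+   {
+       unsigned int avail = rte_mempool_avail_count(mp);
+       unsigned int in_use = rte_mempool_in_use_count(mp);
+
+       printf("mempool %s: %u available, %u in use, socket %d\n",
+              mp->name, avail, in_use, mp->socket_id);
+
+       /* The threshold of one tenth of the pool is an example value. */
+       if (avail < mp->size / 10)
+           printf("mempool %s is close to exhaustion\n", mp->name);
+   }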
+
+Is there a variance in cryptodev performance?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Crypto device and PMD :numref:`dtg_crypto`.
+
+.. _dtg_crypto:
+
+.. figure:: img/dtg_crypto.*
+
+   CRYPTO and interaction with PMD device.
+
+#. Performance issue isolation for enqueue
+
+   - Ensure the cryptodev, its resources, and the enqueue thread are
+     running on the same NUMA node.
+
+   - Isolate the cause of errors reported in err_count using
+     rte_cryptodev_stats_get.
+
+   - Parallelize the enqueue thread across multiple queue pairs.
+
+#. Performance issue isolation for dequeue
+
+   - Ensure the cryptodev, its resources, and the dequeue thread are
+     running on the same NUMA node.
+
+   - Isolate the cause of errors reported in err_count using
+     rte_cryptodev_stats_get.
+
+   - Parallelize the dequeue thread across multiple queue pairs.
+
+#. Performance issue isolation for the crypto operation
+
+   - If cryptodev software-assist is in use, ensure the library is built
+     with the right (SIMD) flags, or check whether the queue pair uses the
+     CPU ISA by reading feature_flags (AVX|SSE|NEON) with
+     rte_cryptodev_info_get.
+
+   - If cryptodev hardware-assist is in use, ensure both the firmware and
+     the drivers are up to date.
+
+#. Configuration issue isolation
+
+   - Identify cryptodev instances with rte_cryptodev_count and
+     rte_cryptodev_info_get.
+
+Is the user function performance not as expected?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Custom worker function :numref:`dtg_distributor_worker`.
+
+.. _dtg_distributor_worker:
+
+.. figure:: img/dtg_distributor_worker.*
+
+   Custom worker function performance drops.
+
+#. Performance issue isolation
+
+   - The functions running on CPU cores without context switches are the
+     performing scenarios. Identify the lcore with rte_lcore_id and the
+     lcore to CPU mapping with rte_lcore_index.
+
+   - Use rte_thread_get_affinity to isolate functions running on the same
+     CPU core.
+
+#. Configuration issue isolation
+
+   - Identify the core role using rte_eal_lcore_role to identify RTE, OFF,
+     and SERVICE. Check that performance-critical functions are mapped to
+     run on the intended cores.
+
+   - For high-performance execution logic, ensure it runs on the correct
+     NUMA node and on a non-master core.
+
+   - Analyze the run logic with rte_dump_stack, rte_dump_registers, and
+     rte_memdump for more insights.
+
+   - Make use of objdump to ensure the generated opcodes match the desired
+     state.
+
+Are the execution cycles for dynamic service functions not frequent enough?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Service functions on service cores :numref:`dtg_service`.
+
+.. _dtg_service:
+
+.. figure:: img/dtg_service.*
+
+   Functions running on service cores
+
+#. Performance issue isolation
+
+   - For services configured for parallel execution,
+     rte_service_lcore_count should be equal to
+     rte_service_lcore_count_services.
+
+   - A service meant to run in parallel on all cores should return
+     RTE_SERVICE_CAP_MT_SAFE for rte_service_probe_capability, and
+     rte_service_map_lcore_get should return a unique lcore for each.
+
+   - If the execution cycles of a dynamic service function are not
+     frequent, check whether other services are consuming the cycles of
+     the same lcore.
+
+   - Services sharing the same lcore have to share that core's overall
+     execution cycles.
+
+#. Configuration issue isolation
+
+   - Check whether the service is running with rte_service_runstate_get
+     (a minimal sketch follows this section).
+
+   - Generic debug via rte_service_dump.
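+
+A minimal sketch of such a check is shown below; the helper name
+check_service is an illustrative assumption and service_id is whatever id
+the application obtained when the service was registered.
+
+.. code-block:: c
+
+   #include <stdio.h>
+
+   #include <rte_service.h>
+
+   /* Hypothetical check: confirm a registered service is set to run and
+    * look at how the service lcores are shared. */
+   static void
+   check_service(uint32_t service_id)
+   {
+       if (rte_service_runstate_get(service_id) != 1)
+           printf("service %u is not set to run\n", service_id);
+
+       /* Dump per-service statistics such as call count and cycles. */
+       rte_service_dump(stdout, service_id);
+
+       printf("%d service lcores for %u registered services\n",
+              rte_service_lcore_count(), rte_service_get_count());
+   }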
+
+Is there a bottleneck in the performance of eventdev?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Check the generic configuration
+
+   - Ensure the event devices created are on the correct NUMA node using
+     rte_event_dev_count and rte_event_dev_socket_id.
+
+   - Check at the event stages whether the events are looped back into the
+     same queue.
+
+   - If the failure is on the enqueue stage for events, check the queue
+     depth with rte_event_dev_info_get.
+
+#. If there are performance drops in the enqueue stage
+
+   - Use rte_event_dev_dump to dump the eventdev information.
+
+   - Periodically check the stats for queues and ports to identify
+     starvation.
+
+   - Check the in-flight events for the desired queue for enqueue and
+     dequeue.
+
+Is there a variance in the traffic manager?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Traffic Manager on the TX interface :numref:`dtg_qos_tx`.
+
+.. _dtg_qos_tx:
+
+.. figure:: img/dtg_qos_tx.*
+
+   Traffic Manager just before TX.
+
+#. Identify whether the cause of a variance from the expected behavior is
+   insufficient CPU cycles. Use rte_tm_capabilities_get to check whether
+   hierarchies, WRED, and priority schedulers can be offloaded to hardware.
+
+#. Undesired flow drops can be narrowed down to WRED, priority, and rate
+   limiters.
+
+#. Isolate the flow in which the undesired drops occur. Use
+   rte_tm_get_number_of_leaf_nodes and the flow table to pin down the leaf
+   node where the drops occur.
+
+#. Check the stats using rte_tm_node_stats_update and
+   rte_tm_node_stats_read for drops in the hierarchy, scheduler, and WRED
+   configurations.
+
+Is the packet in an unexpected format?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Packet capture before and after processing :numref:`dtg_pdump`.
+
+.. _dtg_pdump:
+
+.. figure:: img/dtg_pdump.*
+
+   Capture points of traffic at RX-TX.
+
+#. To isolate possible packet corruption in the processing pipeline,
+   carefully staged packet captures have to be implemented.
+
+   - First, isolate at NIC entry and exit. Use pdump in the primary
+     process to allow a secondary process to access the port queue pair.
+     The packets are copied over in the RX|TX callback by the secondary
+     process using ring buffers.
+
+   - Second, isolate at pipeline entry and exit. Using hooks or callbacks,
+     capture the packets in the middle of the pipeline stage and share
+     them with a secondary debug process via user-defined custom rings.
+
+.. note::
+
+   Use a similar analysis for object and metadata corruption.
+
+Does the issue still persist?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The cause of the issue can be further narrowed down to the following.
+
+#. If there is vendor or application specific metadata, check for errors
+   indicated by the metadata error flags. Dumping the private metadata in
+   the objects can give insight into details for debugging.
+
+#. If multiple processes are used for either data or configuration, check
+   for possible errors in the secondary process where the configuration
+   fails and for possible data corruption in the data plane.
+
+#. Random drops in RX or TX when another application is started are an
+   indication of a noisy neighbor. Try using the cache allocation technique
+   to minimize the interference between applications.
+
+
+How to develop custom code to debug?
+--------------------------------------
+
+#. For an application that runs as the primary process only, the debug
+   functionality is added in the same process. It can be invoked by a timer
+   callback, a service core, or a signal handler.
+
+#. For an application that runs as multiple processes, add the debug
+   functionality in a standalone secondary process.
diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index a642a2be1..9527fa84d 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -18,3 +18,4 @@ HowTo Guides
     virtio_user_as_exceptional_path
     packet_capture_framework
     telemetry
+    debug_troubleshoot_guide
--
2.17.1