From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mga06.intel.com (mga06.intel.com [134.134.136.31])
 by dpdk.org (Postfix) with ESMTP id 268941B710
 for ; Mon, 26 Nov 2018 08:11:52 +0100 (CET)
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga001.fm.intel.com ([10.253.24.23])
 by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 25 Nov 2018 23:11:52 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.56,281,1539673200"; d="scan'208";a="111163954"
Received: from unknown (HELO saesrv02-S2600CWR.intel.com) ([10.224.122.203])
 by fmsmga001.fm.intel.com with ESMTP; 25 Nov 2018 23:11:49 -0800
From: Vipin Varghese
To: dev@dpdk.org, thomas@monjalon.net, marko.kovacevic@intel.com,
 honnappa.nagarahalli@arm.com, cristian.dumitrescu@intel.com,
 anatoly.burakov@intel.com, bruce.richardson@intel.com, olivier.matz@6wind.com
Cc: john.mcnamara@intel.com, amol.patel@intel.com, Vipin Varghese
Date: Mon, 26 Nov 2018 12:38:15 +0530
Message-Id: <20181126070815.37501-2-vipin.varghese@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20181126070815.37501-1-vipin.varghese@intel.com>
References: <20181126070815.37501-1-vipin.varghese@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: [dpdk-dev] [PATCH v3 2/2] doc: add guide for debug and troubleshoot
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 26 Nov 2018 07:11:53 -0000

Add a user guide on debugging and troubleshooting common issues and
bottlenecks found in various application models running with single or
multiple stages.
Signed-off-by: Vipin Varghese
Acked-by: Marko Kovacevic
---
V3:
 - reorder for removing warning in 'make doc-guides-html' - Thomas Monjalon

V2:
 - add offload flag check - Vipin Varghese
 - change tab to space - Marko Kovacevic
 - spelling correction - Marko Kovacevic
 - remove extra characters - Marko Kovacevic
 - add ACK by Marko - Vipin Varghese
---
 doc/guides/howto/debug_troubleshoot_guide.rst | 351 ++++++++++++++++++
 doc/guides/howto/index.rst                    |   1 +
 2 files changed, 352 insertions(+)
 create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst

diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
new file mode 100644
index 000000000..55589085e
--- /dev/null
+++ b/doc/guides/howto/debug_troubleshoot_guide.rst
@@ -0,0 +1,351 @@
.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2018 Intel Corporation.

.. _debug_troubleshoot_via_pmd:

Debug & Troubleshoot guide via PMD
==================================

DPDK applications can be designed to run as a simple single-threaded,
single-stage process or as multiple threads with complex pipeline stages.
These applications can use poll mode drivers, which help in offloading CPU
cycles. A few models are

* single primary
* multiple primary
* single primary single secondary
* single primary multiple secondary

In all the above cases, it is a tedious task to isolate, debug and understand
odd behaviour which occurs randomly or periodically. The goal of this guide is
to share and explore a few commonly seen patterns and behaviours, then isolate
and identify the root cause via step-by-step debugging at various processing
stages.

Application Overview
--------------------

Let us take an example application as a reference to explain commonly seen
issues and patterns. The sample application under discussion makes use of the
single primary model with various pipeline stages.
The application uses PMDs and libraries such as service cores, mempool,
pkt mbuf, event, crypto, QoS and eth.

The overview of an application modeled using PMDs is shown in
:numref:`dtg_sample_app_model`.

.. _dtg_sample_app_model:

.. figure:: img/dtg_sample_app_model.*

   Overview of pipeline stages of an application

Bottleneck Analysis
-------------------

To debug bottleneck and performance issues, the application under test is
made to run in an environment matching the below

#. Linux 64-bit|32-bit
#. DPDK PMDs and libraries are used
#. Libraries and PMDs are either static or shared, but not both
#. Machine-flag optimizations of gcc or the compiler are kept constant

Is there a mismatch in packet rate (received < sent)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX port and associated core :numref:`dtg_rx_rate`.

.. _dtg_rx_rate:

.. figure:: img/dtg_rx_rate.*

   RX send rate compared against received rate

#. Are the generic configurations correct?

   - What is the port speed and duplex mode? rte_eth_link_get()
   - Are packets of larger sizes dropped? rte_eth_dev_get_mtu()
   - Are only specific MACs received? rte_eth_promiscuous_get()

#. Are there NIC-specific drops?

   - Check rte_eth_rx_queue_info_get() for nb_desc and scattered_rx
   - Check rte_eth_stats_get() for stats per queue
   - Do stats of the other queues show no change? Cross-check the RSS
     configuration via rte_eth_dev_rss_hash_conf_get()
   - Check if port offloads and queue offloads match.

#. If the problem still persists, it might be at the RX lcore thread

   - Check if the RX thread, distributor or event RX adapter is holding or
     processing more than required
   - Try using rte_prefetch_non_temporal(), which hints that the pulled
     mbuf should stay in cache only temporarily

Are there packet drops (receive|transmit)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RX-TX port and associated cores :numref:`dtg_rx_tx_drop`.

.. _dtg_rx_tx_drop:

.. figure:: img/dtg_rx_tx_drop.*

   RX-TX drops

#. At RX

   - Get the RX queue count via rte_eth_dev_info_get() for nb_rx_queues
   - Check for misses and errors via rte_eth_stats_get() for imissed,
     ierrors, q_errors and rx_nombuf, and check the mbuf reference count
     via rte_mbuf_refcnt_read()

#. At TX

   - Are we transmitting in bulk to reduce the TX descriptor overhead?
   - Check rte_eth_stats_get() for oerrors and q_errors, and the mbuf
     reference count via rte_mbuf_refcnt_read()
   - Is the packet multi-segmented? Check if the port and queue offloads
     are set.

Are there object drops at the producer point for the ring?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Producer point for ring :numref:`dtg_producer_ring`.

.. _dtg_producer_ring:

.. figure:: img/dtg_producer_ring.*

   Producer point for rings

#. Performance for the producer

   - Fetch the type of the ring via rte_ring_dump() for flags (RING_F_SP_ENQ)
   - If '(burst enqueue - actual enqueue) > 0', check rte_ring_count() or
     rte_ring_free_count()
   - If 'burst or single enqueue is 0', there is no more space; check via
     rte_ring_full()

Are there object drops at the consumer point for the ring?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consumer point for ring :numref:`dtg_consumer_ring`.

.. _dtg_consumer_ring:

.. figure:: img/dtg_consumer_ring.*

   Consumer point for rings

#. Performance for the consumer

   - Fetch the type of the ring via rte_ring_dump() for flags (RING_F_SC_DEQ)
   - If '(burst dequeue - actual dequeue) > 0', check rte_ring_free_count()
   - If 'burst or single dequeue' always results in 0, check whether the
     ring is empty via rte_ring_empty()

Are packets or objects not processed at the desired rate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory objects close to NUMA :numref:`dtg_mempool`.

.. _dtg_mempool:

.. figure:: img/dtg_mempool.*

   Memory objects have to be close to the device per NUMA

#. Is the performance low?

   - Are packets received from multiple NICs? rte_eth_dev_count_all()
   - Are the NIC interfaces on different sockets? Use
     rte_eth_dev_socket_id()
   - Is the mempool created with the right socket?
     rte_mempool_create() or rte_pktmbuf_pool_create()
   - Are we seeing drops on a specific socket? It might require more
     mempool objects; try allocating more objects
   - Is there a single RX thread for multiple NICs? Try having multiple
     lcores to read from fixed interfaces, or we might be hitting the
     cache limit, so increase cache_size for pool_create()

#. Are we still seeing low performance?

   - Check if there are sufficient objects in the mempool via
     rte_mempool_avail_count()
   - Is there failure for some packets? We might be getting packets with
     size > mbuf data size; check rte_pktmbuf_is_contiguous()
   - If a user pthread is used for object access, use
     rte_mempool_cache_create()
   - Try using 1GB huge pages instead of 2MB. If there is a difference,
     then try rte_mem_lock_page() for 2MB pages

.. note::
   A stall in the release of mbufs can happen because

   * the processing pipeline is too heavy
   * the number of stages is too many
   * TX is not transferred at the desired rate
   * multi segment is not offloaded at the TX device
   * of application misuse, such as

     - not freeing packets
     - invalid rte_pktmbuf_refcnt_set
     - invalid rte_pktmbuf_prefree_seg

Is there a difference in performance for crypto?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Crypto device and PMD :numref:`dtg_crypto`.

.. _dtg_crypto:

.. figure:: img/dtg_crypto.*

   CRYPTO and interaction with PMD device

#. Are the generic configurations correct?

   - Get the total number of crypto devices via rte_cryptodev_count()
   - Cross-check that the SW or HW flags are configured properly via
     rte_cryptodev_info_get() for feature_flags

#. Is enqueue request > actual enqueue (drops)?

   - Is the queue pair set up for the proper node? Check
     rte_cryptodev_queue_pair_setup() for socket_id
   - Is the session_pool created from the same socket_id as the queue pair?
   - Is the enqueue thread on the same socket_id?
   - Check rte_cryptodev_stats() for enqueue_err_count and
     dequeue_err_count (drops)
   - Are there multiple threads enqueueing or dequeueing from the same
     queue pair?

#. Is enqueue rate > dequeue rate?
   - Is the dequeue lcore thread on the same socket_id?
   - If SW crypto is in use, check if the crypto library is built with the
     right (SIMD) flags, or check if the queue pair uses the CPU ISA via
     rte_cryptodev_info_get() for feature_flags for AVX|SSE
   - If HW crypto is in use, is the card on the same NUMA socket as the
     queue pair and session pool?

Worker functions not giving performance?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom worker function :numref:`dtg_distributor_worker`.

.. _dtg_distributor_worker:

.. figure:: img/dtg_distributor_worker.*

   Custom worker function performance drops

#. Performance

   - Do the threads context-switch frequently? Identify the lcore with
     rte_lcore_id() and the lcore index mapping with rte_lcore_index().
     Best performance is achieved when the mapping of thread and core
     is 1:1.
   - Check the lcore role type and state via rte_eal_lcore_role() for
     ROLE_RTE, ROLE_OFF and ROLE_SERVICE. A user function on a service
     core might be sharing timeslots with other functions.
   - Check the CPU core via rte_thread_get_affinity() and
     rte_eal_get_lcore_state() for the run state.

#. Debug

   - Mode of operation? Use rte_eal_get_configuration() for the master
     lcore; fetch the lcore|service|numa counts and process_type.
   - Check the lcore run mode via rte_eal_lcore_role() for ROLE_RTE,
     ROLE_OFF and ROLE_SERVICE.
   - Process details? rte_dump_stack(), rte_dump_registers() and
     rte_memdump() will give insights.

Service functions are not frequent enough?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Service functions on service cores :numref:`dtg_service`.

.. _dtg_service:

.. figure:: img/dtg_service.*

   Functions running on service cores

#. Performance

   - Get the service core count via rte_service_lcore_count() and compare
     with the result of rte_eal_get_configuration()
   - Check if the registered service is available via
     rte_service_get_by_name(), rte_service_get_count() and
     rte_service_get_name()
   - Is a given service running in parallel on multiple lcores?
     rte_service_probe_capability() and rte_service_map_lcore_get()
   - Is the service running? rte_service_runstate_get()

#. Debug

   - Find how many services are running on a specific service lcore via
     rte_service_lcore_count_services()
   - Generic debug via rte_service_dump()

Is there a bottleneck in eventdev?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Are the generic configurations correct?

   - Get the eventdev device count via rte_event_dev_count()
   - Are they created on the correct socket_id? rte_event_dev_socket_id()
   - Check the HW or SW capabilities via rte_event_dev_info_get() for
     event_qos, queue_all_types, burst_mode, multiple_queue_port,
     max_event_queue|dequeue_depth
   - Is a packet stuck in a queue? Check for stages (event queues) where
     packets are looped back to the same or previous stages.

#. Performance drops in enqueue (event count > actual enqueue)?

   - Dump the eventdev information via rte_event_dev_dump()
   - Check the stats for the queues and ports of the eventdev
   - Check the inflight count and current queue elements for
     enqueue|dequeue

How to debug QoS via TM?
~~~~~~~~~~~~~~~~~~~~~~~~

TM on TX interface :numref:`dtg_qos_tx`.

.. _dtg_qos_tx:

.. figure:: img/dtg_qos_tx.*

   Traffic Manager just before TX

#. Is the configuration right?
   - Get the current capabilities for the DPDK port via
     rte_tm_capabilities_get() for max nodes, levels, shaper_private,
     shaper_shared, sched_n_children and stats_mask
   - Check if the current leaves are configured identically via
     rte_tm_capabilities_get() for leaf_nodes_identical
   - Get the leaf nodes for a DPDK port via
     rte_tm_get_number_of_leaf_nodes()
   - Check the level capabilities via rte_tm_level_capabilities_get()
     for n_nodes

     - max, nonleaf_max, leaf_max
     - identical, non_identical
     - shaper_private_supported
     - stats_mask
     - cman WRED packet|byte supported
     - cman head drop supported

   - Check the node capabilities via rte_tm_node_capabilities_get()

     - shaper_private_supported
     - stats_mask
     - cman WRED packet|byte supported
     - cman head drop supported

   - Debug via stats: rte_tm_stats_update() and rte_tm_node_stats_read()

Packet is not of the right format?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Packet capture before and after processing :numref:`dtg_pdump`.

.. _dtg_pdump:

.. figure:: img/dtg_pdump.*

   Capture points of traffic at RX-TX

#. With capture enabled in the primary, the secondary can access it. It
   copies packets from specific RX or TX queues to the secondary process
   ring buffers.

.. note::
   Need to explore:

   * if the secondary shares the same interface, can we enable capture
     from the secondary for RX|TX happening on the primary?
   * specific PMD private data: dump the details
   * user private data, if present: dump the details

How to develop custom code to debug?
------------------------------------

- For a single process, the debug functionality is to be added in the
  same process
- For multiple processes, the debug functionality can be added to the
  secondary multi-process

.. note::

   The primary's debug functions can be invoked via

   #. Timer call-back
   #. Service function under a service core
   #. USR1 or USR2 signal handler

diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index a642a2be1..9527fa84d 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -18,3 +18,4 @@ HowTo Guides
     virtio_user_as_exceptional_path
     packet_capture_framework
     telemetry
+    debug_troubleshoot_guide
-- 
2.17.1