From: vipul.ashri@oracle.com
To: grive@u256.net, vipul.ashri@oracle.com, dev@dpdk.org
Cc: stable@dpdk.org
Subject: Re: [PATCH v2] net/failsafe: link_update request crashing at boot
Date: Mon, 14 Feb 2022 20:17:20 +0530
Message-ID: <87c84612-4116-4fe7-a711-f5f364513c3d@www.fastmail.com>
In-Reply-To: <20211021214215.1633-1-vipul.ashri@oracle.com>
References: <20211021115139.2634-1-vipul.ashri@oracle.com> <20211021214215.1633-1-vipul.ashri@oracle.com>
Message-ID: <20220214144720.gYziJxzZohYvqwFvMbNgCZCcQ00EaFCxY1fNx-X1_1U@z>

From: Gaëtan Rivet

>On Thu, Oct 21, 2021, at 23:42, vipul.ashri@oracle.com wrote:
>> From: Vipul Ashri
>>
>> failsafe crashed while sending early link_update request during
>> boot time initialization.
>> Based on debugging we found failsafe device was good but
>> sub-devices were progressing towards initialization and SUBOPS macro
>> where expanding macro gives [partial_dev]->dev_ops->link_update()
>> execution of which triggered crash because dev_ops==0. similar
>> crash seen at failsafe_eth_dev_close()
>>
>> Failsafe driver need a separate check for subdevices similar to
>> "RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);" which is
>> called to almost every eth_dev function.
>>
>> Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
>> Cc: stable@dpdk.org
>> Signed-off-by: Vipul Ashri
>
>Hello Vipul,
>
>I'm sorry for the delay, I missed your fix on the mailing list.
>
>IIUC, the issue is that failsafe finished init and received an ethdev
>operation call, but one of its sub-device, although marked DEV_ACTIVE,
>has its eth_dev->dev_ops field NULL.
>
>It is really surprising to me, because there aren't many ways for a sub-device
>to become DEV_ACTIVE.
>
>The only two ways are
>
>  * by executing 'fs_dev_configure()', which will first execute
>    rte_eth_dev_configure() on the sub-device, and on error would
>    stop *without* setting DEV_ACTIVE.
>    rte_eth_dev_configure() will itself execute
>    RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV), so it would
>    return negative errno and fs_dev_configure() would abort.
>
>  * by executing 'fs_dev_remove()' and the sub-device was 'DEV_STARTED'
>    to begin with, then it is retrograded to DEV_ACTIVE once stopped.
>
>So I don't understand yet how it is possible for a sub-device to become DEV_ACTIVE
>while its eth_dev->dev_ops are NULL.
>It seems more like a bug, memory corruption or
>just an unexpected execution pattern.
>
>Could describe in more detail the execution?
>In particular, setting the EAL log-level to debug with the option:
>' --log-level pmd.net.failsafe:debug '
>for example while using testpmd or your DPDK app.
>It should show ethdev level accesses to the sub-devices, and error values.
>
>Best regards,
>--
>Gaetan Rivet

Hi Gaetan,

Sorry for the very late reply; we were busy working on the 21.11 integration.
We have already adopted this fix internally, but I am sharing the patch
upstream for the benefit of the community. This is a specific case seen on an
Azure setup with our heavily customized, complex environment.

Let me share the logs with the traceback first.

==================================================================================================================
SECONDARY PROCESS
timestamp=1633598184 TCZ0.0.0 Cycle 152 (Build 1832)
signal 11 (Segmentation fault), address is 0x31117bbce6c8 from 0x47d08b1
[bt]: ( 1) _Z18snprintf_backtraceRPciiP9siginfo_tPv (+ 0xf4) - sp = 0x7fffef3fd110, ip = 0x3acdc54
[bt]: ( 2) _Z13crit_err_hdlriP9siginfo_tPv (+ 0x159) - sp = 0x7fffef3fdc20, ip = 0x3acdf29
[bt]: ( 3) _ZN13SignalAdapter12handleSignalEiP9siginfo_tPv (+ 0x104) - sp = 0x7fffef3fdf00, ip = 0x274d4c4
[bt]: ( 4) _L_unlock_18 (+ 0x2c) - sp = 0x7fffef3fdf80, ip = 0x7ffff7bce630
[bt]: ( 5) rte_eth_dev_attach_secondary (+ 0x21) - sp = 0x7fffef3fec50, ip = 0x47d08b1
[bt]: ( 6) rte_eth_from_ring (+ 0x3438) - sp = 0x7fffef3fec80, ip = 0x4e49da8
[bt]: ( 7) _init (+ 0xa1b8) - sp = 0x7fffef3feec0, ip = 0x12e0368
[bt]: ( 8) local_dev_probe (+ 0xac) - sp = 0x7fffef3feef0, ip = 0x478fd2c
[bt]: ( 9) rte_uuid_unparse (+ 0x274) - sp = 0x7fffef3fef30, ip = 0x47a3e94
[bt]: (10) rte_eal_vfio_get_vf_token (+ 0xd7) - sp = 0x7fffef3ff110, ip = 0x47b04b7
[bt]: (11) eal_hugepage_info_read (+ 0x602) - sp = 0x7fffef3ff170, ip = 0x47b2cd2
[bt]: (12) start_thread (+ 0xc5) - sp = 0x7fffef3ff220, ip = 0x7ffff7bc6ea5
[bt]: (13) clone (+ 0x6d) - sp = 0x7fffef3ff2c0, ip = 0x7ffff004096d
EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket:eal_dev_mp_request
EAL: Cannot send request to primary
EAL: Failed to send hotplug request to primary
net_failsafe: Failed to probe devargs net_tap_vsc0
EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket:eal_dev_mp_request
EAL: Cannot send request to primary
EAL: Failed to send hotplug request to primary
net_failsafe: Failed to probe devargs net_tap_vsc1
EAL: No legacy callbacks, legacy socket not created
EAL: Drop mp reply: eal_dev_mp_request
==================================================================================================================
PRIMARY PROCESS
timestamp=1633598196 TCZ0.0.0 Cycle 152 (Build 1832)
signal 11 (Segmentation fault), address is 0x38 from 0x9d8fbe
[bt]: ( 1) _Z18snprintf_backtraceRPciiP9siginfo_tPv (+ 0xf4) - sp = 0x7fffecf41150, ip = 0x100dd44
[bt]: ( 2) _Z13crit_err_hdlriP9siginfo_tPv (+ 0x159) - sp = 0x7fffecf41c60, ip = 0x100e019
[bt]: ( 3) _ZN13SignalAdapter12handleSignalEiP9siginfo_tPv (+ 0x104) - sp = 0x7fffecf41f40, ip = 0xff4894
[bt]: ( 4) _L_unlock_18 (+ 0x2c) - sp = 0x7fffecf41fc0, ip = 0x7ffff61d9630
[bt]: ( 5) failsafe_eth_dev_close (+ 0x65e) - sp = 0x7fffecf42c90, ip = 0x9d8fbe
[bt]: ( 6) rte_eth_link_get_nowait (+ 0x6a) - sp = 0x7fffecf42cf0, ip = 0x62fa0a
[bt]: ( 7) _ZN11StatsThread9statsLoopEP10CustomObject (+ 0x33e) - sp = 0x7fffecf42d20, ip = 0xedea2e
[bt]: ( 8) _ZN11StatsThread9statsLoopEP10CustomObject (+ 0x8dc) - sp = 0x7fffecf42d90, ip = 0xedefcc
[bt]: ( 9) ThreadFunction (+ 0xe6) - sp = 0x7fffecf42db0, ip = 0x7ffff6b477e6
[bt]: (10) start_thread (+ 0xc5) - sp = 0x7fffecf42de0, ip = 0x7ffff61d1ea5
[bt]: (11) clone (+ 0x6d) - sp = 0x7fffecf42e80, ip = 0x7ffff0a6b96d
==================================================================================================================
DPDK 20.11.2
core mask is 00000000000000000000000000004000
DPDK Custom Process initialized with 2 ports
the min max TxQ is maxTxQueues 16
Using 1 RxQs for port 0 (# F-core=1)
Using 1 RxQs for port 3 (# F-core=1)
Core 14 (port=0, rxQ=0) kni_ring=(nil)
Core 14 (port=3, rxQ=0) kni_ring=(nil)
Core 14 txN = 0
Thread for core 14 using ring from usbc of 0x31117b29bb00
Ring size must be powers of 2, adjusting from 8196 to 16384
Thread for core 14 using ring from MEDIA of 0x31117b27b840
Encaps Memory Zone= 48044 sizeof encaps = 60
Trace Memory Zone= 272
Policy Memory Zone= 8196 sizeof policy = 240
link status for port 0 is 1
link status for port 3 is 1
PORT 0 supports 16 rx queues and 16 tx queues (driver_name = net_failsafe, driver_type = 16)
PORT 0 is polling for link-change, interrupts disabled
[DPDK] tap_flow_create(): Kernel refused TC filter rule creation (17): File exists
[DPDK] net_failsafe: Failed to create flow on sub_device 1
add_flow(): create() fails for port 0; Reason: overlapping rules or Kernel too old for flower support
Error adding broadcast flow
PORT 3 supports 16 rx queues and 16 tx queues (driver_name = net_failsafe, driver_type = 16)
PORT 3 is polling for link-change, interrupts disabled
[DPDK] EAL: Failed to hotplug add device on primary
[DPDK] tap_flow_create(): Kernel refused TC filter rule creation (17): File exists
[DPDK] net_failsafe: Failed to create flow on sub_device 1
add_flow(): create() fails for port 3; Reason: overlapping rules or Kernel too old for flower support
Error adding broadcast flow
Cmd Thread is available
Capture object initialized
init :Stats Thread is available
ifLinkUpdate: Sending OperStatus for port=0 stat=1
ifLinkUpdate: Port 0 Link Change - speed 40000 Mbps - full-duplex
[DPDK] EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket_2934_298e9db8d1:eal_dev_mp_request
[DPDK] EAL: rte_mp_request_sync failed
[DPDK] EAL: Failed to send hotplug request to secondary
[DPDK] EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket_2934_298e9db8d1:eal_dev_mp_request
[DPDK] EAL: rte_mp_request_sync failed
[DPDK] EAL: Failed to hotplug add device on primary
[DPDK] Invalid port_id=2
[DPDK] net_failsafe: Operation rte_eth_stats_get failed for sub_device 1 with error -19

There appears to be a race in the secondary process, and the primary crashed
because its data structures were only partially filled.

Let me know if you need the GDB analysis; I can share it in my next reply if
the logs above are not sufficient. The GDB analysis will be considerably longer.

Thanks!
Regards
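
To make the kind of check described in the patch concrete, here is a minimal
sketch of a guarded sub-device call. It is an illustration only, not the patch
itself: the types and names below (struct sub_device, enum sub_state,
sub_link_update, example_link_up) are simplified stand-ins invented for this
example and are not the failsafe driver's real definitions. The idea is that a
sub-device is skipped with -ENODEV when it is not yet active or when its
dev_ops pointer has not been filled in, instead of being dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the ethdev structures. */
struct link { int status; };
struct dev;
struct dev_ops {
	int (*link_update)(struct dev *dev, struct link *out);
};
struct dev { const struct dev_ops *dev_ops; };

/* Hypothetical sub-device states, ordered from least to most initialized. */
enum sub_state { SUB_PROBED, SUB_ACTIVE, SUB_STARTED };

struct sub_device {
	enum sub_state state;
	struct dev *edev;
};

/*
 * Forward link_update to a sub-device only if it is fully usable.
 * Returns -ENODEV when the sub-device is not active yet or when its
 * ops table (or the callback itself) is still NULL.
 */
static int sub_link_update(struct sub_device *sdev, struct link *out)
{
	assert(sdev != NULL && out != NULL);

	if (sdev->state < SUB_ACTIVE)
		return -ENODEV;
	if (sdev->edev == NULL || sdev->edev->dev_ops == NULL ||
	    sdev->edev->dev_ops->link_update == NULL)
		return -ENODEV;

	return sdev->edev->dev_ops->link_update(sdev->edev, out);
}

/* Example callback used only to exercise the guard. */
static int example_link_up(struct dev *dev, struct link *out)
{
	(void)dev;
	out->status = 1;
	return 0;
}
```

With a sub-device whose dev_ops is still NULL, sub_link_update() returns
-ENODEV rather than crashing; once the ops table is set (for example to one
containing example_link_up), the call is forwarded normally.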