From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <3chas3@gmail.com>
Received: from mail-io0-f195.google.com (mail-io0-f195.google.com
 [209.85.223.195]) by dpdk.org (Postfix) with ESMTP id A463E1B39E
 for <dev@dpdk.org>; Tue,  7 Nov 2017 19:12:28 +0100 (CET)
Received: by mail-io0-f195.google.com with SMTP id b186so3095481iof.8
 for <dev@dpdk.org>; Tue, 07 Nov 2017 10:12:28 -0800 (PST)
X-Received: by 10.107.69.14 with SMTP id s14mr25081619ioa.113.1510078347758;
 Tue, 07 Nov 2017 10:12:27 -0800 (PST)
MIME-Version: 1.0
Received: by 10.107.16.29 with HTTP; Tue, 7 Nov 2017 10:12:27 -0800 (PST)
In-Reply-To: <1537486.rYyyKMFLXg@xps>
References: <20171103223822.28852-1-ferruh.yigit@intel.com>
 <2004961.P5XXAOnQC2@xps>
 <CAG2-Gkk1Z3ZE3XpyK_FQVLnQb9qeY3tYnny6n9x2m2E0MkFRyA@mail.gmail.com>
 <1537486.rYyyKMFLXg@xps>
From: Chas Williams <3chas3@gmail.com>
Date: Tue, 7 Nov 2017 13:12:27 -0500
Message-ID: <CAG2-GknAx_N_dbQAcVnmSDSF01ivfP7shZUd5ewJdZK1GCsbbA@mail.gmail.com>
To: Thomas Monjalon <thomas@monjalon.net>
Cc: Ferruh Yigit <ferruh.yigit@intel.com>, dev@dpdk.org, 
 Jianfeng Tan <jianfeng.tan@intel.com>, Jingjing Wu <jingjing.wu@intel.com>, 
 Shijith Thotton <shijith.thotton@caviumnetworks.com>,
 Gregory Etelson <gregory@weka.io>, 
 Harish Patil <harish.patil@cavium.com>, George Prekas <george.prekas@epfl.ch>, 
 Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>,
 Rasesh Mody <rasesh.mody@cavium.com>, Lee Roberts <lee.roberts@hpe.com>,
 Stephen Hemminger <stephen@networkplumber.org>
Content-Type: text/plain; charset="UTF-8"
X-Content-Filtered-By: Mailman/MimeDel 2.1.15
Subject: Re: [dpdk-dev] [dpdk-stable] [PATCH v2] igb_uio: prevent reset for
 a list of devices
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Nov 2017 18:12:29 -0000

Environment: Dell PowerEdge R730, Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection shared via PCI pass-through
Host: Debian 8
Guest: custom Debian 8 with a DPDK application based on 17.11

When we shut down the guest, the host kernel panics with:

[  279.021818] Do you have a strange power saving mode enabled?
[  279.021819] Dazed and confused, but trying to continue
[  279.021847] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[  279.021849] {1}[Hardware Error]: event severity: fatal
[  279.021850] {1}[Hardware Error]:  Error 0, type: fatal
[  279.021851] {1}[Hardware Error]:   section_type: PCIe error
[  279.021852] {1}[Hardware Error]:   port_type: 0, PCIe end point
[  279.021853] {1}[Hardware Error]:   version: 1.16
[  279.021854] {1}[Hardware Error]:   command: 0x0507, status: 0x4010
[  279.021855] {1}[Hardware Error]:   device_id: 0000:03:00.0
[  279.021855] {1}[Hardware Error]:   slot: 0
[  279.021856] {1}[Hardware Error]:   secondary_bus: 0x00
[  279.021857] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10fb
[  279.021858] {1}[Hardware Error]:   class_code: 000002
[  279.021859] Kernel panic - not syncing: Fatal hardware error!
[  279.021977] sched: Unexpected reschedule of offline CPU#1!
[  279.021984] ------------[ cut here ]------------
[  279.021992] WARNING: CPU: 43 PID: 2807 at /build/linux-fHlJSJ/linux-4.12.6/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x34/0x40
[  279.021993] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c crc32c_generic nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache tun intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 ttm drm_kms_helper drm joydev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_algo_bit ipmi_si ipmi_devintf iTCO_wdt intel_cstate iTCO_vendor_support evdev intel_uncore mxm_wmi lpc_ich ipmi_msghandler mfd_core ioatdma intel_rapl_perf dcdbas pcspkr shpchp mei_me button wmi mei acpi_power_meter tpm_crb autofs4 ext4 crc16 jbd2 fscrypto mbcache sr_mod cdrom sg hid_generic usbhid hid sd_mod
[  279.022044]  crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci ehci_pci libahci ehci_hcd ixgbe libata megaraid_sas usbcore dca i40e usb_common ptp pps_core scsi_mod mdio
[  279.022060] CPU: 43 PID: 2807 Comm: revalidator85 Not tainted 4.12.0-1-amd64 #1 Debian 4.12.6-1
[  279.022061] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.3.4 11/08/2016
[  279.022062] task: ffff91d0473f7100 task.stack: ffffafef8f4a4000
[  279.022066] RIP: 0010:native_smp_send_reschedule+0x34/0x40
[  279.022067] RSP: 0018:ffffafef8f4a7c98 EFLAGS: 00010082
[  279.022069] RAX: 000000000000002e RBX: ffff91d059d24080 RCX: 0000000000000001
[  279.022070] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000046
[  279.022071] RBP: ffff91d04691d100 R08: 0000000000000000 R09: 000000000000002e
[  279.022072] R10: ffffafef8f4a7c90 R11: 00000000001cbb78 R12: ffff91d85d21ae80
[  279.022073] R13: ffff91d059d24000 R14: 0000000000000002 R15: 0000000000000008
[  279.022075] FS:  00007f726affd700(0000) GS:ffff91d85d740000(0000) knlGS:0000000000000000
[  279.022076] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  279.022077] CR2: 00007fd422a52c48 CR3: 000000042d90f000 CR4: 00000000003426e0
[  279.022078] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  279.022079] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  279.022080] Call Trace:
[  279.022086]  ? check_preempt_wakeup+0x181/0x220
[  279.022091]  ? check_preempt_curr+0x74/0x80
[  279.022094]  ? ttwu_do_wakeup+0x19/0x140
[  279.022098]  ? try_to_wake_up+0x1b8/0x470
[  279.022101]  ? wake_up_q+0x3f/0x70
[  279.022106]  ? futex_wake+0x15a/0x170
[  279.022108]  ? do_futex+0x2df/0xa90
[  279.022111]  ? SyS_futex+0x7a/0x170
[  279.022113]  ? SyS_read+0x76/0xc0
[  279.022118]  ? system_call_fast_compare_end+0xc/0x97
[  279.022119] Code: a3 05 51 fb cc 00 73 15 48 8b 05 28 74 a3 00 be fd 00 00 00 48 8b 80 a0 00 00 00 ff e0 89 fe 48 c7 c7 88 5c de b6 e8 e2 c9 13 00 <0f> ff c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 05 5d 00
[  279.022151] ---[ end trace eddc980dc8648163 ]---
[  279.454274] Kernel Offset: 0x35400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

According to the test engineer, this doesn't happen if we use SR-IOV (which
makes sense, since then the physical device isn't directly shared between the
guest and the host).  If I remove the pci_reset_function() call from igb_uio's
.release handler, the problem goes away.


On Tue, Nov 7, 2017 at 8:02 AM, Thomas Monjalon <thomas@monjalon.net> wrote:

> 07/11/2017 12:50, Chas Williams:
> > We still have an issue with this and PCI pass-through.  If a guest is
> > restarted while using PCI pass-through and igb_uio issues a
> > pci_reset_function(), this causes the host to crash.
>
> Please, could you better explain the exact scenario and the cause of the
> crash?
> Thanks
>
>