From mboxrd@z Thu Jan 1 00:00:00 1970
To: Matan Azrad, "Yigit, Ferruh", "dev@dpdk.org", Bernard Iremonger
Cc: Gaetan Rivet, Thomas Monjalon, "stable@dpdk.org", David Marchand, Jeff Guo, Qi Zhang
References: <1573548459-6931-1-git-send-email-matan@mellanox.com>
 <1573548459-6931-2-git-send-email-matan@mellanox.com>
 <19a86d69-9bcc-42c9-b000-98b3860de42f@intel.com>
 <4df501fa-06d8-744d-27cd-f5742992e109@intel.com>
 <89e4cfea-ccfa-c7c1-c201-6ca3ff845267@intel.com>
 <33a6aa15-28e4-e770-204c-25ca230ca653@intel.com>
From: Ferruh Yigit
Message-ID: <200f3f01-fedb-b795-a733-e135957e8e99@intel.com>
Date: Wed, 12 Feb 2020 13:49:49 +0000
Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
List-Id: patches for DPDK stable branches

On 2/3/2020 5:10 PM, Matan Azrad wrote:
>
> Hi
>
> From: Ferruh Yigit
>> On 1/25/2020 6:56 PM, Matan Azrad wrote:
>>> Hi Ferruh
>>>
>>> From: Ferruh Yigit
>>>> On 1/23/2020 7:25 PM, Matan Azrad wrote:
>>>>> Hi
>>>>>
>>>>> From: Ferruh Yigit
>>>>>> On 1/23/2020 3:29 PM, Matan Azrad wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> From: Ferruh Yigit
>>>>>>>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> From: Yigit, Ferruh
>>>>>>>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
>>>>>>>>>>> The port was not validated before detaching.
>>>>>>>>>>>
>>>>>>>>>>> Ignore port detach operation when the port is not valid.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
>>>>>>>>>>> Cc: thomas@monjalon.net
>>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matan Azrad
>>>>>>>>>>> ---
>>>>>>>>>>>  app/test-pmd/testpmd.c | 3 +++
>>>>>>>>>>>  1 file changed, 3 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
>>>>>>>>>>> index 4444346..370eefe 100644
>>>>>>>>>>> --- a/app/test-pmd/testpmd.c
>>>>>>>>>>> +++ b/app/test-pmd/testpmd.c
>>>>>>>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
>>>>>>>>>>>
>>>>>>>>>>>      printf("Removing a device...\n");
>>>>>>>>>>>
>>>>>>>>>>> +    if (port_id_is_invalid(port_id, ENABLED_WARN))
>>>>>>>>>>> +        return;
>>>>>>>>>>> +
>>>>>>>>>>>      dev = rte_eth_devices[port_id].device;
>>>>>>>>>>>      if (dev == NULL) {
>>>>>>>>>>>          printf("Device already removed\n");
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The patch is already in 19.11 [1] but it is breaking the
>>>>>>>>>> testpmd hotplug support.
>>>>>>>>>> Before 'detach_port_device()' called, the port has been stopped
>>>>>>>>>> and closed [2], which will make port fail from 'port_id_is_invalid()'
>>>>>>>>>> check and the device removal path never fully called.
>>>>>>>>>> The implication is, since device not detached, vfio request
>>>>>>>>>> interrupt keeps triggered continuously and re-starts the detach
>>>>>>>>>> path, but because of the half cleaned device it fails and app
>>>>>>>>>> gets stuck with a continuous log [3].
>>>>>>>>>>
>>>>>>>>>> I wonder if the actual hotplug has been tested with this patch,
>>>>>>>>>> the commit log is not clear about the motivation and
>>>>>>>>>> implication of the patch, I am not clear why this check is
>>>>>>>>>> added but I am sending a patch soon to remove it back.
>>>>>>>>>
>>>>>>>>> The motivation of this patch was to prevent double detach on
>>>>>>>>> same port, so the user cannot call detach of invalid port.
>>>>>>>>
>>>>>>>> What is the definition of the 'invalid port', if you mean device
>>>>>>>> already detached case, in the second call of the function
>>>>>>>> "if (dev == NULL)" check should prevent it going forward.
>>>>>>>
>>>>>>> No, ethdev doesn't zero the device pointer when it release a port.
>>>>>>
>>>>>> As far as I can see it does, please see below.
>>>>>
>>>>> The code below is problematic because:
>>>>>
>>>>> 1. It is very bad that the application changing ethdev structure directly.
>>>>
>>>> Where the application is changing the ethdev structure?
>>>
>>> See it in the function we talk on:
>>> rte_eth_devices[sibling].device = NULL;
>>>
>>> The application shouldn't do it - it should be done only by ethdev lib or by the PMDs.
>>>
>>> Are you agree here?
>>
>> This is really no fun :(
>>
>> It is not done by application, I already provided the call trace. This is done by
>> the path of driver .remove().
>
> Yes, probably, but also by testpmd application, I copied it from testpmd application.
>
> Don't you see it?
>
>>>
>>>> Application calls the 'rte_dev_remove()' API, which does the job.
>>>
>>> Agree, This function is freeing(rte_free) the rte_device (actually
>>> makes the rte_eth_devices[sibling].device pointer dangled) and releases
>>> its related resources what makes the device detached.
>>
>> No it doesn't, I provided full call stack, and showed where the value set to
>> NULL.
>
> See again the testpmd function - it does it too.
>
>>>
>>>>> 2. The below code run over valid port only, not on invalid
>>>>> port(UNUSED state).
>>>>>
>>>>> So, the device pointer will still be valid if the port is invalid.
>>>>>
>>>>> All of this shows that this function try to detach only a valid port
>>>>> (probably mainly because it is called by Testpmd detach command).
>>>>>
>>>>>>> So even if the port is in unused state already - means invalid,
>>>>>>> the device pointer still may be valid and point to the last port
>>>>>>> that used the same id.
>>>>>>
>>>>>> If the port is closed, it is unused state, and ethdev layer
>>>>>> resources freed but as you said device related structures are still
>>>>>> there, device pointer is still valid and it is still in probed
>>>>>> device list etc.. We need to able to detach the device even after it is
>>>>>> unused state.
>>>>>
>>>>> Yes, but detach is for device, not for port.
>>>>> The device pointer must be taken only when the port is in valid state.
>>>>> Why?
>>>>> Because if the port is in UNUSED state it is free to be allocated
>>>>> again by ethdev layer for other device, then, the device pointer may
>>>>> point to other device.
>>>>>
>>>
>>> Do you agree on the above statement I wrote?
>>>
>>>>>> "stop -> close -> detach" is a normal order, we shouldn't prevent
>>>>>> it, but your check does prevent it.
>>>>>
>>>>> Yes, this is good order, but the pointer of the device should be
>>>>> taken before close.
>>>>> My patch prevent accessing invalid structure.
>>>>
>>>> The ethdev close() dev_ops, frees ethdev related resources, the
>>>> rte_device is still valid in that struct.
>>>
>>> That’s exactly my concern.
>>> I think you wrong here, the rte_device may be invalid in that struct,
>>> especially after close():
>>>
>>> When the port ID is closed and released, its ethdev structure moves to
>>> UNUSED state.
>>> When an ethdev structure is in UNUSED state it may be attached again to
>>> another rte_device - see function rte_eth_dev_allocate.
>>> Are you agree here?
>>>
>>> In this case, when a new device is attached after close() and before
>>> detach_port_device() we may remove wrong rte_device and cause a lot of
>>> problems.
>>
>> The problem here is re-using the ethdev structure when it is closed but not
>> freed completely, resulting overwriting some fields of it. This is another issue
>> and can be fixed in the alloc path.
>
> Sorry, don't agree with you here.
> Port which is closed can be allocated again for other device - this is the basic for hot-plug mechanism in dpdk.
> Reading the rte_device from port which was closed may remove other rte_device which is not related.
>
> Agree that the PMD should clear the ethdev structure in remove, mlx5 doesn't do it and should be fixed, I don't know about other PMDS.
> But this is not the issue I talk about.
>
> Testpmd shouldn't read device pointer from port which was closed - this is race.
>
>>>
>>> Do you understand that?
>>>
>>> One more problematic case is a user mistake by the Testpmd command
>>> which may cause segfault in the good case and memory overriding in the
>>> worst case (my patch case):
>>>
>>> port stop all
>>> port detach 0
>>> port detach 0
>>>
>>> detach the same port twice will cause referencing of freed pointer of
>>> rte_device.
>>>
>>>
>>> All of that is because Testpmd takes ethdev structure information from
>>> invalid ethdev structure.
>>>
>>> My patch prevents it.
>>
>> For this case I am already getting "Device already removed" message from
>> 'detach_port_device()' function.
>>
>> Your patch is doing two things:
>> - Hiding the fact that PMD .remove() is not setting the device pointer to null
>
> The device pointer is zero also by testpmd - the hiding is here.
>
>> - Breaking the hotplug functionality
>
> To be precise - stay it broken.
>
>>
>>>
>>>
>>>
>>>> And yes your patch prevents accessing them and prevents hotplug
>>>> remove the device.
>>>>
>>>
>>> Yes, my patch is not good, solved issues and caused a new one.
>>>
>>> Agree that we need a new fix, my suggestion here is:
>>>
>>> 1. In the Testpmd internal management for hutplug (rmv_port_callback):
>>>      Call stop()
>>>      Take rte_device pointer( before port close).
>>>      Call close().
>>>      If no other valid port for the rte_device:
>>>          call detach() by the saved rte_device pointer.
>>
>> Not sure about pushing more to the application, like checking if any other
>> port using a device etc..
>
> And for device pointer before close(), do you agree?
>
>> As far as I understand your concern is when multiple ethdev are using same
>> device, why not handle this in driver .remove() path, like detect if device still
>> needs to be used and if so free only ethdev resources and return error, this
>> error will prevent device resources to be freed:
>>
>> pci_unplug()
>>   ret = rte_pci_detach_dev(pdev);
>>   if (ret == 0)
>>     rte_pci_remove_device(pdev);
>>     rte_devargs_remove(dev->devargs);
>>     ...
>>
>> This will cause the application receive an error but this is kind of true because
>> all resources are not freed because they are shared.
>>
>> When last ethdev detached, driver can send success causing all device
>> resources to be freed.
>
> Can be good for multi-port handling, but testpmd should handle this error and report it correctly.
>
>>> 2. Replace the Testpmd command line for "port detach" with
>>> "detach [rte device name]":
>>> Why?
>>> Detach by port is problematic:
>>> 1. If the port is closed - Testpmd cannot get its rte_device from the
>>> related ethdev port structure.
>>> 2. If the port is not closed - It is not safe to detach it.
>>> 3. Attach is done by rte_device name, detach should be in same way.
>>
>> Testpmd can first close() later detach().
>
> Yes, close by port, detach by rte_device name (for example pci name).
> That’s what I said.
>
>
>> If it is closed already, agreed that new attached devices shouldn't be able to
>> this struct until it is freed completely. But this is kind of edge case, because it
>> required new device to be attached after old one closed but before it is
>> detached.
>>
>>> Are you agree?
>
> This is race, no edge.
> What is "freed completely"?
> IMO it is when the port is in UNUSED state (after close\release).
>
> Hotplug can be triggered internally in parallel.
>
>>>
>>> I hope you understand now.
>>>
>>>>> And yes, Testpmd detach stays broken after my patch and after this
>>>>> patch too.
>>>>>
>>>>>
>>>>>>
>
>>>> To simplify things, can you please clarify what error are you getting
>>>> with this patch, and can you please give some details how to
>>>> reproduce it? So I can debug the issue you are having.
>>>
>>> Added details above, hope everything is clear when you read this line
>>> 😊
>>
>> Overall I believe this all fuss is about the PMD you are testing not cleaning the
>> 'rte_eth_devices[port_id].device' pointer which should be handled in driver
>> level but you are trying to fix this in testpmd causing it fail.
>
> Sorry, but no, It is all about hotplug race.
>
> Even if the PMD clear the device pointer, the testpmd still may release wrong rte_device.

Yes it may, although that is less likely to occur: it requires a new device to
be hot-added between close() and detach of the other device.

Would you agree that there are two problems:

1) When testpmd closes a port, a newly attached port can re-use it and
overwrite some of its fields, so relying on the data structures of the closed
port is not safe.

2) A PMD that does not clear the ethdev->device pointer in .remove() may cause
issues when a port is detached twice.

For (1) I suggest fixing it in the attach path: don't re-use an eth_dev port id
unless it is completely freed. This may need a new state. Does it make sense?

For (2), PMDs that want hotplug support need to fix it.
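
To make the suggestion for (1) concrete, here is a minimal toy sketch of the
idea, not ethdev code. Everything in it is an assumption made up for
illustration: the toy_* types and functions, and the state name
PORT_CLOSED_PENDING_DETACH, do not exist in DPDK. The only point is that the
attach path skips a closed-but-not-yet-detached slot, so the device pointer
read at detach time still belongs to the original device, and detach clears
the pointer so a second detach finds nothing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 4

/* Toy stand-in for rte_device; hypothetical, not the real DPDK structure. */
struct toy_device {
	const char *name;
};

/* PORT_CLOSED_PENDING_DETACH is the hypothetical "new state" from (1). */
enum toy_port_state {
	PORT_UNUSED = 0,            /* completely freed, may be re-allocated */
	PORT_ATTACHED,              /* in use by a device */
	PORT_CLOSED_PENDING_DETACH, /* closed, device not detached yet */
};

struct toy_port {
	enum toy_port_state state;
	struct toy_device *device;
};

/* Zero-initialized, so every slot starts as PORT_UNUSED. */
static struct toy_port ports[MAX_PORTS];

/* Attach path: hand out only slots that are completely freed. */
int toy_port_attach(struct toy_device *dev)
{
	for (int id = 0; id < MAX_PORTS; id++) {
		if (ports[id].state == PORT_UNUSED) {
			ports[id].state = PORT_ATTACHED;
			ports[id].device = dev;
			return id;
		}
	}
	return -1; /* no free slot */
}

/* close(): the port stops being usable, but its slot stays reserved. */
void toy_port_close(int id)
{
	assert(id >= 0 && id < MAX_PORTS);
	if (ports[id].state == PORT_ATTACHED)
		ports[id].state = PORT_CLOSED_PENDING_DETACH;
}

/*
 * detach(): returns the device to remove, or NULL if the port is not
 * waiting for detach. The pointer is cleared, which covers the double
 * detach case from (2), and only now the slot becomes re-usable.
 */
struct toy_device *toy_port_detach(int id)
{
	assert(id >= 0 && id < MAX_PORTS);
	if (ports[id].state != PORT_CLOSED_PENDING_DETACH)
		return NULL;

	struct toy_device *dev = ports[id].device;

	ports[id].device = NULL;
	ports[id].state = PORT_UNUSED;
	return dev;
}
```

With this model, attaching a second device between close and detach of the
first one gets a different port id, and the second "port detach" of the same
port returns NULL instead of touching a stale pointer.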