DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Hyong Youb Kim (hyonkim)" <hyonkim@cisco.com>
To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>,
	Thomas Monjalon <thomas@monjalon.net>
Cc: David Marchand <david.marchand@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	 "anatoly.burakov@intel.com" <anatoly.burakov@intel.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"maxime.coquelin@redhat.com" <maxime.coquelin@redhat.com>,
	"stephen@networkplumber.org" <stephen@networkplumber.org>,
	"igor.russkikh@aquantia.com" <igor.russkikh@aquantia.com>,
	"pavel.belous@aquantia.com" <pavel.belous@aquantia.com>,
	"allain.legacy@windriver.com" <allain.legacy@windriver.com>,
	"matt.peters@windriver.com" <matt.peters@windriver.com>,
	"ravi1.kumar@amd.com" <ravi1.kumar@amd.com>,
	Rasesh Mody <rmody@marvell.com>,
	Shahed Shaikh <shshaikh@marvell.com>,
	"ajit.khaparde@broadcom.com" <ajit.khaparde@broadcom.com>,
	"somnath.kotur@broadcom.com" <somnath.kotur@broadcom.com>,
	"hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>,
	"shreyansh.jain@nxp.com" <shreyansh.jain@nxp.com>,
	"wenzhuo.lu@intel.com" <wenzhuo.lu@intel.com>,
	"mw@semihalf.com" <mw@semihalf.com>,
	"mk@semihalf.com" <mk@semihalf.com>,
	"gtzalik@amazon.com" <gtzalik@amazon.com>,
	"evgenys@amazon.com" <evgenys@amazon.com>,
	"John Daley (johndale)" <johndale@cisco.com>,
	"qi.z.zhang@intel.com" <qi.z.zhang@intel.com>,
	"xiao.w.wang@intel.com" <xiao.w.wang@intel.com>,
	"xuanziyang2@huawei.com" <xuanziyang2@huawei.com>,
	"cloud.wangxiaoyun@huawei.com" <cloud.wangxiaoyun@huawei.com>,
	"zhouguoyang@huawei.com" <zhouguoyang@huawei.com>,
	"beilei.xing@intel.com" <beilei.xing@intel.com>,
	"jingjing.wu@intel.com" <jingjing.wu@intel.com>,
	"qiming.yang@intel.com" <qiming.yang@intel.com>,
	"konstantin.ananyev@intel.com" <konstantin.ananyev@intel.com>,
	"alejandro.lucero@netronome.com" <alejandro.lucero@netronome.com>,
	"arybchenko@solarflare.com" <arybchenko@solarflare.com>,
	"tiwei.bie@intel.com" <tiwei.bie@intel.com>,
	"zhihong.wang@intel.com" <zhihong.wang@intel.com>,
	"yongwang@vmware.com" <yongwang@vmware.com>,
	"stable@dpdk.org" <stable@dpdk.org>,
	Nithin Kumar Dabilpuram <ndabilpuram@marvell.com>
Subject: Re: [dpdk-dev] [PATCH] vfio: fix interrupts race condition
Date: Mon, 15 Jul 2019 07:20:03 +0000	[thread overview]
Message-ID: <MWHPR11MB1839D64DABB4FD4F6F863DE8BFCF0@MWHPR11MB1839.namprd11.prod.outlook.com> (raw)
In-Reply-To: <BYAPR18MB2424568FDD38F18548CC5B9CC8CF0@BYAPR18MB2424.namprd18.prod.outlook.com>

> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Monday, July 15, 2019 2:35 PM
[...]
> Subject: RE: [dpdk-dev] [PATCH] vfio: fix interrupts race condition
> 
> > > >
> > > > This is a real bug which should be fixed in this release.
> > > > As the patch is quite big and needs a strong validation, I prefer
> > > > merging it quickly to give a lot of time before releasing 19.08-rc2.
> > > > The maintainers of all concerned PMDs are Cc.
> > > > Please make sure the interrupts are still working well with VFIO.
> > > >
> > > > Applied, thanks
> > > >
> > >
> > > [Apologies in advance if email format gets messed up. Forced to use
> > > outlook for the first time..]
> > >
> > > Hi,
> > >
> > > This commit breaks MSI-X + rxq interrupts. I think others are seeing
> > > the same error?
> > >
> > > sudo ~/dpdk/examples/l3fwd-power/build/l3fwd-power \ -c 0x1e -n 4 -
> w
> > > 0000:1a:00.0 --log-level=pmd,debug -- -p 0x1 -P --config
> > "(0,0,2),(0,1,3),(0,2,4)"
> > > [...]
> > > EAL: Error enabling MSI-X interrupts for fd 35
> > >
> > > A rough sequence of events goes like this. The above test is using 3
> > > rxqs (3 interrupts).
> > >
> > > 1. During probe, pci_vfio_setup_interrupts() runs.
> > > This now does ioctl(VFIO_DEVICE_SET_IRQS) for the 1st efd
> > > (intr_handle->fd).
> > >
> > > ioctl does:
> > > - pci_enable_msix(1 vector) because this is the first time enabling
> > >   interrupts.
> > > - request_irq(vector 0)
> > >
> > > 2. App configs
> > > The app sets port_conf.intr_conf.rxq=1, configs 3 rxqs, etc.
> > >
> > > 3. rte_eth_dev_start()
> > > PMD calls:
> > > - rte_intr_efd_enable()
> > >   This creates 3 efds (intr_handle->nb_efd = 3).
> > > - rte_intr_enable() => vfio_enable_msix()
> > >   This does ioctl(VFIO_DEVICE_SET_IRQS) for the 3 efds.
> > >
> > > ioctl now needs to request_irq() for vectors 1, 2, 3 for the 3 new
> > > efds. It does not do another pci_enable_msix() as it has been done
> > > earlier. Before calling request_irq(), it sees that only 1 vector was
> > > enabled in earlier pci_enable_msix(), so it fails with EINVAL.
> > >
> > > We would need pci_enable_msix(4 vectors) for this to work
> > > (intr_handle->fd + 3 efds).
> > >
> > > Prior to this patch, VFIO_DEVICE_SET_IRQS is done only in
> > > vfio_enable_msix(). So, ioctl ends up doing pci_enable_msix(4 vectors)
> > > and request_irq() for each of the 4 efds, which completes
> > > successfully.
> > >
> > > Not an expert in this area.. Perhaps, defer enabling 1st efd
> > > (intr_handle->fd) until the first invocation of vfio_enable_msix(), so
> > > it knows the app wants to use 4 vectors in total?
> > >
> > > Also, vfio_disable_msix() looks a bit wrong.
> > >
> > >         irq_set.flags = VFIO_IRQ_SET_DATA_NONE |
> > VFIO_IRQ_SET_ACTION_TRIGGER;
> > >         irq_set.index = VFIO_PCI_MSIX_IRQ_INDEX;
> > >         irq_set.start = RTE_INTR_VEC_RXTX_OFFSET;
> > >         irq_set.count = intr_handle->nb_efd;
> > >
> > > This tells vfio-pci to simulate interrupts by triggering efds? To
> > > free_irq() specific efds, I think we need DATA_EVENTFD and set fd =
> > > -1.
> > >
> > > flags = DATA_EVENTFD | ACTION_TRIGGER
> > > data = [fd(-1), fd(-1), ...]
> > >
> > > I have not tested this part myself yet.
> 
> 
> We do see the following failure[1] on octeontx2 PMD with this patch.
> We will try to find a fix.
> 
>         irq_set = (struct vfio_irq_set *)irq_set_buf;
>         irq_set->argsz = len;
>         irq_set->start = 0;
>         irq_set->count = intr_handle->max_intr;
>         irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>                         VFIO_IRQ_SET_ACTION_TRIGGER;
>         irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
> 
>         fd_ptr = (int32_t *)&irq_set->data[0];
>         for (i = 0; i < irq_set->count; i++)
>                 fd_ptr[i] = -1;
> 
>         rc = ioctl(intr_handle->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
>         if (rc)
>                 otx2_err("Failed to set irqs vector rc=%d", rc);
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^[1]
> 
> 
> 
> 
> 
> >
> > Thanks for your detailed report Hyong.
> > Would you be able to propose a fix?
> >

Did more digging. Another lengthy email..

It feels tricky to fix the problem properly, as it is getting late..
Here is a recap of the problem, as best I can.

1. INTx:
Both vfio-pci and igb_uio mask the interrupt (i.e. write 1 to
Interrupt Disable in PCI config) and then trigger callback to PMD.
PMD needs to unmask the interrupt when exiting its irq callback. This
is currently achieved by calling rte_intr_enable. Several PMDs use
this pattern.

irq_callback() {
  take_action()
  rte_intr_enable()
}

2. MSI/MSI-X:
No automatic masking done within vfio-pci/igb_uio before triggering
callback to PMD. No need to "re-enable" interrupt by calling
rte_intr_enable. If we forget INTx, we can simply remove
rte_intr_enable from all PMD irq callbacks..

The current vfio commit effectively turns rte_intr_enable into no-op
for MSI/MSI-X (ignore rxq interrupts for now).. So it is equivalent to
removing rte_intr_enable from irq callback.

Prior to this commit, rte_intr_enable ends up re-doing irq setup:
free_irq() -> request_irq(). In the bugzilla issue (qede), an
interrupt arrives in between these, and gets "lost", which causes
something that is waiting for it to timeout, etc. In the bz, note that
INTx has no issues, probably because it is level-triggered.


Now, about this commit (beware unorganized thoughts from me)..

I think David wants to turn rte_intr_enable/disable into unmask/mask,
and avoid free_irq()/request_irq() "post probe". 3 cases to
consider...

1. INTx:
Mission accomplished via ACTION_(UN)MASK.

2. MSI/MSI-X and 1st vector (intr_handle->fd):
rte_intr_enable/disable is now no-op. This is not quite right, since
interrupt remains enabled even after a call to rte_intr_disable. For
MSI/MSI-X, ACTION_(UN)MASK is no-op (unimpl) in vfio-pci, so no way to
mask. It's been that way ever since, as far as I can tell.

Prior to this commit, rte_intr_disable() does free_irq(), so interrupt
does get disabled.

3. MSI/MSI-X and rxq vectors (intr_handle->efds):
Broken as reported earlier.


If we limit scope to only qede, then a variation of David's earlier
patch (self NACKed) would be sufficient.

http://patchwork.dpdk.org/patch/55310/

qede has separate handlers for INTx and MSI/MSI-X. So, just need to
remove rte_intr_enable() from the MSI/MSI-X handler.

--- a/drivers/net/qede/qede_ethdev.c
+++ b/drivers/net/qede/qede_ethdev.c
@@ -261,8 +261,6 @@ qede_interrupt_handler(void *param)
        struct ecore_dev *edev = &qdev->edev;

        qede_interrupt_action(ECORE_LEADING_HWFN(edev));
-       if (rte_intr_enable(eth_dev->intr_handle))
-               DP_ERR(edev, "rte_intr_enable failed\n");
 }


Trying to see if the following works. Do not have a patch yet.
- Revert pci_vfio.c so we do not enable interrupt during probe
- In eal_interrupts.c
  - Add state bit so we know if interrupt is enabled
  - For INTx, if enabled, use David's code to mask/unmask
  - For MSI/MSI-X, if enabled, do not enable again
    (i.e. do not do VFIO_DEVICE_SET_IRQS)


Jerin or others, do not let me stop you. Kinda reluctant to be the
owner of this issue at the moment :-)

Thank you.
-Hyong


      reply	other threads:[~2019-07-15  7:20 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-02 12:48 [dpdk-dev] [RFC PATCH] vfio: move eventfd/interrupt pairing at setup time David Marchand
2019-07-09 13:40 ` [dpdk-dev] [EXT] " Shahed Shaikh
2019-07-10 12:33 ` [dpdk-dev] [PATCH] vfio: fix interrupts race condition David Marchand
2019-07-10 21:20   ` Thomas Monjalon
2019-07-14  5:10     ` Hyong Youb Kim (hyonkim)
2019-07-14 11:19       ` Thomas Monjalon
2019-07-15  5:35         ` Jerin Jacob Kollanukkaran
2019-07-15  7:20           ` Hyong Youb Kim (hyonkim) [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MWHPR11MB1839D64DABB4FD4F6F863DE8BFCF0@MWHPR11MB1839.namprd11.prod.outlook.com \
    --to=hyonkim@cisco.com \
    --cc=ajit.khaparde@broadcom.com \
    --cc=alejandro.lucero@netronome.com \
    --cc=alex.williamson@redhat.com \
    --cc=allain.legacy@windriver.com \
    --cc=anatoly.burakov@intel.com \
    --cc=arybchenko@solarflare.com \
    --cc=beilei.xing@intel.com \
    --cc=cloud.wangxiaoyun@huawei.com \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=evgenys@amazon.com \
    --cc=gtzalik@amazon.com \
    --cc=hemant.agrawal@nxp.com \
    --cc=igor.russkikh@aquantia.com \
    --cc=jerinj@marvell.com \
    --cc=jingjing.wu@intel.com \
    --cc=johndale@cisco.com \
    --cc=konstantin.ananyev@intel.com \
    --cc=matt.peters@windriver.com \
    --cc=maxime.coquelin@redhat.com \
    --cc=mk@semihalf.com \
    --cc=mw@semihalf.com \
    --cc=ndabilpuram@marvell.com \
    --cc=pavel.belous@aquantia.com \
    --cc=qi.z.zhang@intel.com \
    --cc=qiming.yang@intel.com \
    --cc=ravi1.kumar@amd.com \
    --cc=rmody@marvell.com \
    --cc=shreyansh.jain@nxp.com \
    --cc=shshaikh@marvell.com \
    --cc=somnath.kotur@broadcom.com \
    --cc=stable@dpdk.org \
    --cc=stephen@networkplumber.org \
    --cc=thomas@monjalon.net \
    --cc=tiwei.bie@intel.com \
    --cc=wenzhuo.lu@intel.com \
    --cc=xiao.w.wang@intel.com \
    --cc=xuanziyang2@huawei.com \
    --cc=yongwang@vmware.com \
    --cc=zhihong.wang@intel.com \
    --cc=zhouguoyang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).