[dpdk-dev] virtio kills qemu VM after stopping/starting ports

DPDK patches and discussions
 help / color / mirror / Atom feed

* [dpdk-dev] virtio kills qemu VM after stopping/starting ports
@ 2016-09-01 20:53 Kyle Larose
  2016-09-02  6:55 ` Tan, Jianfeng
  2016-09-05  4:10 ` Yuanhan Liu
  0 siblings, 2 replies; 7+ messages in thread
From: Kyle Larose @ 2016-09-01 20:53 UTC (permalink / raw)
  To: dev; +Cc: huawei.xie, yuanhan.liu, jianfeng.tan

Hello everyone,

In my own testing, I recently stumbled across an issue where I could get qemu to exit when sending traffic to my application. To do this, I simply needed to do the following:

1) Start my virtio interfaces
2) Send some traffic into/out of the interfaces
3) Stop the interfaces
4) Start the interfaces
5) Send some more traffic

At this point, I would lose connectivity to my VM.  Further investigation revealed qemu exiting with the following log:

	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index from 5 to 1

I found the following bug report against qemu, reported by a user of DPDK: https://bugs.launchpad.net/qemu/+bug/1558175

That thread seems to have stalled out, so I think we probably should deal with the problem within DPDK itself. Either way, later in the bug report chain, we see a link to this patch to DPDK: http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of the bug report claims that this patch fixes the problem. Perhaps it does. However, it introduces a new problem: If I remove the patch, I cannot reproduce the problem. So, that leads me to believe that it has caused a regression.

To summarize the patch’s changes, it basically changes the virtio_dev_stop function to flag the device as stopped, and stops the device when closing/uninitializing it. However, there is a seemingly unintended side-effect. 

In virtio_dev_start, we have the following block of code:

	/* On restart after stop do not touch queues */
	if (hw->started)
		return 0;

	/* Do final configuration before rx/tx engine starts */
	virtio_dev_rxtx_start(dev);

….

Prior to the patch, if an interface were stopped then started, without restarting the application, the queues would be left as-is, because hw->started would be set to 1. Now, calling stop sets hw->started to 0, which means the next call to start will “touch the queues”. This is the unintended side-effect that causes the problem.

I made a change locally to break the state of the device into two: started and opened. The devices starts out neither started nor opened. If the device is accepting packets, it is started. If the device has set up its queues, it is opened. Stopping the device does not close the device. This allows me to change the check above to:

 	if (hw->opened) {
		hw->started=1
		return 0;
	}

Then, if I stop and start the device, it does not reinitialize the queues. I have no problem. I can restart ports as much as I want, and the system keeps running. Traffic flows when they’ve restarted as well, which is always a plus. ☺

Some background:
- I tested against DPDK 16.04 and DPDK 16.07.
- I’m using virtio NICs:
- CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
- Host OS: CentOS Linux release 7.1.1503 (Core)
- Guest OS: CentOS Linux release 7.2.1511 (Core)
- Qemu-kvm version: 1.5.3-86.el7_1.6

I plan on submitting a patch to fix this tomorrow. Let me know if anyone has any thoughts about this, or a better way to fix it.

Thanks,

Kyle

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
  2016-09-01 20:53 [dpdk-dev] virtio kills qemu VM after stopping/starting ports Kyle Larose
@ 2016-09-02  6:55 ` Tan, Jianfeng
  2016-09-02 12:35   ` Kyle Larose
  2016-09-05  4:10 ` Yuanhan Liu
  1 sibling, 1 reply; 7+ messages in thread
From: Tan, Jianfeng @ 2016-09-02  6:55 UTC (permalink / raw)
  To: Kyle Larose, dev; +Cc: huawei.xie, yuanhan.liu

Hi Kyle,


On 9/2/2016 4:53 AM, Kyle Larose wrote:
> Hello everyone,
>
> In my own testing, I recently stumbled across an issue where I could get qemu to exit when sending traffic to my application. To do this, I simply needed to do the following:
>
> 1) Start my virtio interfaces
> 2) Send some traffic into/out of the interfaces
> 3) Stop the interfaces
> 4) Start the interfaces
> 5) Send some more traffic
>
> At this point, I would lose connectivity to my VM.  Further investigation revealed qemu exiting with the following log:
>
> 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index from 5 to 1
>
> I found the following bug report against qemu, reported by a user of DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
>
> That thread seems to have stalled out, so I think we probably should deal with the problem within DPDK itself. Either way, later in the bug report chain, we see a link to this patch to DPDK: http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of the bug report claims that this patch fixes the problem. Perhaps it does.

I once got a chance to analyze the bug you referred here. The patch 
(http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix 
that bug. The root cause of that bug is: when DPDK appilcation gets 
killed, nobody tells the vhost backend to stop. So when restarting the 
DPDK app, those hugepages are reused, and initialized to all zero. And 
unavoidably, "idx" in those memory is changed to 0. I have written a 
patch to notify the vhost backend to stop when DPDK app gets killed 
suddenly (more accurate, when /dev/uioX gets closed), and this patch 
will be sent out recently. And this patch does not fix your problem, 
either. You did not kill the app. (I should not update info about that 
bug here, and I mean they are different problems)

> However, it introduces a new problem: If I remove the patch, I cannot reproduce the problem. So, that leads me to believe that it has caused a regression.
>
> To summarize the patch’s changes, it basically changes the virtio_dev_stop function to flag the device as stopped, and stops the device when closing/uninitializing it. However, there is a seemingly unintended side-effect.
>
> In virtio_dev_start, we have the following block of code:
>
> 	/* On restart after stop do not touch queues */
> 	if (hw->started)
> 		return 0;
>
> 	/* Do final configuration before rx/tx engine starts */
> 	virtio_dev_rxtx_start(dev);
>
> ….
>
> Prior to the patch, if an interface were stopped then started, without restarting the application, the queues would be left as-is, because hw->started would be set to 1.

Yes, my previous patch did break this behavior (by stopping and 
re-starting the device, the queues would be left as-is) and leads to the 
problem here. And you are proposing to recover.

But is this a right behavior to follow?
On the one side, when we stop the virtio device, should we notify the 
vhost backend to stop too? Currently, we just flag it as stopped.
On the other side, now in many PMDs, we mix the dev 
initialization/configuration code into dev_start functions, that is to 
day, we re-configure the device every time we start it (may be to make 
sure that changed configuration will be configured). Then what if we 
increase the queue numbers of virtio?

Thanks,
Jianfeng


>   Now, calling stop sets hw->started to 0, which means the next call to start will “touch the queues”. This is the unintended side-effect that causes the problem.
>
> I made a change locally to break the state of the device into two: started and opened. The devices starts out neither started nor opened. If the device is accepting packets, it is started. If the device has set up its queues, it is opened. Stopping the device does not close the device. This allows me to change the check above to:
>
>   	if (hw->opened) {
> 		hw->started=1
> 		return 0;
> 	}
>
> Then, if I stop and start the device, it does not reinitialize the queues. I have no problem. I can restart ports as much as I want, and the system keeps running. Traffic flows when they’ve restarted as well, which is always a plus. ☺
>
> Some background:
> - I tested against DPDK 16.04 and DPDK 16.07.
> - I’m using virtio NICs:
> - CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> - Host OS: CentOS Linux release 7.1.1503 (Core)
> - Guest OS: CentOS Linux release 7.2.1511 (Core)
> - Qemu-kvm version: 1.5.3-86.el7_1.6
>
> I plan on submitting a patch to fix this tomorrow. Let me know if anyone has any thoughts about this, or a better way to fix it.
>
> Thanks,
>
> Kyle

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
  2016-09-02  6:55 ` Tan, Jianfeng
@ 2016-09-02 12:35   ` Kyle Larose
  2016-09-02 12:53     ` Tan, Jianfeng
  0 siblings, 1 reply; 7+ messages in thread
From: Kyle Larose @ 2016-09-02 12:35 UTC (permalink / raw)
  To: Tan, Jianfeng, dev; +Cc: huawei.xie, yuanhan.liu



> -----Original Message-----
> From: Tan, Jianfeng [mailto:jianfeng.tan@intel.com]
> Sent: Friday, September 02, 2016 2:56 AM
> To: Kyle Larose; dev@dpdk.org
> Cc: huawei.xie@intel.com; yuanhan.liu@linux.intel.com
> Subject: Re: virtio kills qemu VM after stopping/starting ports
> 
> Hi Kyle,
> 
> 
> On 9/2/2016 4:53 AM, Kyle Larose wrote:
> > Hello everyone,
> >
> > In my own testing, I recently stumbled across an issue where I could get qemu
> to exit when sending traffic to my application. To do this, I simply needed to do
> the following:
> >
> > 1) Start my virtio interfaces
> > 2) Send some traffic into/out of the interfaces
> > 3) Stop the interfaces
> > 4) Start the interfaces
> > 5) Send some more traffic
> >
> > At this point, I would lose connectivity to my VM.  Further investigation
> revealed qemu exiting with the following log:
> >
> > 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index
> from 5
> > to 1
> >
> > I found the following bug report against qemu, reported by a user of
> > DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
> >
> > That thread seems to have stalled out, so I think we probably should deal with
> the problem within DPDK itself. Either way, later in the bug report chain, we
> see a link to this patch to DPDK:
> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of
> the bug report claims that this patch fixes the problem. Perhaps it does.
> 
> I once got a chance to analyze the bug you referred here. The patch
> (http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix that
> bug. The root cause of that bug is: when DPDK appilcation gets killed, nobody
> tells the vhost backend to stop. So when restarting the DPDK app, those
> hugepages are reused, and initialized to all zero. And unavoidably, "idx" in
> those memory is changed to 0. I have written a patch to notify the vhost
> backend to stop when DPDK app gets killed suddenly (more accurate, when
> /dev/uioX gets closed), and this patch will be sent out recently. And this patch
> does not fix your problem, either. You did not kill the app. (I should not update
> info about that bug here, and I mean they are different problems)
> 
> > However, it introduces a new problem: If I remove the patch, I cannot
> reproduce the problem. So, that leads me to believe that it has caused a
> regression.
> >
> > To summarize the patch’s changes, it basically changes the virtio_dev_stop
> function to flag the device as stopped, and stops the device when
> closing/uninitializing it. However, there is a seemingly unintended side-effect.
> >
> > In virtio_dev_start, we have the following block of code:
> >
> > 	/* On restart after stop do not touch queues */
> > 	if (hw->started)
> > 		return 0;
> >
> > 	/* Do final configuration before rx/tx engine starts */
> > 	virtio_dev_rxtx_start(dev);
> >
> > ….
> >
> > Prior to the patch, if an interface were stopped then started, without
> restarting the application, the queues would be left as-is, because hw->started
> would be set to 1.
> 
> Yes, my previous patch did break this behavior (by stopping and re-starting the
> device, the queues would be left as-is) and leads to the problem here. And you
> are proposing to recover.
> 
> But is this a right behavior to follow?
> On the one side, when we stop the virtio device, should we notify the vhost
> backend to stop too? Currently, we just flag it as stopped.
> On the other side, now in many PMDs, we mix the dev
> initialization/configuration code into dev_start functions, that is to day, we re-
> configure the device every time we start it (may be to make sure that changed
> configuration will be configured). Then what if we increase the queue numbers
> of virtio?

Hi Jianfeng,


Let me see if my understanding is correct. You're saying that there is a problem introduced by your patch, but going back to the original behaviour is not ideal: it still leaves a gap with respect to changing the number of queues with virtio. The ideal solution would have us properly tear down the device when it is stopped, and then properly reinitialize it when it is started? That seems reasonable. However, I'm not sure what it would take.

Does it make sense to submit a patch like I suggested in the short term, and have somebody work on a more robust long term solution? How could we go about making sure that happens? (I can't allocate much time to fixing this, so I likely won't be able to do anything more complicated than what I've proposed).

> 
> Thanks,
> Jianfeng
> 
> 
> >   Now, calling stop sets hw->started to 0, which means the next call to start
> will “touch the queues”. This is the unintended side-effect that causes the
> problem.
> >
> > I made a change locally to break the state of the device into two: started and
> opened. The devices starts out neither started nor opened. If the device is
> accepting packets, it is started. If the device has set up its queues, it is opened.
> Stopping the device does not close the device. This allows me to change the
> check above to:
> >
> >   	if (hw->opened) {
> > 		hw->started=1
> > 		return 0;
> > 	}
> >
> > Then, if I stop and start the device, it does not reinitialize the
> > queues. I have no problem. I can restart ports as much as I want, and
> > the system keeps running. Traffic flows when they’ve restarted as
> > well, which is always a plus. ☺
> >
> > Some background:
> > - I tested against DPDK 16.04 and DPDK 16.07.
> > - I’m using virtio NICs:
> > - CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> > - Host OS: CentOS Linux release 7.1.1503 (Core)
> > - Guest OS: CentOS Linux release 7.2.1511 (Core)
> > - Qemu-kvm version: 1.5.3-86.el7_1.6
> >
> > I plan on submitting a patch to fix this tomorrow. Let me know if anyone has
> any thoughts about this, or a better way to fix it.
> >
> > Thanks,
> >
> > Kyle


Thanks,

Kyle

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
  2016-09-02 12:35   ` Kyle Larose
@ 2016-09-02 12:53     ` Tan, Jianfeng
  2016-09-05  3:20       ` Tan, Jianfeng
  0 siblings, 1 reply; 7+ messages in thread
From: Tan, Jianfeng @ 2016-09-02 12:53 UTC (permalink / raw)
  To: Kyle Larose, dev; +Cc: huawei.xie, yuanhan.liu

Hi Kyle,


On 9/2/2016 8:35 PM, Kyle Larose wrote:
>
>> -----Original Message-----
>> From: Tan, Jianfeng [mailto:jianfeng.tan@intel.com]
>> Sent: Friday, September 02, 2016 2:56 AM
>> To: Kyle Larose; dev@dpdk.org
>> Cc: huawei.xie@intel.com; yuanhan.liu@linux.intel.com
>> Subject: Re: virtio kills qemu VM after stopping/starting ports
>>
>> Hi Kyle,
>>
>>
>> On 9/2/2016 4:53 AM, Kyle Larose wrote:
>>> Hello everyone,
>>>
>>> In my own testing, I recently stumbled across an issue where I could get qemu
>> to exit when sending traffic to my application. To do this, I simply needed to do
>> the following:
>>> 1) Start my virtio interfaces
>>> 2) Send some traffic into/out of the interfaces
>>> 3) Stop the interfaces
>>> 4) Start the interfaces
>>> 5) Send some more traffic
>>>
>>> At this point, I would lose connectivity to my VM.  Further investigation
>> revealed qemu exiting with the following log:
>>> 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index
>> from 5
>>> to 1
>>>
>>> I found the following bug report against qemu, reported by a user of
>>> DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
>>>
>>> That thread seems to have stalled out, so I think we probably should deal with
>> the problem within DPDK itself. Either way, later in the bug report chain, we
>> see a link to this patch to DPDK:
>> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of
>> the bug report claims that this patch fixes the problem. Perhaps it does.
>>
>> I once got a chance to analyze the bug you referred here. The patch
>> (http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix that
>> bug. The root cause of that bug is: when DPDK appilcation gets killed, nobody
>> tells the vhost backend to stop. So when restarting the DPDK app, those
>> hugepages are reused, and initialized to all zero. And unavoidably, "idx" in
>> those memory is changed to 0. I have written a patch to notify the vhost
>> backend to stop when DPDK app gets killed suddenly (more accurate, when
>> /dev/uioX gets closed), and this patch will be sent out recently. And this patch
>> does not fix your problem, either. You did not kill the app. (I should not update
>> info about that bug here, and I mean they are different problems)
>>
>>> However, it introduces a new problem: If I remove the patch, I cannot
>> reproduce the problem. So, that leads me to believe that it has caused a
>> regression.
>>> To summarize the patch’s changes, it basically changes the virtio_dev_stop
>> function to flag the device as stopped, and stops the device when
>> closing/uninitializing it. However, there is a seemingly unintended side-effect.
>>> In virtio_dev_start, we have the following block of code:
>>>
>>> 	/* On restart after stop do not touch queues */
>>> 	if (hw->started)
>>> 		return 0;
>>>
>>> 	/* Do final configuration before rx/tx engine starts */
>>> 	virtio_dev_rxtx_start(dev);
>>>
>>> ….
>>>
>>> Prior to the patch, if an interface were stopped then started, without
>> restarting the application, the queues would be left as-is, because hw->started
>> would be set to 1.
>>
>> Yes, my previous patch did break this behavior (by stopping and re-starting the
>> device, the queues would be left as-is) and leads to the problem here. And you
>> are proposing to recover.
>>
>> But is this a right behavior to follow?
>> On the one side, when we stop the virtio device, should we notify the vhost
>> backend to stop too? Currently, we just flag it as stopped.
>> On the other side, now in many PMDs, we mix the dev
>> initialization/configuration code into dev_start functions, that is to day, we re-
>> configure the device every time we start it (may be to make sure that changed
>> configuration will be configured). Then what if we increase the queue numbers
>> of virtio?
> Hi Jianfeng,
>
>
> Let me see if my understanding is correct. You're saying that there is a problem introduced by your patch, but going back to the original behaviour is not ideal: it still leaves a gap with respect to changing the number of queues with virtio. The ideal solution would have us properly tear down the device when it is stopped, and then properly reinitialize it when it is started? That seems reasonable. However, I'm not sure what it would take.

Yes, exactly.

> Does it make sense to submit a patch like I suggested in the short term, and have somebody work on a more robust long term solution? How could we go about making sure that happens? (I can't allocate much time to fixing this, so I likely won't be able to do anything more complicated than what I've proposed).

Let's see if maintainers have any suggestions here. (And I personally do 
not oppose your proposal for short term).

Thanks,
Jianfeng

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
  2016-09-02 12:53     ` Tan, Jianfeng
@ 2016-09-05  3:20       ` Tan, Jianfeng
  0 siblings, 0 replies; 7+ messages in thread
From: Tan, Jianfeng @ 2016-09-05  3:20 UTC (permalink / raw)
  To: Tan, Jianfeng, Kyle Larose, dev; +Cc: Xie, Huawei, yuanhan.liu

Hi,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tan, Jianfeng
> Sent: Friday, September 2, 2016 8:54 PM
> To: Kyle Larose; dev@dpdk.org
> Cc: Xie, Huawei; yuanhan.liu@linux.intel.com
> Subject: Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
> 
> Hi Kyle,
> 
> 
> On 9/2/2016 8:35 PM, Kyle Larose wrote:
> >
> >> -----Original Message-----
> >> From: Tan, Jianfeng [mailto:jianfeng.tan@intel.com]
> >> Sent: Friday, September 02, 2016 2:56 AM
> >> To: Kyle Larose; dev@dpdk.org
> >> Cc: huawei.xie@intel.com; yuanhan.liu@linux.intel.com
> >> Subject: Re: virtio kills qemu VM after stopping/starting ports
> >>
> >> Hi Kyle,
> >>
> >>
> >> On 9/2/2016 4:53 AM, Kyle Larose wrote:
> >>> Hello everyone,
> >>>
> >>> In my own testing, I recently stumbled across an issue where I could get
> qemu
> >> to exit when sending traffic to my application. To do this, I simply needed
> to do
> >> the following:
> >>> 1) Start my virtio interfaces
> >>> 2) Send some traffic into/out of the interfaces
> >>> 3) Stop the interfaces
> >>> 4) Start the interfaces
> >>> 5) Send some more traffic
> >>>
> >>> At this point, I would lose connectivity to my VM.  Further investigation
> >> revealed qemu exiting with the following log:
> >>> 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index
> >> from 5
> >>> to 1
> >>>
> >>> I found the following bug report against qemu, reported by a user of
> >>> DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
> >>>
> >>> That thread seems to have stalled out, so I think we probably should
> deal with
> >> the problem within DPDK itself. Either way, later in the bug report chain,
> we
> >> see a link to this patch to DPDK:
> >> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The
> submitter of
> >> the bug report claims that this patch fixes the problem. Perhaps it does.
> >>
> >> I once got a chance to analyze the bug you referred here. The patch
> >> (http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix
> that
> >> bug. The root cause of that bug is: when DPDK appilcation gets killed,
> nobody
> >> tells the vhost backend to stop. So when restarting the DPDK app, those
> >> hugepages are reused, and initialized to all zero. And unavoidably, "idx" in
> >> those memory is changed to 0. I have written a patch to notify the vhost
> >> backend to stop when DPDK app gets killed suddenly (more accurate,
> when
> >> /dev/uioX gets closed), and this patch will be sent out recently. And this
> patch
> >> does not fix your problem, either. You did not kill the app. (I should not
> update
> >> info about that bug here, and I mean they are different problems)
> >>
> >>> However, it introduces a new problem: If I remove the patch, I cannot
> >> reproduce the problem. So, that leads me to believe that it has caused a
> >> regression.
> >>> To summarize the patch’s changes, it basically changes the
> virtio_dev_stop
> >> function to flag the device as stopped, and stops the device when
> >> closing/uninitializing it. However, there is a seemingly unintended side-
> effect.
> >>> In virtio_dev_start, we have the following block of code:
> >>>
> >>> 	/* On restart after stop do not touch queues */
> >>> 	if (hw->started)
> >>> 		return 0;
> >>>
> >>> 	/* Do final configuration before rx/tx engine starts */
> >>> 	virtio_dev_rxtx_start(dev);
> >>>
> >>> ….
> >>>
> >>> Prior to the patch, if an interface were stopped then started, without
> >> restarting the application, the queues would be left as-is, because hw-
> >started
> >> would be set to 1.
> >>
> >> Yes, my previous patch did break this behavior (by stopping and re-
> starting the
> >> device, the queues would be left as-is) and leads to the problem here.
> And you
> >> are proposing to recover.
> >>
> >> But is this a right behavior to follow?
> >> On the one side, when we stop the virtio device, should we notify the
> vhost
> >> backend to stop too? Currently, we just flag it as stopped.
> >> On the other side, now in many PMDs, we mix the dev
> >> initialization/configuration code into dev_start functions, that is to day,
> we re-
> >> configure the device every time we start it (may be to make sure that
> changed
> >> configuration will be configured). Then what if we increase the queue
> numbers
> >> of virtio?
> > Hi Jianfeng,
> >
> >
> > Let me see if my understanding is correct. You're saying that there is a
> problem introduced by your patch, but going back to the original behaviour is
> not ideal: it still leaves a gap with respect to changing the number of queues
> with virtio. The ideal solution would have us properly tear down the device
> when it is stopped, and then properly reinitialize it when it is started? That
> seems reasonable. However, I'm not sure what it would take.
> 
> Yes, exactly.
> 
> > Does it make sense to submit a patch like I suggested in the short term, and
> have somebody work on a more robust long term solution? How could we go
> about making sure that happens? (I can't allocate much time to fixing this, so
> I likely won't be able to do anything more complicated than what I've
> proposed).
> 
> Let's see if maintainers have any suggestions here. (And I personally do
> not oppose your proposal for short term).

To add more info:

The previous patch 9a0615af774 was to fix the problem which is described here:
http://dpdk.org/ml/archives/dev/2015-December/030911.html

And it's about reconfiguring queue number after stopping the device and then restart the device.

Thanks,
Jianfeng

> 
> Thanks,
> Jianfeng
> 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
  2016-09-01 20:53 [dpdk-dev] virtio kills qemu VM after stopping/starting ports Kyle Larose
  2016-09-02  6:55 ` Tan, Jianfeng
@ 2016-09-05  4:10 ` Yuanhan Liu
  2016-09-06 12:25   ` Kyle Larose
  1 sibling, 1 reply; 7+ messages in thread
From: Yuanhan Liu @ 2016-09-05  4:10 UTC (permalink / raw)
  To: Kyle Larose; +Cc: dev, huawei.xie, jianfeng.tan

On Thu, Sep 01, 2016 at 08:53:31PM +0000, Kyle Larose wrote:
> Hello everyone,

Hi,

Firstly, thanks for the report and detailed analysis!

> 
> In my own testing, I recently stumbled across an issue where I could get qemu to exit when sending traffic to my application. To do this, I simply needed to do the following:
> 
> 1) Start my virtio interfaces
> 2) Send some traffic into/out of the interfaces
> 3) Stop the interfaces
> 4) Start the interfaces
> 5) Send some more traffic
> 
> At this point, I would lose connectivity to my VM.  Further investigation revealed qemu exiting with the following log:
> 
> 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index from 5 to 1
> 
> I found the following bug report against qemu, reported by a user of DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
> 
> That thread seems to have stalled out, so I think we probably should deal with the problem within DPDK itself. Either way, later in the bug report chain, we see a link to this patch to DPDK: http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of the bug report claims that this patch fixes the problem. Perhaps it does. However, it introduces a new problem: If I remove the patch, I cannot reproduce the problem. So, that leads me to believe that it has caused a regression.

Yes, it is a regression from that point of view.

> To summarize the patch’s changes, it basically changes the virtio_dev_stop function to flag the device as stopped, and stops the device when closing/uninitializing it. However, there is a seemingly unintended side-effect. 
> 
> In virtio_dev_start, we have the following block of code:
> 
> 	/* On restart after stop do not touch queues */
> 	if (hw->started)
> 		return 0;
> 
> 	/* Do final configuration before rx/tx engine starts */
> 	virtio_dev_rxtx_start(dev);
> 
> ….
> 
> Prior to the patch, if an interface were stopped then started, without restarting the application, the queues would be left as-is, because hw->started would be set to 1. Now, calling stop sets hw->started to 0, which means the next call to start will “touch the queues”. This is the unintended side-effect that causes the problem.
> 
> I made a change locally to break the state of the device into two: started and opened. The devices starts out neither started nor opened. If the device is accepting packets, it is started. If the device has set up its queues, it is opened. Stopping the device does not close the device. This allows me to change the check above to:
> 
>  	if (hw->opened) {
> 		hw->started=1
> 		return 0;
> 	}

It would work in your case, but it makes thing complex.

So, I talked with Jianfeng and revisited the original issue he meant to
fix: failure (maybe crash) on stop, re-configure queue number and restart.

Yes, that case is broken, but the fix wasn't right, neither: we can't
simply re-alloc and re-setup queue on start, because vhost is only aware
of the first setup.  You could check following link for more information,
including the right fix (you need follow the discussion to find that).

In summary, I will revert commit 9a0615af774 (and carry it to the stable
branch as well). Later, I will fix the virtio multiple queue issue.

	--yliu

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] virtio kills qemu VM after stopping/starting ports
  2016-09-05  4:10 ` Yuanhan Liu
@ 2016-09-06 12:25   ` Kyle Larose
  0 siblings, 0 replies; 7+ messages in thread
From: Kyle Larose @ 2016-09-06 12:25 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, huawei.xie, jianfeng.tan



> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Monday, September 05, 2016 12:10 AM
> To: Kyle Larose
> Cc: dev@dpdk.org; huawei.xie@intel.com; jianfeng.tan@intel.com
> Subject: Re: virtio kills qemu VM after stopping/starting ports
> 
> On Thu, Sep 01, 2016 at 08:53:31PM +0000, Kyle Larose wrote:
> > Hello everyone,
> 
> Hi,
> 
> Firstly, thanks for the report and detailed analysis!
> 
> >
> > In my own testing, I recently stumbled across an issue where I could get qemu
> to exit when sending traffic to my application. To do this, I simply needed to do
> the following:
> >
> > 1) Start my virtio interfaces
> > 2) Send some traffic into/out of the interfaces
> > 3) Stop the interfaces
> > 4) Start the interfaces
> > 5) Send some more traffic
> >
> > At this point, I would lose connectivity to my VM.  Further investigation
> revealed qemu exiting with the following log:
> >
> > 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index
> from 5
> > to 1
> >
> > I found the following bug report against qemu, reported by a user of
> > DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
> >
> > That thread seems to have stalled out, so I think we probably should deal with
> the problem within DPDK itself. Either way, later in the bug report chain, we
> see a link to this patch to DPDK:
> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of
> the bug report claims that this patch fixes the problem. Perhaps it does.
> However, it introduces a new problem: If I remove the patch, I cannot
> reproduce the problem. So, that leads me to believe that it has caused a
> regression.
> 
> Yes, it is a regression from that point of view.
> 
> > To summarize the patch’s changes, it basically changes the virtio_dev_stop
> function to flag the device as stopped, and stops the device when
> closing/uninitializing it. However, there is a seemingly unintended side-effect.
> >
> > In virtio_dev_start, we have the following block of code:
> >
> > 	/* On restart after stop do not touch queues */
> > 	if (hw->started)
> > 		return 0;
> >
> > 	/* Do final configuration before rx/tx engine starts */
> > 	virtio_dev_rxtx_start(dev);
> >
> > ….
> >
> > Prior to the patch, if an interface were stopped then started, without
> restarting the application, the queues would be left as-is, because hw->started
> would be set to 1. Now, calling stop sets hw->started to 0, which means the
> next call to start will “touch the queues”. This is the unintended side-effect that
> causes the problem.
> >
> > I made a change locally to break the state of the device into two: started and
> opened. The devices starts out neither started nor opened. If the device is
> accepting packets, it is started. If the device has set up its queues, it is opened.
> Stopping the device does not close the device. This allows me to change the
> check above to:
> >
> >  	if (hw->opened) {
> > 		hw->started=1
> > 		return 0;
> > 	}
> 
> It would work in your case, but it makes thing complex.
> 
> So, I talked with Jianfeng and revisited the original issue he meant to
> fix: failure (maybe crash) on stop, re-configure queue number and restart.
> 
> Yes, that case is broken, but the fix wasn't right, neither: we can't simply re-
> alloc and re-setup queue on start, because vhost is only aware of the first setup.
> You could check following link for more information, including the right fix (you
> need follow the discussion to find that).
> 
> In summary, I will revert commit 9a0615af774 (and carry it to the stable
> branch as well). Later, I will fix the virtio multiple queue issue.
> 

Alright, so we should probably reject my patch, then. :) http://dpdk.org/dev/patchwork/patch/15596/



> 	--yliu

Thanks for getting back to me on this.

Kyle

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-09-06 12:25 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-01 20:53 [dpdk-dev] virtio kills qemu VM after stopping/starting ports Kyle Larose
2016-09-02  6:55 ` Tan, Jianfeng
2016-09-02 12:35   ` Kyle Larose
2016-09-02 12:53     ` Tan, Jianfeng
2016-09-05  3:20       ` Tan, Jianfeng
2016-09-05  4:10 ` Yuanhan Liu
2016-09-06 12:25   ` Kyle Larose

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).