DPDK patches and discussions
* [dpdk-dev] vhost-user technical isssues
@ 2014-11-11 21:37 Xie, Huawei
  2014-11-12  4:12 ` Tetsuya Mukawa
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Xie, Huawei @ 2014-11-11 21:37 UTC (permalink / raw)
  To: 'Tetsuya Mukawa', dev

Hi Tetsuya:
There are two major technical issues in my mind for vhost-user implementation.

1) memory region map
Vhost-user passes us a file fd and an offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue a long time ago, and also found
that I couldn't mmap the huge page file even with the correct offset (need to double check).
Just now I found that people reported this issue on Nov 3:
[Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the whole file.
I think we should use this approach temporarily to support qemu-2.1, as it has that bug.
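
For illustration, a minimal sketch of that workaround (hypothetical structure and helper names, assuming all region fds refer to the same hugepage file, as the memory-table message delivers them):

---------------------------------------------
/* Sketch: map the whole guest memory once through region 0's fd and
 * derive per-region host addresses from each region's mmap offset. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

struct region { uint64_t guest_phys_addr, memory_size, mmap_offset; int fd; };

static void *map_all_guest_mem(struct region *regions, int nregions)
{
    uint64_t total = 0;

    for (int i = 0; i < nregions; i++) {
        uint64_t end = regions[i].mmap_offset + regions[i].memory_size;
        if (end > total)
            total = end;
    }
    /* One mapping of the whole file; the buggy per-region offsets are
     * never passed to mmap(). Region i lives at base + mmap_offset. */
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_SHARED, regions[0].fd, 0);
    return (base == MAP_FAILED) ? NULL : base;
}
---------------------------------------------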

2) what message is the indicator for vhost start/release?
Previously, vhost-cuse had the SET_BACKEND message.
What should we do for vhost-user?
SET_VRING_KICK for start?
What about for release?
Unlike the kernel virtio, the DPDK virtio in the guest can be restarted.

Thoughts?

-huawei

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-11 21:37 [dpdk-dev] vhost-user technical isssues Xie, Huawei
@ 2014-11-12  4:12 ` Tetsuya Mukawa
  2014-11-13  6:30   ` Linhaifeng
  2014-11-14  0:22   ` Xie, Huawei
  2014-11-13  6:12 ` Linhaifeng
  2014-11-13  6:27 ` Linhaifeng
  2 siblings, 2 replies; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-12  4:12 UTC (permalink / raw)
  To: Xie, Huawei, dev

Hi Xie,

(2014/11/12 6:37), Xie, Huawei wrote:
> Hi Tetsuya:
> There are two major technical issues in my mind for vhost-user implementation.
>
> 1) memory region map
> Vhost-user passes us file fd and offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue long time ago, and also found
> that I couldn't mmap the huge page file even with correct offset(need double check).
> Just now I find that people reported this issue on Nov 3.
> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the  whole file.
> I think we should use this way temporarily to support qemu-2.1 as it has that bug.
I agree with you.
Also, we may have an issue with unmapping files on hugetlbfs in Linux.
When I checked munmap(), it seems 'size' needs to be aligned to the hugepage size.
(I guess it may be a kernel bug. Might be fixed already.)
Please add return-value checking code for munmap();
munmap() might still fail.
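
For reference, a minimal sketch of the check being suggested (the 1GB hugepage size is an assumption; real code would query the hugepage size of the backing file):

---------------------------------------------
/* Sketch: check munmap()'s return value; if unmapping with the plain size
 * fails, retry with the length rounded up to the (assumed) hugepage size. */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

static int unmap_region(void *addr, size_t size)
{
    const size_t hugepage_sz = 1UL << 30;   /* assumption: 1GB hugepages */

    if (munmap(addr, size) == 0)
        return 0;
    /* Some kernels only accept a hugepage-aligned length for hugetlbfs. */
    size = (size + hugepage_sz - 1) & ~(hugepage_sz - 1);
    if (munmap(addr, size) == 0)
        return 0;
    perror("munmap");
    return -1;
}
---------------------------------------------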

>
> 2) what message is the indicator for vhost start/release?
> Previously  for vhost-cuse, it has SET_BACKEND message.
> What we should do for vhost-user?
> SET_VRING_KICK for start?
I think so.

> What about for release?
> Unlike the kernel virtio, the DPDK virtio in guest could be restarted. 
>
> Thoughts?
I guess we need to consider two types of restarting.
One is virtio-net driver restarting, the other is vhost-user backend
restarting.
But, so far, it makes sense to think about virtio-net driver
restarting first.

Probably we need to implement a way to let the vhost-user backend know
the virtio-net driver has restarted.
I am not sure what a good way to let the vhost-user backend know it is,
but how about the following RFC?

- When the unix domain socket is closed, the vhost-user backend should
 treat it as "release".
 This is useful when QEMU itself goes away suddenly.

- Also, implement a new ioctl command like VHOST_RESET_BACKEND.
 This command should be sent from the virtio-net device of QEMU when the
 VIRTIO_CONFIG_STATUS_RESET register of the virtio-net device is set by
 the virtio-net driver.
 (Usually this register is set when the virtio-net driver is initialized
 or stopped.)
 It means we need to change QEMU. ;)
 It seems the virtio-net PMD already sets this register when the PMD is
 initialized or stopped.
 So this framework should work, and can let the vhost-user backend know
 about driver resets.
 (And I guess we can say the same for the virtio-net kernel driver.)
 It might be enough to close the unix domain socket, instead of
 implementing a new command.
 But in that case, we may need an auto-reconnection mechanism.

- We also need to consider the case where a DPDK application goes away
 suddenly without setting the reset register.
 In that case, the vhost-user backend cannot know about it. Only the user
 (or some kind of watchdog application on the guest) knows.
 Because of this, the user (or app) has the responsibility to resolve the
 situation.
 To be more precise, the user should let the vhost-user backend know about
 the device release.
 If the user starts another DPDK application without resolving the issue,
 the new DPDK application may access memory that the vhost-user backend is
 also accessing.
 I guess the user can resolve this using "dpdk_nic_bind.py".
 The script can move the virtio-net device to the kernel virtio-net driver
 and then return it to igb_uio.
 During those steps, the virtio-net device is initialized by the virtio-net
 kernel driver, so the vhost-user backend can know about the device release.

Tetsuya

>
> -huawei

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-11 21:37 [dpdk-dev] vhost-user technical isssues Xie, Huawei
  2014-11-12  4:12 ` Tetsuya Mukawa
@ 2014-11-13  6:12 ` Linhaifeng
  2014-11-13  6:27 ` Linhaifeng
  2 siblings, 0 replies; 22+ messages in thread
From: Linhaifeng @ 2014-11-13  6:12 UTC (permalink / raw)
  To: Xie, Huawei, 'Tetsuya Mukawa', dev



On 2014/11/12 5:37, Xie, Huawei wrote:
> Hi Tetsuya:
> There are two major technical issues in my mind for vhost-user implementation.
> 
> 1) memory region map
> Vhost-user passes us file fd and offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue long time ago, and also found
> that I couldn't mmap the huge page file even with correct offset(need double check).
> Just now I find that people reported this issue on Nov 3.
> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the  whole file.
> I think we should use this way temporarily to support qemu-2.1 as it has that bug.
> 

This bug is not in DPDK's vhost-user; it is only in QEMU's vhost-user backend.
> 2) what message is the indicator for vhost start/release?
> Previously  for vhost-cuse, it has SET_BACKEND message.
> What we should do for vhost-user?
> SET_VRING_KICK for start?
> What about for release?
> Unlike the kernel virtio, the DPDK virtio in guest could be restarted. 
> 
> Thoughts?
> 
> -huawei
> 
> 

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-11 21:37 [dpdk-dev] vhost-user technical isssues Xie, Huawei
  2014-11-12  4:12 ` Tetsuya Mukawa
  2014-11-13  6:12 ` Linhaifeng
@ 2014-11-13  6:27 ` Linhaifeng
  2014-11-14  1:28   ` Xie, Huawei
  2 siblings, 1 reply; 22+ messages in thread
From: Linhaifeng @ 2014-11-13  6:27 UTC (permalink / raw)
  To: Xie, Huawei, 'Tetsuya Mukawa', dev



On 2014/11/12 5:37, Xie, Huawei wrote:
> Hi Tetsuya:
> There are two major technical issues in my mind for vhost-user implementation.
> 
> 1) memory region map
> Vhost-user passes us file fd and offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue long time ago, and also found
> that I couldn't mmap the huge page file even with correct offset(need double check).
> Just now I find that people reported this issue on Nov 3.
> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the  whole file.
> I think we should use this way temporarily to support qemu-2.1 as it has that bug.
> 

The size of region 0 is not the same as the file size. Maybe you should mmap one of the other regions.

region 0:
        gpa = 0x0
        size = 655360
        ua = 0x2aaaaac00000
        offset = 0

region 1: // use this region to mmap. BTW, how do we avoid mmapping twice when there are two devices?
        gpa = 0xC0000
        size = 2146697216
        ua = 0x2aaaaacc0000
        offset = 786432
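
(For reference: region 1's mmap offset, 786432 bytes (0xC0000), plus its size, 2146697216 bytes, is exactly 2147483648 bytes, i.e. a single 2GB hugepage file; region 0 covers only the first 640KB of guest memory below the legacy I/O hole, which is why its size differs from the file size.)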



> 2) what message is the indicator for vhost start/release?
> Previously  for vhost-cuse, it has SET_BACKEND message.
> What we should do for vhost-user?
> SET_VRING_KICK for start?
> What about for release?
> Unlike the kernel virtio, the DPDK virtio in guest could be restarted. 
> 
> Thoughts?
> 
> -huawei
> 
> 

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-12  4:12 ` Tetsuya Mukawa
@ 2014-11-13  6:30   ` Linhaifeng
  2014-11-14  2:30     ` Tetsuya Mukawa
  2014-11-14  0:22   ` Xie, Huawei
  1 sibling, 1 reply; 22+ messages in thread
From: Linhaifeng @ 2014-11-13  6:30 UTC (permalink / raw)
  To: Tetsuya Mukawa, Xie, Huawei, dev



On 2014/11/12 12:12, Tetsuya Mukawa wrote:
> Hi Xie,
> 
> (2014/11/12 6:37), Xie, Huawei wrote:
>> Hi Tetsuya:
>> There are two major technical issues in my mind for vhost-user implementation.
>>
>> 1) memory region map
>> Vhost-user passes us file fd and offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue long time ago, and also found
>> that I couldn't mmap the huge page file even with correct offset(need double check).
>> Just now I find that people reported this issue on Nov 3.
>> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
>> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the  whole file.
>> I think we should use this way temporarily to support qemu-2.1 as it has that bug.
> I agree with you.
> Also we may have an issue about un-mapping file on hugetlbfs of linux.
> When I check munmap(), it seems 'size' need to be aligned by hugepage size.
> (I guess it may be a kernel bug. Might be fixed already.)
> Please add return value checking code for munmap().
> Still munmap() might be failed.
> 
Are you munmapping region 0? Region 0 does not need to be mmapped, so it does not need to be munmapped either.

I can munmap the other regions successfully.

>>
>> 2) what message is the indicator for vhost start/release?
>> Previously  for vhost-cuse, it has SET_BACKEND message.
>> What we should do for vhost-user?
>> SET_VRING_KICK for start?
> I think so.
> 
>> What about for release?
>> Unlike the kernel virtio, the DPDK virtio in guest could be restarted. 
>>
>> Thoughts?
> I guess we need to consider 2 types of restarting.
> One is virtio-net driver restarting, the other is vhost-user backend
> restarting.
> But, so far, it's nice to start to think about virtio-net driver
> restarting first.
> 
> Probably we need to implement a way to let vhost-user backend know
> virtio-net driver is restarted.
> I am not sure what is good way to let vhost-user backend know it.
> But how about followings RFC?
> 
> - When unix domain socket is closed, vhost-user backend should treat it
> as "release".
>  It is useful when QEMU itself is gone suddenly.
> 
> - Also, implementing new ioctl command like VHOST_RESET_BACKEND.
>  This command should be sent from virtio-net device of QEMU when
>  VIRTIO_CONFIG_STATUS_RESET register of virtio-net device is set by
> vrtio-net driver.
>  (Usually this register is set when virtio-net driver is initialized or
> stopped.)
>  It means we need to change QEMU. ;)
>  It seems virtio-net PMD already sets this register when PMD is
> initialized or stopped.
>  So this framework should work, and can let vhost-user backend know
> driver resetting.
>  (And I guess we can say same things for virtio-net kernel driver.)
>  It might be enough to close an unix domain socket, instead of
> implementing new command.
>  But in the case, we may need auto reconnection mechanism.
> 
> - We also need to consider DPDK application is gone suddenly without
> setting reset register.
>  In the case, vhost-user backend cannot know it. Only user (or some kind
> of watchdog
>  applications on guest) knows it.
>  Because of this, user(or app.) should have responsibility to solve this
> situation.
>  To be more precise, user should let vhost-user backend know device
> releasing.
>  If user starts an other DPDK application without solving the issue, the
> new DPDK application may
>  access memory that vhost-user backend is also accessing.
>  I guess user can solve the issue using "dpdk_nic_bind.py".
>  The script can move virtio-net device to kernel virtio-net driver, and
> return it to igb_uio.
>  While those steps, virtio-net device is initialized by virtio-net
> kernel driver.
>  So vhost-user backend can know device releasing.
> 
> Tetsuya
> 
>>
>> -huawei
> 
> 
> 
> 

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-12  4:12 ` Tetsuya Mukawa
  2014-11-13  6:30   ` Linhaifeng
@ 2014-11-14  0:22   ` Xie, Huawei
  2014-11-14  2:52     ` Tetsuya Mukawa
  1 sibling, 1 reply; 22+ messages in thread
From: Xie, Huawei @ 2014-11-14  0:22 UTC (permalink / raw)
  To: Tetsuya Mukawa, dev



> -----Original Message-----
> From: Tetsuya Mukawa [mailto:mukawa@igel.co.jp]
> Sent: Tuesday, November 11, 2014 9:13 PM
> To: Xie, Huawei; dev@dpdk.org
> Cc: Long, Thomas
> Subject: Re: vhost-user technical isssues
> 
> Hi Xie,
> 
> (2014/11/12 6:37), Xie, Huawei wrote:
> > Hi Tetsuya:
> > There are two major technical issues in my mind for vhost-user
> implementation.
> >
> > 1) memory region map
> > Vhost-user passes us file fd and offset for each memory region. Unfortunately
> the mmap offset is "very" wrong. I discovered this issue long time ago, and also
> found
> > that I couldn't mmap the huge page file even with correct offset(need double
> check).
> > Just now I find that people reported this issue on Nov 3.
> > [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
> > Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd
> for region(0) to map the  whole file.
> > I think we should use this way temporarily to support qemu-2.1 as it has that
> bug.
> I agree with you.
> Also we may have an issue about un-mapping file on hugetlbfs of linux.
> When I check munmap(), it seems 'size' need to be aligned by hugepage size.
> (I guess it may be a kernel bug. Might be fixed already.)
> Please add return value checking code for munmap().
> Still munmap() might be failed.
> 
> >
> > 2) what message is the indicator for vhost start/release?
> > Previously  for vhost-cuse, it has SET_BACKEND message.
> > What we should do for vhost-user?
> > SET_VRING_KICK for start?
> I think so.
> 
> > What about for release?
> > Unlike the kernel virtio, the DPDK virtio in guest could be restarted.
> >
> > Thoughts?
> I guess we need to consider 2 types of restarting.
> One is virtio-net driver restarting, the other is vhost-user backend
> restarting.
> But, so far, it's nice to start to think about virtio-net driver
> restarting first.
> 
> Probably we need to implement a way to let vhost-user backend know
> virtio-net driver is restarted.
> I am not sure what is good way to let vhost-user backend know it.
> But how about followings RFC?

I checked your code today, and didn't find the logic to deal with virtio reconfiguration.
> 
> - When unix domain socket is closed, vhost-user backend should treat it
> as "release".
>  It is useful when QEMU itself is gone suddenly.
This is the simple case.
> 
> - Also, implementing new ioctl command like VHOST_RESET_BACKEND.
>  This command should be sent from virtio-net device of QEMU when
>  VIRTIO_CONFIG_STATUS_RESET register of virtio-net device is set by
> vrtio-net driver.
>  (Usually this register is set when virtio-net driver is initialized or
> stopped.)
>  It means we need to change QEMU. ;)
>  It seems virtio-net PMD already sets this register when PMD is
> initialized or stopped.
>  So this framework should work, and can let vhost-user backend know
> driver resetting.
>  (And I guess we can say same things for virtio-net kernel driver.)
>  It might be enough to close an unix domain socket, instead of
> implementing new command.
I don't understand the point about closing the socket.

The socket connection from qemu will be opened and closed once during the life cycle of
the virtio device. This is correct behavior. But the virtio driver could be reconfigured several times by the guest;
that is done by writing a status value to the STATUS register.

>  But in the case, we may need auto reconnection mechanism.
> 
> - We also need to consider DPDK application is gone suddenly without
> setting reset register.
>  In the case, vhost-user backend cannot know it. Only user (or some kind
> of watchdog
>  applications on guest) knows it.
>  Because of this, user(or app.) should have responsibility to solve this
> situation.
>  To be more precise, user should let vhost-user backend know device
> releasing.
>  If user starts an other DPDK application without solving the issue, the
> new DPDK application may
>  access memory that vhost-user backend is also accessing.
>  I guess user can solve the issue using "dpdk_nic_bind.py".
>  The script can move virtio-net device to kernel virtio-net driver, and
> return it to igb_uio.
>  While those steps, virtio-net device is initialized by virtio-net
> kernel driver.
>  So vhost-user backend can know device releasing.
> 


My thought without new message support:
When vhost-user receives another configuration message after the device has become ready for
processing, we could release the device from the data core, process the reconfiguration
messages, and then re-add it to the data core when it is ready again (checking for the new kick message as before).

The candidate message is set_mem_table.

It is also OK to keep the device on the data core until we receive the new reconfiguration message; it just wastes vhost
some cycles checking the avail idx.
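
Something like the following, as a rough sketch of that flow (all types and helper names here are hypothetical, not the actual librte_vhost code):

---------------------------------------------
/* Sketch: take a live device off the data core when a reconfiguration
 * message arrives, and re-add it once the usual readiness check passes. */
enum msg_type { MSG_SET_MEM_TABLE, MSG_SET_VRING_KICK };

struct vhost_dev { int running; };

void remove_from_data_core(struct vhost_dev *dev); /* stop polling its rings */
void remap_guest_memory(struct vhost_dev *dev);    /* handle new memory layout */
void setup_kick_fd(struct vhost_dev *dev);
int  device_is_ready(struct vhost_dev *dev);       /* same check as today */
void add_to_data_core(struct vhost_dev *dev);

void handle_message(struct vhost_dev *dev, enum msg_type type)
{
    switch (type) {
    case MSG_SET_MEM_TABLE:
        if (dev->running)                /* was ready before: release first */
            remove_from_data_core(dev);
        remap_guest_memory(dev);
        break;
    case MSG_SET_VRING_KICK:
        setup_kick_fd(dev);
        if (device_is_ready(dev))        /* re-add when ready again */
            add_to_data_core(dev);
        break;
    }
}
---------------------------------------------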

> Tetsuya
> 
> >
> > -huawei
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-13  6:27 ` Linhaifeng
@ 2014-11-14  1:28   ` Xie, Huawei
  2014-11-14  2:24     ` Linhaifeng
  0 siblings, 1 reply; 22+ messages in thread
From: Xie, Huawei @ 2014-11-14  1:28 UTC (permalink / raw)
  To: Linhaifeng, 'Tetsuya Mukawa', dev



> -----Original Message-----
> From: Linhaifeng [mailto:haifeng.lin@huawei.com]
> Sent: Wednesday, November 12, 2014 11:28 PM
> To: Xie, Huawei; 'Tetsuya Mukawa'; dev@dpdk.org
> Subject: Re: [dpdk-dev] vhost-user technical isssues
> 
> 
> 
> On 2014/11/12 5:37, Xie, Huawei wrote:
> > Hi Tetsuya:
> > There are two major technical issues in my mind for vhost-user
> implementation.
> >
> > 1) memory region map
> > Vhost-user passes us file fd and offset for each memory region. Unfortunately
> the mmap offset is "very" wrong. I discovered this issue long time ago, and also
> found
> > that I couldn't mmap the huge page file even with correct offset(need double
> check).
> > Just now I find that people reported this issue on Nov 3.
> > [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
> > Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd
> for region(0) to map the  whole file.
> > I think we should use this way temporarily to support qemu-2.1 as it has that
> bug.
> >
> 
> the size of region 0 is not same as the file size. may be you should mmap the
> other region.
Haifeng:

We will calculate the maximum memory size and use any of the file fds to mmap it.
Here we assume the fds for the different regions actually point to the same file.

In theory we should use the fd for each region to map each memory region.
In practice we could map just once. This also saves address space for 1GB huge pages,
due to the mmap alignment requirement.
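
To illustrate why one mapping can be enough (a sketch under the same-file assumption above; structure and helper names are hypothetical):

---------------------------------------------
/* Sketch: with the whole file mapped once at 'base', translate a guest
 * physical address by finding its region and adding the region's offset. */
#include <stddef.h>
#include <stdint.h>

struct region { uint64_t guest_phys_addr, memory_size, mmap_offset; };

static void *gpa_to_hva(uint8_t *base, struct region *r, int n, uint64_t gpa)
{
    for (int i = 0; i < n; i++) {
        if (gpa >= r[i].guest_phys_addr &&
            gpa <  r[i].guest_phys_addr + r[i].memory_size)
            return base + r[i].mmap_offset + (gpa - r[i].guest_phys_addr);
    }
    return NULL; /* not backed by any region */
}
---------------------------------------------
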
> 
> region 0:
>         gpa = 0x0
>         size = 655360
>         ua = 0x2aaaaac00000
>         offset = 0
> 
> region 1:// use this region to mmap.BTW how to avoid mmap twice when there
> are two devices?
>         gpa = 0xC0000
>         size = 2146697216
>         ua = 0x2aaaaacc0000
>         offset = 786432

What do you mean by two devices?
> 
> 
> 
> > 2) what message is the indicator for vhost start/release?
> > Previously  for vhost-cuse, it has SET_BACKEND message.
> > What we should do for vhost-user?
> > SET_VRING_KICK for start?
> > What about for release?
> > Unlike the kernel virtio, the DPDK virtio in guest could be restarted.
> >
> > Thoughts?
> >
> > -huawei
> >
> >
> 
> --
> Regards,
> Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  1:28   ` Xie, Huawei
@ 2014-11-14  2:24     ` Linhaifeng
  2014-11-14  2:35       ` Tetsuya Mukawa
  2014-11-14  6:24       ` Xie, Huawei
  0 siblings, 2 replies; 22+ messages in thread
From: Linhaifeng @ 2014-11-14  2:24 UTC (permalink / raw)
  To: Xie, Huawei, 'Tetsuya Mukawa', dev, lilijun, zhangkun



On 2014/11/14 9:28, Xie, Huawei wrote:
> 
> 
>> -----Original Message-----
>> From: Linhaifeng [mailto:haifeng.lin@huawei.com]
>> Sent: Wednesday, November 12, 2014 11:28 PM
>> To: Xie, Huawei; 'Tetsuya Mukawa'; dev@dpdk.org
>> Subject: Re: [dpdk-dev] vhost-user technical isssues
>>
>>
>>
>> On 2014/11/12 5:37, Xie, Huawei wrote:
>>> Hi Tetsuya:
>>> There are two major technical issues in my mind for vhost-user
>> implementation.
>>>
>>> 1) memory region map
>>> Vhost-user passes us file fd and offset for each memory region. Unfortunately
>> the mmap offset is "very" wrong. I discovered this issue long time ago, and also
>> found
>>> that I couldn't mmap the huge page file even with correct offset(need double
>> check).
>>> Just now I find that people reported this issue on Nov 3.
>>> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
>>> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd
>> for region(0) to map the  whole file.
>>> I think we should use this way temporarily to support qemu-2.1 as it has that
>> bug.
>>>
>>
>> the size of region 0 is not same as the file size. may be you should mmap the
>> other region.
> Haifeng:
> 
> Will calculate the maximum memory size, and use any file fd to mmap it.
> Here we assume the fds for different regions actually point to the same file.

Actually, there may be two hugepage files created by qemu.
One day I created a 4G VM and found that qemu created 2 hugepage files and sent both of them to vhost-user.
You can try to test it.

> 
> In theory we should use the fd for each region to map each memory region.
> In fact we could map once. This will also save address space for 1GB huge page
> due to mmap alignment requirement.
>>
>> region 0:
>>         gpa = 0x0
>>         size = 655360
>>         ua = 0x2aaaaac00000
>>         offset = 0
>>
>> region 1:// use this region to mmap.BTW how to avoid mmap twice when there
>> are two devices?
>>         gpa = 0xC0000
>>         size = 2146697216
>>         ua = 0x2aaaaacc0000
>>         offset = 786432
> 
> What do you mean by two devices?
>>
>>

E.g., if there are two vhost-user backends in a VM, we will receive two SET_MEM_TABLE messages, but we actually only need to mmap once, in one of the messages.

I think qemu should add a new message that sends all the hugepage fds and sizes at once.
That way we would not need to mmap and calculate memory in the set_mem_table message.

>>
>>> 2) what message is the indicator for vhost start/release?
>>> Previously  for vhost-cuse, it has SET_BACKEND message.
>>> What we should do for vhost-user?
>>> SET_VRING_KICK for start?
>>> What about for release?
>>> Unlike the kernel virtio, the DPDK virtio in guest could be restarted.
>>>
>>> Thoughts?
>>>
>>> -huawei
>>>
>>>
>>
>> --
>> Regards,
>> Haifeng
> 
> 
> .
> 

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-13  6:30   ` Linhaifeng
@ 2014-11-14  2:30     ` Tetsuya Mukawa
  2014-11-14  3:13       ` Linhaifeng
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  2:30 UTC (permalink / raw)
  To: Linhaifeng, Xie, Huawei; +Cc: dev

Hi Lin,

(2014/11/13 15:30), Linhaifeng wrote:
> On 2014/11/12 12:12, Tetsuya Mukawa wrote:
>> Hi Xie,
>>
>> (2014/11/12 6:37), Xie, Huawei wrote:
>>> Hi Tetsuya:
>>> There are two major technical issues in my mind for vhost-user implementation.
>>>
>>> 1) memory region map
>>> Vhost-user passes us file fd and offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue long time ago, and also found
>>> that I couldn't mmap the huge page file even with correct offset(need double check).
>>> Just now I find that people reported this issue on Nov 3.
>>> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
>>> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the  whole file.
>>> I think we should use this way temporarily to support qemu-2.1 as it has that bug.
>> I agree with you.
>> Also we may have an issue about un-mapping file on hugetlbfs of linux.
>> When I check munmap(), it seems 'size' need to be aligned by hugepage size.
>> (I guess it may be a kernel bug. Might be fixed already.)
>> Please add return value checking code for munmap().
>> Still munmap() might be failed.
>>
> are you munmmap the region 0? region 0 is not need to mmap so not need to munmap too.
>
> I can munmap success with the other regions.
Could you please let me know what size you specify when you
munmap region 1?

I still fail to munmap region 1.
Here is a patch to the vhost-user test of QEMU. Could you please check it?

----------------------------------
diff --git a/tests/vhost-user-test.c b/tests/vhost-user-test.c
index 75fedf0..4e17910 100644
--- a/tests/vhost-user-test.c
+++ b/tests/vhost-user-test.c
@@ -37,7 +37,7 @@
#endif

#define QEMU_CMD_ACCEL " -machine accel=tcg"
-#define QEMU_CMD_MEM " -m 512 -object
memory-backend-file,id=mem,size=512M,"\
+#define QEMU_CMD_MEM " -m 6000 -object
memory-backend-file,id=mem,size=6000M,"\
"mem-path=%s,share=on -numa node,memdev=mem"
#define QEMU_CMD_CHR " -chardev socket,id=chr0,path=%s"
#define QEMU_CMD_NETDEV " -netdev
vhost-user,id=net0,chardev=chr0,vhostforce"
@@ -221,14 +221,16 @@ static void read_guest_mem(void)

/* check for sanity */
g_assert_cmpint(fds_num, >, 0);
- g_assert_cmpint(fds_num, ==, memory.nregions);
+ //g_assert_cmpint(fds_num, ==, memory.nregions);

+ fprintf(stderr, "%s(%d)\n", __func__, __LINE__);
/* iterate all regions */
for (i = 0; i < fds_num; i++) {
+ int ret = 0;

/* We'll check only the region statring at 0x0*/
if (memory.regions[i].guest_phys_addr != 0x0) {
- continue;
+ //continue;
}

g_assert_cmpint(memory.regions[i].memory_size, >, 1024);
@@ -237,6 +239,13 @@ static void read_guest_mem(void)

guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,
MAP_SHARED, fds[i], 0);
+ fprintf(stderr, "guest_phys_addr=%lu, memory_size=%lu, "
+ "userspace_addr=%lu, mmap_offset=%lu\n",
+ memory.regions[i].guest_phys_addr,
+ memory.regions[i].memory_size,
+ memory.regions[i].userspace_addr,
+ memory.regions[i].mmap_offset);
+ fprintf(stderr, "mmap=%p, size=%lu\n", guest_mem, size);

g_assert(guest_mem != MAP_FAILED);
guest_mem += (memory.regions[i].mmap_offset / sizeof(*guest_mem));
@@ -248,7 +257,20 @@ static void read_guest_mem(void)
g_assert_cmpint(a, ==, b);
}

- munmap(guest_mem, memory.regions[i].memory_size);
+ ret = munmap(guest_mem, memory.regions[i].memory_size);
+ fprintf(stderr, "munmap=%p, size=%lu, ret=%d\n",
+ guest_mem, memory.regions[i].memory_size, ret);
+ {
+ size_t hugepagesize;
+
+ size = memory.regions[i].memory_size;
+ /* assume hugepage size is 1GB, try again */
+ hugepagesize = 1024 * 1024 * 1024;
+ size = (size + hugepagesize - 1) / hugepagesize * hugepagesize;
+ }
+ ret = munmap(guest_mem, size);
+ fprintf(stderr, "munmap=%p, size=%lu, ret=%d\n",
+ guest_mem, size, ret);
}

g_assert_cmpint(1, ==, 1);
----------------------------------
$ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
region=0, mmap=0x2aaac0000000, size=6291456000
region=0, munmap=0x2aab80000000, size=3070230528, ret=-1 << failed
region=0, munmap=0x2aab80000000, size=3221225472, ret=0
region=1, mmap=0x2aab80000000, size=655360
region=1, munmap=0x2aab80000000, size=655360, ret=-1 << failed
region=1, munmap=0x2aab80000000, size=1073741824, ret=0


Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  2:24     ` Linhaifeng
@ 2014-11-14  2:35       ` Tetsuya Mukawa
  2014-11-14  6:24       ` Xie, Huawei
  1 sibling, 0 replies; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  2:35 UTC (permalink / raw)
  To: Linhaifeng, Xie, Huawei, dev, lilijun, zhangkun

(2014/11/14 11:24), Linhaifeng wrote:
> On 2014/11/14 9:28, Xie, Huawei wrote:
> actually there may be two hugepage files created by qemu. one day i
> create a 4G VM found qemu create 2 hugepage file and send them to
> vhost-user. you can try to test it.

That's the case.
Because I didn't think we could actually do it like that, I tried to mmap
region 0 with the guest memory size.

Thanks,
Tetsuya.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  0:22   ` Xie, Huawei
@ 2014-11-14  2:52     ` Tetsuya Mukawa
  2014-11-15  1:42       ` Xie, Huawei
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  2:52 UTC (permalink / raw)
  To: Xie, Huawei, dev

Hi Xie,

(2014/11/14 9:22), Xie, Huawei wrote:
>> I think so. I guess we need to consider 2 types of restarting. One is
>> virtio-net driver restarting, the other is vhost-user backend
>> restarting. But, so far, it's nice to start to think about virtio-net
>> driver restarting first. Probably we need to implement a way to let
>> vhost-user backend know virtio-net driver is restarted. I am not sure
>> what is good way to let vhost-user backend know it. But how about
>> followings RFC? 
> I checked your code today, and didn't find the logic to deal with virtio reconfiguration.
Yes.
I guess the first implementation of librte_vhost may just replace the
vhost-example functionality.
Probably vhost-example doesn't consider restarting.
Because of this, I haven't implemented it.

> My thought without new message support: When vhost-user receives
> another configuration message since last time it is ready for
> processing, then we could release it from data core, and process the
> next reconfiguration message, and then re-add it to data core when it
> is ready again(check new kick message as before). The candidate
> message is set_mem_table. It is ok we keep the device on data core
> until we receive the new reconfiguration message. Just waste vhost
> some cycles checking the avail idx.

For example, let's assume DPDK app1 is started on the guest with a virtio-net
device port.
If DPDK app1 on the guest is stopped, and another DPDK app2 is started on the
guest without a virtio-net device port,
the hugepages DPDK app1 used will be reused by DPDK app2.
It means the memory accessed by the vhost-user backend might be changed by
DPDK app2,
and the vhost-user backend will crash.
So I guess we need some kind of reset message.

Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  2:30     ` Tetsuya Mukawa
@ 2014-11-14  3:13       ` Linhaifeng
  2014-11-14  3:40         ` Tetsuya Mukawa
  0 siblings, 1 reply; 22+ messages in thread
From: Linhaifeng @ 2014-11-14  3:13 UTC (permalink / raw)
  To: Tetsuya Mukawa, Xie, Huawei; +Cc: dev



On 2014/11/14 10:30, Tetsuya Mukawa wrote:
> Hi Lin,
> 
> (2014/11/13 15:30), Linhaifeng wrote:
>> On 2014/11/12 12:12, Tetsuya Mukawa wrote:
>>> Hi Xie,
>>>
>>> (2014/11/12 6:37), Xie, Huawei wrote:
>>>> Hi Tetsuya:
>>>> There are two major technical issues in my mind for vhost-user implementation.
>>>>
>>>> 1) memory region map
>>>> Vhost-user passes us file fd and offset for each memory region. Unfortunately the mmap offset is "very" wrong. I discovered this issue long time ago, and also found
>>>> that I couldn't mmap the huge page file even with correct offset(need double check).
>>>> Just now I find that people reported this issue on Nov 3.
>>>> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
>>>> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the fd for region(0) to map the  whole file.
>>>> I think we should use this way temporarily to support qemu-2.1 as it has that bug.
>>> I agree with you.
>>> Also we may have an issue about un-mapping file on hugetlbfs of linux.
>>> When I check munmap(), it seems 'size' need to be aligned by hugepage size.
>>> (I guess it may be a kernel bug. Might be fixed already.)
>>> Please add return value checking code for munmap().
>>> Still munmap() might be failed.
>>>
>> are you munmmap the region 0? region 0 is not need to mmap so not need to munmap too.
>>
>> I can munmap success with the other regions.
> Could you please let me know how many size do you specify when you
> munmap region1?
> 

2G (region->memory_size + region->mmap_offset)

> I still fail to munmap region1.
> Here is a patch to vhost-user test of QEMU. Could you please check it?
> 
> ----------------------------------
> diff --git a/tests/vhost-user-test.c b/tests/vhost-user-test.c
> index 75fedf0..4e17910 100644
> --- a/tests/vhost-user-test.c
> +++ b/tests/vhost-user-test.c
> @@ -37,7 +37,7 @@
> #endif
> 
> #define QEMU_CMD_ACCEL " -machine accel=tcg"
> -#define QEMU_CMD_MEM " -m 512 -object
> memory-backend-file,id=mem,size=512M,"\
> +#define QEMU_CMD_MEM " -m 6000 -object
> memory-backend-file,id=mem,size=6000M,"\
> "mem-path=%s,share=on -numa node,memdev=mem"
> #define QEMU_CMD_CHR " -chardev socket,id=chr0,path=%s"
> #define QEMU_CMD_NETDEV " -netdev
> vhost-user,id=net0,chardev=chr0,vhostforce"
> @@ -221,14 +221,16 @@ static void read_guest_mem(void)
> 
> /* check for sanity */
> g_assert_cmpint(fds_num, >, 0);
> - g_assert_cmpint(fds_num, ==, memory.nregions);
> + //g_assert_cmpint(fds_num, ==, memory.nregions);
> 
> + fprintf(stderr, "%s(%d)\n", __func__, __LINE__);
> /* iterate all regions */
> for (i = 0; i < fds_num; i++) {
> + int ret = 0;
> 
> /* We'll check only the region statring at 0x0*/
> if (memory.regions[i].guest_phys_addr != 0x0) {
> - continue;
> + //continue;
> }

if (memory.regions[i].guest_phys_addr == 0x0) {
	close(fds[i]);
	continue;
}

> 
> g_assert_cmpint(memory.regions[i].memory_size, >, 1024);
> @@ -237,6 +239,13 @@ static void read_guest_mem(void)
> 
> guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_SHARED, fds[i], 0);
> + fprintf(stderr, "guest_phys_addr=%lu, memory_size=%lu, "
> + "userspace_addr=%lu, mmap_offset=%lu\n",
> + memory.regions[i].guest_phys_addr,
> + memory.regions[i].memory_size,
> + memory.regions[i].userspace_addr,
> + memory.regions[i].mmap_offset);
> + fprintf(stderr, "mmap=%p, size=%lu\n", guest_mem, size);
> 
> g_assert(guest_mem != MAP_FAILED);
> guest_mem += (memory.regions[i].mmap_offset / sizeof(*guest_mem));
> @@ -248,7 +257,20 @@ static void read_guest_mem(void)
> g_assert_cmpint(a, ==, b);
> }
> 
> - munmap(guest_mem, memory.regions[i].memory_size);
> + ret = munmap(guest_mem, memory.regions[i].memory_size);
> + fprintf(stderr, "munmap=%p, size=%lu, ret=%d\n",
> + guest_mem, memory.regions[i].memory_size, ret);
> + {
> + size_t hugepagesize;
> +
> + size = memory.regions[i].memory_size;
> + /* assume hugepage size is 1GB, try again */
> + hugepagesize = 1024 * 1024 * 1024;
> + size = (size + hugepagesize - 1) / hugepagesize * hugepagesize;
> + }
size should be the same as for mmap, and
guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
> + ret = munmap(guest_mem, size);
> + fprintf(stderr, "munmap=%p, size=%lu, ret=%d\n",
> + guest_mem, size, ret);
> }
> 
> g_assert_cmpint(1, ==, 1);
> ----------------------------------
> $ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
> region=0, mmap=0x2aaac0000000, size=6291456000
> region=0, munmap=0x2aab80000000, size=3070230528, ret=-1 << failed
> region=0, munmap=0x2aab80000000, size=3221225472, ret=0
> region=1, mmap=0x2aab80000000, size=655360
> region=1, munmap=0x2aab80000000, size=655360, ret=-1 << failed
> region=1, munmap=0x2aab80000000, size=1073741824, ret=0
> 
> 
> Thanks,
> Tetsuya
> 
> .
> 

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  3:13       ` Linhaifeng
@ 2014-11-14  3:40         ` Tetsuya Mukawa
  2014-11-14  4:05           ` Tetsuya Mukawa
  2014-11-14  4:42           ` Linhaifeng
  0 siblings, 2 replies; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  3:40 UTC (permalink / raw)
  To: Linhaifeng, Xie, Huawei; +Cc: dev

Hi Lin,

(2014/11/14 12:13), Linhaifeng wrote:
>
> size should be same as mmap and
> guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
>

Thanks. It should be.
How about the following patch?

-------------------------------------------------------
diff --git a/tests/vhost-user-test.c b/tests/vhost-user-test.c
index 75fedf0..be4b171 100644
--- a/tests/vhost-user-test.c
+++ b/tests/vhost-user-test.c
@@ -37,7 +37,7 @@
#endif

#define QEMU_CMD_ACCEL " -machine accel=tcg"
-#define QEMU_CMD_MEM " -m 512 -object
memory-backend-file,id=mem,size=512M,"\
+#define QEMU_CMD_MEM " -m 6000 -object
memory-backend-file,id=mem,size=6000M,"\
"mem-path=%s,share=on -numa node,memdev=mem"
#define QEMU_CMD_CHR " -chardev socket,id=chr0,path=%s"
#define QEMU_CMD_NETDEV " -netdev
vhost-user,id=net0,chardev=chr0,vhostforce"
@@ -221,13 +221,16 @@ static void read_guest_mem(void)

/* check for sanity */
g_assert_cmpint(fds_num, >, 0);
- g_assert_cmpint(fds_num, ==, memory.nregions);
+ //g_assert_cmpint(fds_num, ==, memory.nregions);

+ fprintf(stderr, "%s(%d)\n", __func__, __LINE__);
/* iterate all regions */
for (i = 0; i < fds_num; i++) {
+ int ret = 0;

/* We'll check only the region statring at 0x0*/
- if (memory.regions[i].guest_phys_addr != 0x0) {
+ if (memory.regions[i].guest_phys_addr == 0x0) {
+ close(fds[i]);
continue;
}

@@ -237,6 +240,7 @@ static void read_guest_mem(void)

guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,
MAP_SHARED, fds[i], 0);
+ fprintf(stderr, "region=%d, mmap=%p, size=%lu\n", i, guest_mem, size);

g_assert(guest_mem != MAP_FAILED);
guest_mem += (memory.regions[i].mmap_offset / sizeof(*guest_mem));
@@ -247,8 +251,10 @@ static void read_guest_mem(void)

g_assert_cmpint(a, ==, b);
}
-
- munmap(guest_mem, memory.regions[i].memory_size);
+ guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
+ ret = munmap(guest_mem, memory.regions[i].memory_size);
+ fprintf(stderr, "region=%d, munmap=%p, size=%lu, ret=%d\n",
+ i, guest_mem, size, ret);
}

g_assert_cmpint(1, ==, 1);
-------------------------------------------------------
I am using 1GB hugepage size.

$ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
region=0, mmap=0x2aaac0000000, size=6291456000
region=0, munmap=0x2aaac0000000, size=6291456000, ret=-1 << failed

6291456000 is not aligned to 1GB.
When I specify 4096MB as the guest memory size, munmap() doesn't return an
error, as shown below.

$ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
region=0, mmap=0x2aaac0000000, size=4294967296
region=0, munmap=0x2aaac0000000, size=4294967296, ret=0
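
(For reference: 6000MB is 6000 * 1048576 = 6291456000 bytes, which is not a multiple of 1073741824 (1GB), while 4096MB is 4294967296 bytes, exactly 4GB, which is why the second munmap succeeds without any extra alignment.)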

Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  3:40         ` Tetsuya Mukawa
@ 2014-11-14  4:05           ` Tetsuya Mukawa
  2014-11-14  4:42           ` Linhaifeng
  1 sibling, 0 replies; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  4:05 UTC (permalink / raw)
  To: Linhaifeng, Xie, Huawei; +Cc: dev

(2014/11/14 12:40), Tetsuya Mukawa wrote:
> I am using 1GB hugepage size.
>
> $ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
> region=0, mmap=0x2aaac0000000, size=6291456000
> region=0, munmap=0x2aaac0000000, size=6291456000, ret=-1 << failed
>
> 6291456000 is not aligned by 1GB.
> When I specify 4096MB as guest memory size, munmap() doesn't return
> error like following.
>
> $ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
> region=0, mmap=0x2aaac0000000, size=4294967296
> region=0, munmap=0x2aaac0000000, size=4294967296, ret=0
>
>
Also, I've checked the mmap2 and munmap implementations of the current Linux kernel.
When a file on hugetlbfs is mapped, 'size' will be aligned to the hugepage
size in some cases.
But when munmap is called, 'size' will be aligned to PAGE_SIZE.
It means we cannot use the same 'size' value for mmap and munmap in some cases.
I guess this implementation (or specification) causes the munmap issue.

Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  3:40         ` Tetsuya Mukawa
  2014-11-14  4:05           ` Tetsuya Mukawa
@ 2014-11-14  4:42           ` Linhaifeng
  2014-11-14  5:12             ` Tetsuya Mukawa
  1 sibling, 1 reply; 22+ messages in thread
From: Linhaifeng @ 2014-11-14  4:42 UTC (permalink / raw)
  To: Tetsuya Mukawa, Xie, Huawei; +Cc: dev



On 2014/11/14 11:40, Tetsuya Mukawa wrote:
> Hi Lin,
> 
> (2014/11/14 12:13), Linhaifeng wrote:
>>
>> size should be same as mmap and
>> guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
>>
> 
> Thanks. It should be.
> How about following patch?
> 
> -------------------------------------------------------
> diff --git a/tests/vhost-user-test.c b/tests/vhost-user-test.c
> index 75fedf0..be4b171 100644
> --- a/tests/vhost-user-test.c
> +++ b/tests/vhost-user-test.c
> @@ -37,7 +37,7 @@
> #endif
> 
> #define QEMU_CMD_ACCEL " -machine accel=tcg"
> -#define QEMU_CMD_MEM " -m 512 -object
> memory-backend-file,id=mem,size=512M,"\
> +#define QEMU_CMD_MEM " -m 6000 -object
> memory-backend-file,id=mem,size=6000M,"\
> "mem-path=%s,share=on -numa node,memdev=mem"
> #define QEMU_CMD_CHR " -chardev socket,id=chr0,path=%s"
> #define QEMU_CMD_NETDEV " -netdev
> vhost-user,id=net0,chardev=chr0,vhostforce"
> @@ -221,13 +221,16 @@ static void read_guest_mem(void)
> 
> /* check for sanity */
> g_assert_cmpint(fds_num, >, 0);
> - g_assert_cmpint(fds_num, ==, memory.nregions);
> + //g_assert_cmpint(fds_num, ==, memory.nregions);
> 
> + fprintf(stderr, "%s(%d)\n", __func__, __LINE__);
> /* iterate all regions */
> for (i = 0; i < fds_num; i++) {
> + int ret = 0;
> 
> /* We'll check only the region statring at 0x0*/
> - if (memory.regions[i].guest_phys_addr != 0x0) {
> + if (memory.regions[i].guest_phys_addr == 0x0) {
> + close(fds[i]);
> continue;
> }
> 
> @@ -237,6 +240,7 @@ static void read_guest_mem(void)
> 
> guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,


What is the size? mmap_size + mmap_offset?


> MAP_SHARED, fds[i], 0);
> + fprintf(stderr, "region=%d, mmap=%p, size=%lu\n", i, guest_mem, size);
> 
> g_assert(guest_mem != MAP_FAILED);
> guest_mem += (memory.regions[i].mmap_offset / sizeof(*guest_mem));
> @@ -247,8 +251,10 @@ static void read_guest_mem(void)
> 
> g_assert_cmpint(a, ==, b);
> }
> -
> - munmap(guest_mem, memory.regions[i].memory_size);
> + guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
> + ret = munmap(guest_mem, memory.regions[i].memory_size);

memory.regions[i].memory_size --> memory.regions[i].memory_size + memory.regions[i].mmap_offset

Check that you have applied qemu's patch: [PATCH] vhost-user: fix mmap offset calculation

> + fprintf(stderr, "region=%d, munmap=%p, size=%lu, ret=%d\n",
> + i, guest_mem, size, ret);
> }
> 
> g_assert_cmpint(1, ==, 1);
> -------------------------------------------------------
> I am using 1GB hugepage size.
> 
> $ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
> region=0, mmap=0x2aaac0000000, size=6291456000
> region=0, munmap=0x2aaac0000000, size=6291456000, ret=-1 << failed
> 
> 6291456000 is not aligned by 1GB.
> When I specify 4096MB as guest memory size, munmap() doesn't return
> error like following.
> 
> $ sudo QTEST_HUGETLBFS_PATH=/mnt/huge make check
> region=0, mmap=0x2aaac0000000, size=4294967296
> region=0, munmap=0x2aaac0000000, size=4294967296, ret=0
> 
> Thanks,
> Tetsuya
> 
> .
> 

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  4:42           ` Linhaifeng
@ 2014-11-14  5:12             ` Tetsuya Mukawa
  2014-11-14  5:30               ` Linhaifeng
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  5:12 UTC (permalink / raw)
  To: Linhaifeng, Xie, Huawei; +Cc: dev

Hi Lin,

(2014/11/14 13:42), Linhaifeng wrote:
>
> On 2014/11/14 11:40, Tetsuya Mukawa wrote:
>> Hi Lin,
>>
>> (2014/11/14 12:13), Linhaifeng wrote:
>>> size should be same as mmap and
>>> guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
>>>
>> Thanks. It should be.
>> How about following patch?
>>
>> -------------------------------------------------------
>> diff --git a/tests/vhost-user-test.c b/tests/vhost-user-test.c
>> index 75fedf0..be4b171 100644
>> --- a/tests/vhost-user-test.c
>> +++ b/tests/vhost-user-test.c
>> @@ -37,7 +37,7 @@
>> #endif
>>
>> #define QEMU_CMD_ACCEL " -machine accel=tcg"
>> -#define QEMU_CMD_MEM " -m 512 -object
>> memory-backend-file,id=mem,size=512M,"\
>> +#define QEMU_CMD_MEM " -m 6000 -object
>> memory-backend-file,id=mem,size=6000M,"\
>> "mem-path=%s,share=on -numa node,memdev=mem"
>> #define QEMU_CMD_CHR " -chardev socket,id=chr0,path=%s"
>> #define QEMU_CMD_NETDEV " -netdev
>> vhost-user,id=net0,chardev=chr0,vhostforce"
>> @@ -221,13 +221,16 @@ static void read_guest_mem(void)
>>
>> /* check for sanity */
>> g_assert_cmpint(fds_num, >, 0);
>> - g_assert_cmpint(fds_num, ==, memory.nregions);
>> + //g_assert_cmpint(fds_num, ==, memory.nregions);
>>
>> + fprintf(stderr, "%s(%d)\n", __func__, __LINE__);
>> /* iterate all regions */
>> for (i = 0; i < fds_num; i++) {
>> + int ret = 0;
>>
>> /* We'll check only the region statring at 0x0*/
>> - if (memory.regions[i].guest_phys_addr != 0x0) {
>> + if (memory.regions[i].guest_phys_addr == 0x0) {
>> + close(fds[i]);
>> continue;
>> }
>>
>> @@ -237,6 +240,7 @@ static void read_guest_mem(void)
>>
>> guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,
>
> How many is size? mmap_size + mmap_offset ?
In this case, the guest memory length is the size.
I added messages from this program in my last email.
Could you please also check them?

>
>
>> MAP_SHARED, fds[i], 0);
>> + fprintf(stderr, "region=%d, mmap=%p, size=%lu\n", i, guest_mem, size);
>>
>> g_assert(guest_mem != MAP_FAILED);
>> guest_mem += (memory.regions[i].mmap_offset / sizeof(*guest_mem));
>> @@ -247,8 +251,10 @@ static void read_guest_mem(void)
>>
>> g_assert_cmpint(a, ==, b);
>> }
>> -
>> - munmap(guest_mem, memory.regions[i].memory_size);
>> + guest_mem -= (memory.regions[i].mmap_offset / sizeof(*guest_mem));
>> + ret = munmap(guest_mem, memory.regions[i].memory_size);
> memory.regions[i].memory_size --> memory.regions[i].memory_size + memory.regions[i].memory_offset
>
> check you have apply qemu's patch: [PATCH] vhost-user: fix mmap offset calculation
I checked it using the latest QEMU code,
so the patch you mentioned is included.

I guess you can munmap the file because your 'size' is aligned to the hugepage
size (e.g., 2GB).
Could you please try another value like 6000MB?

Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  5:12             ` Tetsuya Mukawa
@ 2014-11-14  5:30               ` Linhaifeng
  2014-11-14  6:57                 ` Tetsuya Mukawa
  0 siblings, 1 reply; 22+ messages in thread
From: Linhaifeng @ 2014-11-14  5:30 UTC (permalink / raw)
  To: Tetsuya Mukawa, Xie, Huawei; +Cc: dev



On 2014/11/14 13:12, Tetsuya Mukawa wrote:
> Could you please try another value like 6000MB?

I have tried this value, 6000MB. I can munmap successfully.

If you mmap with size "memory_size + mmap_offset", you should also munmap with this size.

-- 
Regards,
Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  2:24     ` Linhaifeng
  2014-11-14  2:35       ` Tetsuya Mukawa
@ 2014-11-14  6:24       ` Xie, Huawei
  1 sibling, 0 replies; 22+ messages in thread
From: Xie, Huawei @ 2014-11-14  6:24 UTC (permalink / raw)
  To: Linhaifeng, 'Tetsuya Mukawa', dev, lilijun, zhangkun



> -----Original Message-----
> From: Linhaifeng [mailto:haifeng.lin@huawei.com]
> Sent: Thursday, November 13, 2014 7:24 PM
> To: Xie, Huawei; 'Tetsuya Mukawa'; dev@dpdk.org; lilijun; zhangkun
> Subject: Re: [dpdk-dev] vhost-user technical isssues
> 
> 
> 
> On 2014/11/14 9:28, Xie, Huawei wrote:
> >
> >
> >> -----Original Message-----
> >> From: Linhaifeng [mailto:haifeng.lin@huawei.com]
> >> Sent: Wednesday, November 12, 2014 11:28 PM
> >> To: Xie, Huawei; 'Tetsuya Mukawa'; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] vhost-user technical isssues
> >>
> >>
> >>
> >> On 2014/11/12 5:37, Xie, Huawei wrote:
> >>> Hi Tetsuya:
> >>> There are two major technical issues in my mind for vhost-user
> >> implementation.
> >>>
> >>> 1) memory region map
> >>> Vhost-user passes us file fd and offset for each memory region.
> Unfortunately
> >> the mmap offset is "very" wrong. I discovered this issue long time ago, and
> also
> >> found
> >>> that I couldn't mmap the huge page file even with correct offset(need
> double
> >> check).
> >>> Just now I find that people reported this issue on Nov 3.
> >>> [Qemu-devel] [PULL 27/29] vhost-user: fix mmap offset calculation
> >>> Anyway, I turned to the same idea used in our DPDK vhost-cuse: only use the
> fd
> >> for region(0) to map the  whole file.
> >>> I think we should use this way temporarily to support qemu-2.1 as it has that
> >> bug.
> >>>
> >>
> >> the size of region 0 is not same as the file size. may be you should mmap the
> >> other region.
> > Haifeng:
> >
> > Will calculate the maximum memory size, and use any file fd to mmap it.
> > Here we assume the fds for different regions actually point to the same file.
> 
> actually there may be two hugepage files created by qemu.
> one day i create a 4G VM found qemu create 2 hugepage file and send them to
> vhost-user.
> you can try to test it.
Ok, if that is the case, we need to fix vhost-cuse as well.

> 
> >
> > In theory we should use the fd for each region to map each memory region.
> > In fact we could map once. This will also save address space for 1GB huge page
> > due to mmap alignment requirement.
> >>
> >> region 0:
> >>         gpa = 0x0
> >>         size = 655360
> >>         ua = 0x2aaaaac00000
> >>         offset = 0
> >>
> >> region 1:// use this region to mmap.BTW how to avoid mmap twice when
> there
> >> are two devices?
> >>         gpa = 0xC0000
> >>         size = 2146697216
> >>         ua = 0x2aaaaacc0000
> >>         offset = 786432
> >
> > What do you mean by two devices?
> >>
> >>
> 
> e.g there are two vhost-user backends in a VM, we will receive two
> SET_MEM_TABLE messages, actually we only need mmap once in one message.
> 
> I think qemu should add a new message to send all hugepage fd and size once.
> as this we not need to mmap and calculate memory in set_mem_table message.
> 
> >>
> >>> 2) what message is the indicator for vhost start/release?
> >>> Previously  for vhost-cuse, it has SET_BACKEND message.
> >>> What we should do for vhost-user?
> >>> SET_VRING_KICK for start?
> >>> What about for release?
> >>> Unlike the kernel virtio, the DPDK virtio in guest could be restarted.
> >>>
> >>> Thoughts?
> >>>
> >>> -huawei
> >>>
> >>>
> >>
> >> --
> >> Regards,
> >> Haifeng
> >
> >
> > .
> >
> 
> --
> Regards,
> Haifeng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  5:30               ` Linhaifeng
@ 2014-11-14  6:57                 ` Tetsuya Mukawa
  2014-11-14 10:59                   ` Xie, Huawei
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-14  6:57 UTC (permalink / raw)
  To: Linhaifeng, Xie, Huawei; +Cc: dev

Hi Lin,
(2014/11/14 14:30), Linhaifeng wrote:
>
> On 2014/11/14 13:12, Tetsuya Mukawa wrote:
>> ease try another value like 6000MB
> i have try this value 6000MB.I can munmap success.
>
> you mmap with size "memory_size + memory_offset" should also munmap with this size.
>
I appreciate your testing and suggestions. :)
I am not sure what the difference is between your environment and my
environment.

Here is my code and the messages it prints.
---------------------------------------------
[code]
---------------------------------------------
size = memory.regions[i].memory_size + memory.regions[i].mmap_offset;

guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,
MAP_SHARED, fds[i], 0);

fprintf(stderr, "region=%d, mmap=%p, size=%lu\n", i, guest_mem, size);

g_assert(guest_mem != MAP_FAILED);

ret = munmap(guest_mem, size);

fprintf(stderr, "region=%d, munmap=%p, size=%lu, ret=%d\n",
i, guest_mem, size, ret);

---------------------------------------------
[messages]
---------------------------------------------
region=0, mmap=0x2aaac0000000, size=6291456000
region=0, munmap=0x2aaac0000000, size=6291456000, ret=-1

In your environment, 'ret' will be 0.
In my environment, 'size' must be aligned to avoid the error.
Anyway, it's nicer to keep the implementation simple.
When a munmap failure occurs, let's think about it again.

Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  6:57                 ` Tetsuya Mukawa
@ 2014-11-14 10:59                   ` Xie, Huawei
  2014-11-17  6:14                     ` Tetsuya Mukawa
  0 siblings, 1 reply; 22+ messages in thread
From: Xie, Huawei @ 2014-11-14 10:59 UTC (permalink / raw)
  To: Tetsuya Mukawa, Linhaifeng; +Cc: dev

I tested with the latest qemu (with the offset fix) in the vhost app (not with the test case); unmap succeeds only when the size is aligned to 1GB (the hugepage size).

Another important thing: could we do mmap(0, region[i].memory_size, PROT_XX, mmap_offset) rather than mapping with offset 0? With a region above 4GB, we would waste 4GB of address space. Or we at least need to round the offset down to the nearest 1GB and round the memory size up to the next 1GB, to avoid wasting address space.

Anyway, this is ugly. The kernel doesn't take care of this for us by doing the alignment automatically.
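
A minimal sketch of that rounding (assuming 1GB hugepages; real code would query the actual hugepage size of the backing file):

---------------------------------------------
/* Sketch: instead of always mapping from offset 0, round the region's file
 * offset down and its length up to hugepage boundaries and map that window. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

static void *map_region_window(int fd, uint64_t offset, uint64_t size)
{
    const uint64_t hp = 1ULL << 30;          /* assumption: 1GB hugepages */
    uint64_t start = offset & ~(hp - 1);     /* round offset down */
    uint64_t len = ((offset + size + hp - 1) & ~(hp - 1)) - start; /* round end up */
    uint8_t *base;

    base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (off_t)start);
    if (base == MAP_FAILED)
        return NULL;
    return base + (offset - start);          /* start of the region itself */
}
---------------------------------------------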

> -----Original Message-----
> From: Tetsuya Mukawa [mailto:mukawa@igel.co.jp]
> Sent: Thursday, November 13, 2014 11:57 PM
> To: Linhaifeng; Xie, Huawei
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] vhost-user technical isssues
> 
> Hi Lin,
> (2014/11/14 14:30), Linhaifeng wrote:
> >
> > On 2014/11/14 13:12, Tetsuya Mukawa wrote:
> >> ease try another value like 6000MB
> > i have try this value 6000MB.I can munmap success.
> >
> > you mmap with size "memory_size + memory_offset" should also munmap
> with this size.
> >
> I appreciate for your testing and sugesstions. :)
> I am not sure what is difference between your environment and my
> environment.
> 
> Here is my code and message from the code.
> ---------------------------------------------
> [code]
> ---------------------------------------------
> size = memory.regions[i].memory_size + memory.regions[i].mmap_offset;
> 
> guest_mem = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_SHARED, fds[i], 0);
> 
> fprintf(stderr, "region=%d, mmap=%p, size=%lu\n", i, guest_mem, size);
> 
> g_assert(guest_mem != MAP_FAILED);
> 
> ret = munmap(guest_mem, size);
> 
> fprintf(stderr, "region=%d, munmap=%p, size=%lu, ret=%d\n",
> i, guest_mem, size, ret);
> 
> ---------------------------------------------
> [messages]
> ---------------------------------------------
> region=0, mmap=0x2aaac0000000, size=6291456000
> region=0, munmap=0x2aaac0000000, size=6291456000, ret=-1
> 
> With your environment, 'ret' will be 0.
> In my environment, 'size' should be aligned not to get error.
> Anyway, it's nice to implement more simple.
> When munmap failure occurs, let's think it again.
> 
> Thanks,
> Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14  2:52     ` Tetsuya Mukawa
@ 2014-11-15  1:42       ` Xie, Huawei
  0 siblings, 0 replies; 22+ messages in thread
From: Xie, Huawei @ 2014-11-15  1:42 UTC (permalink / raw)
  To: Tetsuya Mukawa, dev



> -----Original Message-----
> From: Tetsuya Mukawa [mailto:mukawa@igel.co.jp]
> Sent: Thursday, November 13, 2014 7:53 PM
> To: Xie, Huawei; dev@dpdk.org
> Cc: Long, Thomas
> Subject: Re: vhost-user technical isssues
> 
> Hi Xie,
> 
> (2014/11/14 9:22), Xie, Huawei wrote:
> >> I think so. I guess we need to consider 2 types of restarting. One is
> >> virtio-net driver restarting, the other is vhost-user backend
> >> restarting. But, so far, it's nice to start to think about virtio-net
> >> driver restarting first. Probably we need to implement a way to let
> >> vhost-user backend know virtio-net driver is restarted. I am not sure
> >> what is good way to let vhost-user backend know it. But how about
> >> followings RFC?
> > I checked your code today, and didn't find the logic to deal with virtio
> reconfiguration.
> Yes.
> I guess the first implementation of librte_vhost may just replace
> vhost-example function.
> Probably vhost-example doesn't think about restarting.
> Because of this, I haven't implemented.
> 
> > My thought without new message support: When vhost-user receives
> > another configuration message since last time it is ready for
> > processing, then we could release it from data core, and process the
> > next reconfiguration message, and then re-add it to data core when it
> > is ready again(check new kick message as before). The candidate
> > message is set_mem_table. It is ok we keep the device on data core
> > until we receive the new reconfiguration message. Just waste vhost
> > some cycles checking the avail idx.
> 
> For example, let's assume DPDK app1 is started on guest with virtio-net
> device port.
> If DPDK app1 on guest is stopped, and other DPDK app2 on guest is
> started without virtio-net device port.
> Hugepages DPDK app1 used will be used by DPDK app2.
> It means the memory accessed by vhost-user backend might be changed by
> DPDK app2.
> And vhost-user backend will be crashed.
> So I guess we need some kinds of reset message.
> 

A virtio DPDK app crashing silently is an issue.
Let us check whether there is any possibility that the guest could crash the vhost backend.
Even with kernel virtio, I think the basic principle is that the host shouldn't trust the guest completely.
If vhost can be crashed by a virtio guest corrupting the ring, that is a problem with the design of our vhost backend. :) Hope I have made myself clear.
Let us check case by case later. I understand some security checks might slow down performance.

I think the real problem is the opposite: the host could crash the guest app, or even the guest kernel, if the guest is reusing the old memory that vhost is still accessing.

Btw, I checked the qemu code; based on the current qemu implementation, the VHOST_USER_GET_VRING_BASE message is sent, and only sent, during vring stop. So this message could be used to remove the virtio device from the data core temporarily. However, that sounds unreasonable given the name of the message.
I did an implementation based on this message, and the virtio PMD runs now.
Previously we couldn't switch virtio from the kernel driver to igb_uio; now it can switch smoothly between the two.
Check the RFC patch.
Besides, as stated in the patch, I think we should only leave the most common operations in the virtio layer, and move the control handling related to cuse/fuse into the cuse/fuse layer. It is difficult to handle all the differences in message flow.

> Thanks,
> Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [dpdk-dev] vhost-user technical isssues
  2014-11-14 10:59                   ` Xie, Huawei
@ 2014-11-17  6:14                     ` Tetsuya Mukawa
  0 siblings, 0 replies; 22+ messages in thread
From: Tetsuya Mukawa @ 2014-11-17  6:14 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev

Hi Xie,

(2014/11/14 19:59), Xie, Huawei wrote:
> I tested with latest qemu(with offset fix) in vhost app(not with test case), unmap succeeds only when the size is aligned to 1GB(hugepage size).
I appreciate your testing.

> Another important thing is  could we do mmap(0, region[i].memory_size, PROT_XX, mmap_offset) rather than with offset 0? With the region above 4GB, we will waste 4GB address space. Or we at least need to round down offset to nearest 1GB, and round up memory size to upper 1GB, to save some address space waste.
>
> Anyway, this is ugly. Kernel doesn't take care of us, do those alignment for us automatically.
>

It seems 'offset' should also be aligned to the hugepage size.
But that might be part of the mmap specification. The man page of mmap says 'offset'
should be a multiple of sysconf(_SC_PAGE_SIZE).
If the target file is on hugetlbfs, I guess the hugepage size is used as the
alignment size.

Thanks,
Tetsuya

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread [~2014-11-17  6:03 UTC | newest]

Thread overview: 22+ messages
2014-11-11 21:37 [dpdk-dev] vhost-user technical isssues Xie, Huawei
2014-11-12  4:12 ` Tetsuya Mukawa
2014-11-13  6:30   ` Linhaifeng
2014-11-14  2:30     ` Tetsuya Mukawa
2014-11-14  3:13       ` Linhaifeng
2014-11-14  3:40         ` Tetsuya Mukawa
2014-11-14  4:05           ` Tetsuya Mukawa
2014-11-14  4:42           ` Linhaifeng
2014-11-14  5:12             ` Tetsuya Mukawa
2014-11-14  5:30               ` Linhaifeng
2014-11-14  6:57                 ` Tetsuya Mukawa
2014-11-14 10:59                   ` Xie, Huawei
2014-11-17  6:14                     ` Tetsuya Mukawa
2014-11-14  0:22   ` Xie, Huawei
2014-11-14  2:52     ` Tetsuya Mukawa
2014-11-15  1:42       ` Xie, Huawei
2014-11-13  6:12 ` Linhaifeng
2014-11-13  6:27 ` Linhaifeng
2014-11-14  1:28   ` Xie, Huawei
2014-11-14  2:24     ` Linhaifeng
2014-11-14  2:35       ` Tetsuya Mukawa
2014-11-14  6:24       ` Xie, Huawei
