DPDK patches and discussions
* [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile
@ 2013-12-13 22:04 James Yu
  2013-12-13 23:01 ` Stephen Hemminger
  0 siblings, 1 reply; 6+ messages in thread
From: James Yu @ 2013-12-13 22:04 UTC (permalink / raw)
  To: dev

Resending it due to missing [dpdk-dev] in the subject line.

I am using Spirent to send 2 Gbps of traffic to a 10G port that is looped
back by l2fwd+DPDK+virtio in a 32-bit CentOS guest, and I receive only
700 Mbps on the other port. The 32-bit CentOS guest runs on a Fedora 18 KVM
host. The virtual interfaces are configured as virtio ports, not e1000;
qemu-kvm automatically uses vhost-net when the guest uses virtio ports.

The questions are:
A. Why can it only reach 2 Gbps?
B. Why does outw() account for 40% of the entire measurement when it only
writes 2 bytes to the IO port with the outw assembly instruction? Is it a
blocking call, or is the time spent mapping the guest's IO address to the
physical IO port address on the host?
C. Is there any way to improve it?
D. The vmxnet PMD code uses memory-mapped IO addresses, not port IO
addresses. Would memory-mapped IO be faster?

Any pointers or feedback will help.
Thanks

James

---
While the traffic is running, I run oprofile and opreport using the
following scripts in a separate xterm window.
1. ./oprofile_start.sh
2. wait for 10 seconds
3. ./oprofile_stop.sh
::::::::::::::
oprofile_start.sh
::::::::::::::
#!/bin/bash
opcontrol --reset
opcontrol --deinit
modprobe oprofile timer=1
opcontrol --no-vmlinux --separate=cpu,thread --callgraph=10 \
  --separate=kernel
opcontrol --session-dir=/root
opcontrol --start

::::::::::::::
oprofile_stop.sh
::::::::::::::
opcontrol --dump
opcontrol --stop
opcontrol --shutdown
opreport --session-dir=/root --details --merge tgid --symbols \
  /root/dpdk/dpdk-1.3.1r2/examples/l2fwd/build/l2fwd

Profiling through timer interrupt
vma      samples  %        image name               symbol name
00000d36 5445     40.1105  librte_pmd_virtio.so     outw
  00000d54 5442     99.9449
  00000d55 3         0.0551
00003032 3513     25.8785  librte_pmd_virtio.so     virtio_recv_buf

---
static void outw_jyu1(unsigned short int value, unsigned short int __port){
  __asm__ __volatile__ ("outw %w0,%w1": :"a" (value), "Nd" (__port));
}
---
This link
http://www.cs.nthu.edu.tw/~ychung/slides/Virtualization/VM-Lecture-2-3-IO%20Virtualization.pptx
(pages 17–22) describes how IO ports can be accessed.


* Re: [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile
  2013-12-13 22:04 [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile James Yu
@ 2013-12-13 23:01 ` Stephen Hemminger
  2013-12-16 23:35   ` James Yu
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen Hemminger @ 2013-12-13 23:01 UTC (permalink / raw)
  To: James Yu; +Cc: dev

On Fri, 13 Dec 2013 14:04:35 -0800
James Yu <ypyu2011@gmail.com> wrote:

> Resending it due to missing [dpdk-dev] in the subject line.
> 
> I am using Spirent to send a 2Gbps traffic to a 10G port that are looped
> back by l2fwd+DPDK+virtio in a CentOS 32-bit and receive on the other port
> only at 700 Mbps.   The CentOS 32-bit is on a Fedora 18 KVM host. The
> virtual interfaces are configured as virtio port type, not e1000. vhost-net
> was automatically used in qemu-kvm when virtio ports are used in the guest.
> 
> The questions are
> A. Why it can only reach 2Gbps
> B. Why outw() is using 40% of the entire measurement when it only try to
> write 2 bytes to the IO port using assembly outw command ? Is it a blocking
> call ? or it wastes time is mapping from the IO address of the guest to the
> physical address of the IO port on the host ?
> C. any way to improve it ?
> D. vmxnet PMD codes are using memory mapped IO address, not port IO
> address. Will it be faster to use memory mapped IO address ?
> 
> Any pointers or feedback will help.
> Thanks
> 
> James

The outw is a VM exit to the hypervisor: it informs the hypervisor that data
is ready to send, and the hypervisor then runs. To really get better
performance, virtio needs to be able to do multiple packets per send. For
bulk throughput GSO support would help, but that is a generic DPDK issue.

Virtio uses port I/O to signal the hypervisor (there is talk of using MMIO
in later versions, but it won't be faster).


* Re: [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile
  2013-12-13 23:01 ` Stephen Hemminger
@ 2013-12-16 23:35   ` James Yu
  2013-12-16 23:58     ` Stephen Hemminger
  0 siblings, 1 reply; 6+ messages in thread
From: James Yu @ 2013-12-16 23:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

(A) The packets I send are 64 bytes, not big packets, so I am not sure GSO
will help. For bigger packets it would help.

(B) What do you mean by "multiple packets per send"? Do you mean multiple
queue support, sending/receiving in parallel on multiple cores to speed it
up? Is that supported in DPDK 1.3.1r2?

(C)
There are two places that use dpdk_ring_doorbell() in virtio_user.c:
eth_tx_burst() and virtio_alloc_rxq(), which is called from virtio_recv_buf().
I looked at them further using "perf top -C 0". outw() can even occupy 80%
of logical core 0 on a 32-bit CentOS VM. Here is the implementation of
outw() after gcc preprocessing (-E):
static void outw(unsigned short int value, unsigned short int __port){
  __asm__ __volatile__ ("outw %w0,%w1": :"a" (value), "Nd" (__port));
}
Is the outw instruction a blocking call?
Based on this link http://wiki.osdev.org/Inline_Assembly/Examples, I am not
sure whether it blocks/waits.

The question is: what is causing it to block during the outw operation?
Is that normal?
If it is simply because IO virtualization must map the guest's IO port
address to the host's physical port address, can it be improved by using
VT-d? Or by using MMIO as described below?

There are two components in vmxnet3: the user-space PMD code and the kernel
driver. It uses MMIO to access the device memory.

vmxnet3.ko:
vmxnet3_alloc_pci_resources -> compat_pci_resource_start -> ioremap

userspace PMD:
vmxnet3_init_adapter:: adapter->hw_addr1 = (unsigned char *) mmap()

On the first virtual-address fault, the vmxnet3 driver finds the mapped IO
region and caches it; subsequent accesses to the virtual address are faster.

I wonder how virtio using outw() handles access to the IO port address.
Does it have to map the IO port address in the VM to the physical port
address on the host for EVERY access? If so, could it be improved by using
an approach similar to the vmxnet3 model?


Thanks

James




On Fri, Dec 13, 2013 at 3:01 PM, Stephen Hemminger <
stephen@networkplumber.org> wrote:

> On Fri, 13 Dec 2013 14:04:35 -0800
> James Yu <ypyu2011@gmail.com> wrote:
>
> > Resending it due to missing [dpdk-dev] in the subject line.
> >
> > I am using Spirent to send a 2Gbps traffic to a 10G port that are looped
> > back by l2fwd+DPDK+virtio in a CentOS 32-bit and receive on the other
> port
> > only at 700 Mbps.   The CentOS 32-bit is on a Fedora 18 KVM host. The
> > virtual interfaces are configured as virtio port type, not e1000.
> vhost-net
> > was automatically used in qemu-kvm when virtio ports are used in the
> guest.
> >
> > The questions are
> > A. Why it can only reach 2Gbps
> > B. Why outw() is using 40% of the entire measurement when it only try to
> > write 2 bytes to the IO port using assembly outw command ? Is it a
> blocking
> > call ? or it wastes time is mapping from the IO address of the guest to
> the
> > physical address of the IO port on the host ?
> > C. any way to improve it ?
> > D. vmxnet PMD codes are using memory mapped IO address, not port IO
> > address. Will it be faster to use memory mapped IO address ?
> >
> > Any pointers or feedback will help.
> > Thanks
> >
> > James
>
> The outw is a VM exit to the hypervisor. It informs the hypervisor that
> data
> is ready to send and it runs then. To really get better performance, virtio
> needs to be able to do multiple packets per send. For bulk throughput
> GSO support would help, but that is a generic DPDK issues.
>
> Virtio use I/O to signal hypervisor (there is talk of using MMIO in later
> versions but it won't be faster.
>
>


* Re: [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile
  2013-12-16 23:35   ` James Yu
@ 2013-12-16 23:58     ` Stephen Hemminger
  0 siblings, 0 replies; 6+ messages in thread
From: Stephen Hemminger @ 2013-12-16 23:58 UTC (permalink / raw)
  To: James Yu; +Cc: dev

On Mon, 16 Dec 2013 15:35:27 -0800
James Yu <ypyu2011@gmail.com> wrote:

> (A) The packets I sent are 64-bytes, not big packet. I am not sure GSO will
> help. For bigger packet, it will help.

It will not help with small packets.

> (B) What you do mean "multiple packets per second" ? Do you mean multiple
> queue support to send/receive parallel in multiple cores to speed it up ?
> Is it supported in DPDK 1.3.1r2 ?

In some cases it is possible to get multiple packets per send.
This happens when rte_tx_burst is called with more than one packet.

> (C)
> There are two places using dpdk_ring_doorbell() in virtio_user.c,
> eth_tx_burst() and virtio_alloc_rxq() which is called in virtio_recv_buf().
> I looked at them further using "top perf -C 0". It could even occupies 80%
> of the logical core 0 on a CentOS 32-bit VM. Here is the implementation of
> outw() using gcc preprocessing (-E)
> static void outw(unsigned short int value, unsigned short int __port){
>   __asm__ __volatile__ ("outw %w0,%w1": :"a" (value), "Nd" (__port));
> }
> Is outw command a blocking call ?
> Based on this link http://wiki.osdev.org/Inline_Assembly/Examples, I am not
> sure it is blocked/waiting.

The outw instruction causes a VM trap back to the hypervisor: port I/O is
not normally allowed from the guest, so the trap is used to notify the host.
VMware uses memory in a similar manner as a wakeup.


* [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile
@ 2013-12-13 22:02 James Yu
  0 siblings, 0 replies; 6+ messages in thread
From: James Yu @ 2013-12-13 22:02 UTC (permalink / raw)
  To: dev

I am using Spirent to send 2 Gbps of traffic to a 10G port that is looped
back by l2fwd+DPDK+virtio in a 32-bit CentOS guest, and I receive only
700 Mbps on the other port. The 32-bit CentOS guest runs on a Fedora 18 KVM
host. The virtual interfaces are configured as virtio ports, not e1000;
qemu-kvm automatically uses vhost-net when the guest uses virtio ports.

The questions are:
A. Why can it only reach 2 Gbps?
B. Why does outw() account for 40% of the entire measurement when it only
writes 2 bytes to the IO port with the outw assembly instruction? Is it a
blocking call, or is the time spent mapping the guest's IO address to the
physical IO port address on the host?
C. Is there any way to improve it?
D. The vmxnet PMD code uses memory-mapped IO addresses, not port IO
addresses. Would memory-mapped IO be faster?

Any pointers or feedback will help.
Thanks

James

---
While the traffic is running, I run oprofile and opreport using the
following scripts in a separate xterm window.
1. ./oprofile_start.sh
2. wait for 10 seconds
3. ./oprofile_stop.sh
::::::::::::::
oprofile_start.sh
::::::::::::::
#!/bin/bash
opcontrol --reset
opcontrol --deinit
modprobe oprofile timer=1
opcontrol --no-vmlinux --separate=cpu,thread --callgraph=10 \
  --separate=kernel
opcontrol --session-dir=/root
opcontrol --start

::::::::::::::
oprofile_stop.sh
::::::::::::::
opcontrol --dump
opcontrol --stop
opcontrol --shutdown
opreport --session-dir=/root --details --merge tgid --symbols \
  /root/dpdk/dpdk-1.3.1r2/examples/l2fwd/build/l2fwd

Profiling through timer interrupt
vma      samples  %        image name               symbol name
00000d36 5445     40.1105  librte_pmd_virtio.so     outw
  00000d54 5442     99.9449
  00000d55 3         0.0551
00003032 3513     25.8785  librte_pmd_virtio.so     virtio_recv_buf

---
static void outw_jyu1(unsigned short int value, unsigned short int __port){
  __asm__ __volatile__ ("outw %w0,%w1": :"a" (value), "Nd" (__port));
}
---
This link
http://www.cs.nthu.edu.tw/~ychung/slides/Virtualization/VM-Lecture-2-3-IO%20Virtualization.pptx
(pages 17–22) describes how IO ports can be accessed.


* [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile
@ 2013-12-13 21:59 James Yu
  0 siblings, 0 replies; 6+ messages in thread
From: James Yu @ 2013-12-13 21:59 UTC (permalink / raw)
  To: dev

I am using Spirent to send 2 Gbps of traffic to a 10G port that is looped
back by l2fwd+DPDK+virtio in a 32-bit CentOS guest, and I receive only
700 Mbps on the other port. The 32-bit CentOS guest runs on a Fedora 18 KVM
host. The virtual interfaces are configured as virtio ports, not e1000;
qemu-kvm automatically uses vhost-net when the guest uses virtio ports.

The questions are:
A. Why can it only reach 2 Gbps?
B. Why does outw() account for 40% of the entire measurement when it only
writes 2 bytes to the IO port with the outw assembly instruction? Is it a
blocking call, or is the time spent mapping the guest's IO address to the
physical IO port address on the host?
C. Is there any way to improve it?
D. The vmxnet PMD code uses memory-mapped IO addresses, not port IO
addresses. Would memory-mapped IO be faster?

Any pointers or feedback will help.
Thanks

James

---
While the traffic is running, I run oprofile and opreport using the
following scripts in a separate xterm window.
1. ./oprofile_start.sh
2. wait for 10 seconds
3. ./oprofile_stop.sh
::::::::::::::
oprofile_start.sh
::::::::::::::
#!/bin/bash
opcontrol --reset
opcontrol --deinit
modprobe oprofile timer=1
opcontrol --no-vmlinux --separate=cpu,thread --callgraph=10 \
  --separate=kernel
opcontrol --session-dir=/root
opcontrol --start

::::::::::::::
oprofile_stop.sh
::::::::::::::
opcontrol --dump
opcontrol --stop
opcontrol --shutdown
opreport --session-dir=/root --details --merge tgid --symbols \
  /root/dpdk/dpdk-1.3.1r2/examples/l2fwd/build/l2fwd

Profiling through timer interrupt
vma      samples  %        image name               symbol name
00000d36 5445     40.1105  librte_pmd_virtio.so     outw
  00000d54 5442     99.9449
  00000d55 3         0.0551
00003032 3513     25.8785  librte_pmd_virtio.so     virtio_recv_buf

---
static void outw_jyu1(unsigned short int value, unsigned short int __port){
  __asm__ __volatile__ ("outw %w0,%w1": :"a" (value), "Nd" (__port));
}
---
This link
http://www.cs.nthu.edu.tw/~ychung/slides/Virtualization/VM-Lecture-2-3-IO%20Virtualization.pptx
(pages 17–22) describes how IO ports can be accessed.


end of thread, other threads:[~2013-12-16 23:57 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-13 22:04 [dpdk-dev] outw() in virtio_ring_doorbell() in DPDK+virtio consume 40% of the CPU in oprofile James Yu
2013-12-13 23:01 ` Stephen Hemminger
2013-12-16 23:35   ` James Yu
2013-12-16 23:58     ` Stephen Hemminger
  -- strict thread matches above, loose matches on Subject: below --
2013-12-13 22:02 James Yu
2013-12-13 21:59 James Yu
