* [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
@ 2020-05-06  5:14 Pavel Vajarov
  2020-05-06 14:54 ` Stephen Hemminger
  2020-05-26 16:50 ` Vincent Li
  0 siblings, 2 replies; 12+ messages in thread
From: Pavel Vajarov @ 2020-05-06  5:14 UTC (permalink / raw)
  To: users
Hi there,
We are trying to compare the performance of a DPDK+FreeBSD networking stack
against the standard Linux kernel, and we are having trouble finding out why
the former is slower. The details are below.
There is a project called F-Stack <https://github.com/F-Stack/f-stack>.
It glues the networking stack from FreeBSD 11.01 on top of DPDK. We made a
setup to test the performance of a transparent TCP proxy based on F-Stack and
another one running on the standard Linux kernel.
We ran the tests on KVM with 2 cores (Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz)
and 32GB RAM. A 10Gbps NIC was attached in passthrough mode.
The application-level code, the part that handles epoll notifications and
memcpys data between the sockets, is 100% the same in both proxy applications.
Both proxy applications are single-threaded, and in all tests we pinned the
application to core 1. For the test with the standard Linux application, the
interrupts from the network card were pinned to the same core 1.
Here are the test results:
1. The Linux-based proxy was able to handle about 1.7-1.8 Gbps before it
started to throttle the traffic. No visible CPU usage was observed on core 0
during the tests; only core 1, where the application and the IRQs were pinned,
took the load.
2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it started to
throttle the traffic. No visible CPU usage was observed on core 0 during the
tests; only core 1, where the application was pinned, took the load. In some of
the later tests I changed the number of packets read from the network card in
one call and the number of events handled in one call to epoll. With these
changes I was able to increase the throughput to 900-1000 Mbps, but couldn't
increase it further.
3. We did another test with the DPDK+FreeBSD proxy just to get more
information about the problem. We disabled the TCP proxy functionality and let
the packets simply be IP-forwarded by the FreeBSD stack. In this test we
reached up to 5 Gbps without the traffic being throttled. We just don't have
more traffic to redirect there at the moment. So the bottleneck seems to be
either in the upper layers of the network stack or in the application code.
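
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of how many CPU cycles a single 2.3 GHz core has available per frame at each reported rate. The 1500-byte frame size is an assumption (the post does not state packet sizes, and with TSO the stack may see much larger segments), so treat the result as an order-of-magnitude estimate only.

```python
def cycles_per_packet(throughput_bps: float, frame_bytes: int, cpu_hz: float) -> float:
    """Average CPU cycles one core can spend per frame at a given load."""
    packets_per_second = throughput_bps / (frame_bytes * 8)
    return cpu_hz / packets_per_second


CPU_HZ = 2.3e9       # from the post: Xeon Gold 6139 @ 2.30GHz, one core in use
FRAME_BYTES = 1500   # ASSUMPTION: frame size is not given in the post

for label, bps in [("DPDK+FreeBSD proxy", 800e6), ("Linux proxy", 1.8e9)]:
    budget = cycles_per_packet(bps, FRAME_BYTES, CPU_HZ)
    print(f"{label}: {budget:,.0f} cycles per frame")
```

Under that assumption the budget is in the tens of thousands of cycles per frame in both cases, which suggests looking at per-packet and per-event overhead in the upper layers rather than at raw NIC I/O.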
There is a Huawei switch which redirects the traffic to this server. It
regularly sends an arping, and if the server doesn't respond it stops the
redirection. So we assumed that when the redirection stops, it is because the
server throttles the traffic, drops packets, and can't respond to the arping
because of the packet drops.
The whole application can be very roughly represented in the following way:
 - Write pending outgoing packets to the network card
 - Read incoming packets from the network card
 - Push the incoming packets to the FreeBSD stack
 - Call epoll_wait/kevent without waiting
 - Handle the events
 - Loop from the beginning
According to the performance profiling that we did, aside from packet
processing, about 25-30% of the application time seems to be spent in
epoll_wait/kevent, even though the `timeout` parameter of this call is set to 0,
i.e. it shouldn't block waiting for events if there are none.
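
The six steps above can be sketched as a small model. This is not F-Stack code: `FakeStack`, `RX_BURST` and `MAX_EVENTS` are hypothetical names, the NIC and the TCP stack are replaced by in-memory queues, and every packet is assumed to produce exactly one readable event. It only illustrates where the two tuning knobs mentioned in the post (packets read per call, events handled per poll) sit in the loop.

```python
from collections import deque

RX_BURST = 32      # hypothetical knob: max packets read from the NIC per iteration
MAX_EVENTS = 64    # hypothetical knob: max events handled per poll call


class FakeStack:
    """Stand-in for NIC + TCP stack; each packet pushed in becomes one event."""

    def __init__(self, incoming):
        self.nic_rx = deque(incoming)   # packets waiting on the "NIC"
        self.nic_tx = []                # packets the "NIC" has sent
        self.pending_tx = deque()       # packets queued for sending
        self.ready = deque()            # events waiting to be reported

    def tx_flush(self):
        while self.pending_tx:
            self.nic_tx.append(self.pending_tx.popleft())

    def rx_burst(self, limit):
        count = min(limit, len(self.nic_rx))
        return [self.nic_rx.popleft() for _ in range(count)]

    def input(self, packets):
        self.ready.extend(packets)

    def poll(self, max_events):
        """Non-blocking poll (the timeout=0 case): return only what is ready now."""
        count = min(max_events, len(self.ready))
        return [self.ready.popleft() for _ in range(count)]


def run_loop(stack, rx_burst=RX_BURST, max_events=MAX_EVENTS):
    """Run the loop until no work is left; return the number of iterations."""
    iterations = 0
    while stack.nic_rx or stack.ready or stack.pending_tx:
        stack.tx_flush()                      # 1. write pending outgoing packets
        packets = stack.rx_burst(rx_burst)    # 2. read incoming packets
        stack.input(packets)                  # 3. push them to the stack
        events = stack.poll(max_events)       # 4. poll without waiting
        for payload in events:                # 5. handle events (proxy: forward payload)
            stack.pending_tx.append(payload)
        iterations += 1                       # 6. loop from the beginning
    return iterations


print(run_loop(FakeStack(range(100)), rx_burst=8))
print(run_loop(FakeStack(range(100)), rx_burst=32))
```

In this model a larger burst means fewer iterations, and therefore fewer poll calls, for the same amount of traffic. That is the same direction as the improvement reported above, but the model says nothing about the real cost of each call.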
I can give you many more details and code for everything, if needed.
My questions are:
1. Does anybody have observations or educated guesses about how much traffic I
should expect DPDK + FreeBSD stack + kevent to process in the above scenario?
Are the numbers low or expected? We expected to see better performance than
with the standard Linux kernel, but so far we can't get it.
2. Do you think the difference comes from how the time is split between
handling packets and handling epoll in the two tests? What I mean is this: in
the standard Linux test, interrupt handling has higher priority than epoll
handling, so the application can spend much more time handling packets and
processing them in the kernel than handling epoll events in user space. In the
DPDK+FreeBSD case, the time for handling packets and the time for processing
epoll events are roughly equal. I think this is why we were able to get more
performance by increasing the number of packets read in one go and decreasing
the number of epoll events. However, we couldn't increase the throughput enough
with these tweaks.
3. Can you suggest something else that we can test/measure/profile to get a
better idea of what exactly is happening here and to improve the performance
further?
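
One way to check the time split asked about in question 2 is to accumulate time per phase of the main loop. The sketch below is a generic illustration, not part of F-Stack or DPDK: `PhaseTimer` and the phase names are hypothetical, and the loop bodies are placeholders for the real calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates elapsed time per named phase of a loop."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock
        self.total_ns = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            self.total_ns[name] += self._clock() - start

    def shares(self):
        """Fraction of the measured time spent in each phase."""
        grand_total = sum(self.total_ns.values())
        if grand_total == 0:
            return {}
        return {name: ns / grand_total for name, ns in self.total_ns.items()}


timer = PhaseTimer()
for _ in range(1000):
    with timer.phase("rx+stack"):
        sum(range(200))   # placeholder for reading packets and feeding the stack
    with timer.phase("poll"):
        sum(range(50))    # placeholder for the zero-timeout epoll_wait/kevent call
    with timer.phase("handle"):
        sum(range(100))   # placeholder for the application's event handlers

for name, share in sorted(timer.shares().items()):
    print(f"{name}: {share:.1%}")
```

In a C application the same idea would normally use a cheap cycle counter around each phase. Comparing the resulting shares between small and large burst sizes would show whether the poll call itself or the event handlers dominate.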
Any help is appreciated!
Thanks in advance,
Pavel.
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-06  5:14 [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK Pavel Vajarov
@ 2020-05-06 14:54 ` Stephen Hemminger
  2020-05-07 10:47   ` Pavel Vajarov
  2020-05-26 16:50 ` Vincent Li
  1 sibling, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2020-05-06 14:54 UTC (permalink / raw)
  To: Pavel Vajarov; +Cc: users
First off, if you are testing on KVM, are you using PCI passthrough or SR-IOV
to make the device available to the guest directly? The default mode uses
a Linux bridge, and this results in multiple copies and context switches.
You end up testing Linux bridge and virtio performance, not TCP.
To get full speed with TCP and most software stacks you need TCP segmentation
offload.
Also, the software queue discipline, kernel version, and TCP congestion control
can play a big role in your result.
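
For the Linux side of the comparison, the settings named above can be collected with a short script. This is a sketch under assumptions: the helper names are hypothetical, the sysctl files are the standard Linux `/proc/sys` entries, and the TSO check only parses text in the format printed by `ethtool -k <iface>`. None of this inspects the DPDK/F-Stack side, which has its own configuration.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return a sysctl value such as 'net.ipv4.tcp_congestion_control', or None."""
    path = Path(root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def tso_enabled(ethtool_k_output):
    """Parse `ethtool -k <iface>` text: True/False for TSO, None if not listed."""
    for line in ethtool_k_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "tcp-segmentation-offload":
            words = value.split()          # e.g. ["on"] or ["off", "[fixed]"]
            return (words[0] == "on") if words else None
    return None


print("congestion control:", read_sysctl("net.ipv4.tcp_congestion_control"))
print("default qdisc:", read_sysctl("net.core.default_qdisc"))
```

Recording these values alongside each benchmark run makes it easier to rule them out as the source of a difference.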
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-06 14:54 ` Stephen Hemminger
@ 2020-05-07 10:47   ` Pavel Vajarov
  2020-05-07 14:09     ` dave seddon
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Vajarov @ 2020-05-07 10:47 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: users
Hi,
Thanks for the response.
We did the tests on Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-96-generic x86_64).
The NIC was given to the guest using SR-IOV.
The TCP segmentation offload was enabled for both tests (standard Linux and
DPDK+FreeBSD).
The congestion control algorithm for both tests was 'cubic'.
What do you mean by 'software queue discipline'?
Regards,
Pavel.
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-07 10:47   ` Pavel Vajarov
@ 2020-05-07 14:09     ` dave seddon
  2020-05-07 20:31       ` Stephen Hemminger
  2020-05-20 19:43       ` Vincent Li
  0 siblings, 2 replies; 12+ messages in thread
From: dave seddon @ 2020-05-07 14:09 UTC (permalink / raw)
  To: Pavel Vajarov; +Cc: Stephen Hemminger, users
tc qdisc
https://linux.die.net/man/8/tc
-- 
Regards,
Dave Seddon
+1 415 857 5102
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-07 14:09     ` dave seddon
@ 2020-05-07 20:31       ` Stephen Hemminger
  2020-05-08  5:03         ` Pavel Vajarov
  2020-05-20 19:43       ` Vincent Li
  1 sibling, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2020-05-07 20:31 UTC (permalink / raw)
  To: dave seddon; +Cc: Pavel Vajarov, users
> > What do you mean by 'software queue discipline'?
The default qdisc in Ubuntu should be fq_codel (see `tc qdisc show`),
and that in general has a positive effect on reducing bufferbloat.
F-Stack probably doesn't use TSO; you might want to look at the TCP stack
from FD.io for comparison.
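
A small helper for checking which qdisc is attached to each interface, as a sketch: it runs `tc qdisc show` and extracts the qdisc kind per device. The function names are hypothetical and the parser assumes the usual `qdisc <kind> <handle> dev <name> ...` line shape. Note that this only describes the kernel data path; a NIC driven by a DPDK poll-mode driver bypasses the kernel qdisc layer.

```python
import subprocess


def parse_qdiscs(tc_output):
    """Map device name -> list of qdisc kinds from `tc qdisc show` text."""
    qdiscs = {}
    for line in tc_output.splitlines():
        words = line.split()
        # Expected shape: "qdisc <kind> <handle> dev <name> ..."
        if len(words) < 5 or words[0] != "qdisc" or "dev" not in words:
            continue
        dev_index = words.index("dev")
        if dev_index + 1 < len(words):
            qdiscs.setdefault(words[dev_index + 1], []).append(words[1])
    return qdiscs


def current_qdiscs():
    """Run `tc qdisc show`; return {} if tc is unavailable or fails."""
    try:
        result = subprocess.run(
            ["tc", "qdisc", "show"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return {}
    return parse_qdiscs(result.stdout)


print(current_qdiscs())
```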
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-07 20:31       ` Stephen Hemminger
@ 2020-05-08  5:03         ` Pavel Vajarov
  0 siblings, 0 replies; 12+ messages in thread
From: Pavel Vajarov @ 2020-05-08  5:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dave seddon, users
Thanks for the response.
F-Stack has a TSO option in its config file, which we turned on for the tests.
I'll check FD.io.
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-07 14:09     ` dave seddon
  2020-05-07 20:31       ` Stephen Hemminger
@ 2020-05-20 19:43       ` Vincent Li
  2020-05-21  8:09         ` Pavel Vajarov
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Li @ 2020-05-20 19:43 UTC (permalink / raw)
  To: dave seddon; +Cc: Pavel Vajarov, Stephen Hemminger, users
On Thu, 7 May 2020, dave seddon wrote:
> > We did the tests on Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-96-generic
> > x86_64).
> > The NIC was given to the guest using SR-IOV.
I am curious how you got F-Stack working in a KVM VM with SR-IOV. The
author of F-Stack mentioned that F-Stack has not been tested to work in a
KVM VM. I filed an issue about it here:
https://github.com/F-Stack/f-stack/issues/489
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-20 19:43       ` Vincent Li
@ 2020-05-21  8:09         ` Pavel Vajarov
  2020-05-21 16:31           ` Vincent Li
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Vajarov @ 2020-05-21  8:09 UTC (permalink / raw)
  To: Vincent Li; +Cc: dave seddon, Stephen Hemminger, users
>
> I am curious how you get F-Stack working with KVM VM with SR-IOV. The
> author of F-Stack mentioned F-Stack not tested working for KVM VM. I had
> an issue report here: https://github.com/F-Stack/f-stack/issues/489
Hi there,
I asked our admin, who set up the KVM for the tests.
He said that this parameter was passed to QEMU:
 -device vfio-pci,host=af:00.2,id=hostdev0,bus=pci.0,addr=0x9
This is how the device configuration looks in the XML:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0xaf' slot='0x00' function='0x2'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</hostdev>
Here is how the devices are seen from the guest machine:
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device    <----- This device is used for SSH access to the server
00:09.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)   <------ This is the 10Gbps card used for the tests.
Network devices using DPDK-compatible driver
============================================
0000:00:09.0 'Ethernet Controller X710 for 10GbE SFP+ 1572' drv=igb_uio unused=i40e
Network devices using kernel driver
===================================
0000:00:03.0 'Virtio network device 1000' if=ens3 drv=virtio-pci unused=igb_uio *Active*
Hope that helps.
Regards,
Pavel.
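
The two device listings above look like the output of DPDK's dpdk-devbind.py --status (an assumption; the message does not name the tool). The following is a minimal Python sketch for extracting the driver binding from such lines. The function names, the field names, and the list of drivers treated as DPDK-compatible are illustrative choices, not part of the tool.

```python
import re

# Matches lines such as:
#   0000:00:09.0 'Ethernet Controller X710 for 10GbE SFP+ 1572' drv=igb_uio unused=i40e
DEVBIND_RE = re.compile(
    r"^(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"'(?P<desc>[^']*)'\s*(?P<rest>.*)$",
    re.IGNORECASE,
)


def parse_devbind_line(line):
    """Parse one status line; return None for headers and separators."""
    match = DEVBIND_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest").split()
    fields = dict(tok.split("=", 1) for tok in rest if "=" in tok)
    return {
        "addr": match.group("addr"),
        # The last word of the quoted description is the PCI device ID.
        "device_id": match.group("desc").rsplit(" ", 1)[-1],
        "driver": fields.get("drv"),
        "unused": fields.get("unused"),
        "iface": fields.get("if"),
        "active": "*Active*" in rest,
    }


def dpdk_bound(lines, dpdk_drivers=("igb_uio", "vfio-pci", "uio_pci_generic")):
    """Return the PCI addresses currently bound to a userspace I/O driver."""
    parsed = (parse_devbind_line(line) for line in lines)
    return [dev["addr"] for dev in parsed if dev and dev["driver"] in dpdk_drivers]
```

Fed the lines from this message, dpdk_bound() would report only 0000:00:09.0, the X710 bound to igb_uio, while the virtio device stays with its kernel driver.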
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-21  8:09         ` Pavel Vajarov
@ 2020-05-21 16:31           ` Vincent Li
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Li @ 2020-05-21 16:31 UTC (permalink / raw)
  To: Pavel Vajarov; +Cc: Vincent Li, dave seddon, Stephen Hemminger, users
On Thu, 21 May 2020, Pavel Vajarov wrote:
>       I am curious how you get F-Stack working with KVM VM with SR-IOV. The
>       author of F-Stack mentioned F-Stack not tested working for KVM VM. I had
>       an issue report here: https://github.com/F-Stack/f-stack/issues/489
> 
> 
> Hi there,
> 
> I asked our admin who setup the KVM for the tests.
> He said that this parameter has been given to the QEMU:
>  -device vfio-pci,host=af:00.2,id=hostdev0,bus=pci.0,addr=0x9
> 
> This is how the device configuration looks in the XML:
> <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0xaf' slot='0x00' function='0x2'/>
>       </source>
>       <alias name='hostdev0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
>     </hostdev>
> 
> Here is how the things are seen from the guest machine:
> 00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device    <----- This device is used for SSH access to the server
> 00:09.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)   <------ This is the 10Gpbs card used for the tests.
This seems to be passing the PF to the guest, not an SR-IOV VF. I wonder
what happens if you use SR-IOV and pass a VF to the guest; you may run into
the same issue as I did.
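
One way to tell a PF from a VF is to look at the device's sysfs directory on the host. The following is a minimal Python sketch, assuming the usual Linux sysfs layout in which a VF has a physfn link to its parent and an SR-IOV capable PF has a sriov_totalvfs attribute. The function name and return labels are hypothetical. The check belongs on the host (for example for 0000:af:00.2 from the QEMU line above), because inside the guest a passed-through device is not expected to show these entries.

```python
import os


def classify_pci_function(addr, sysfs_root="/sys/bus/pci/devices"):
    """Classify a PCI function as 'vf', 'pf', 'plain' or 'missing'.

    addr is a full PCI address such as '0000:af:00.2'.
    """
    dev_dir = os.path.join(sysfs_root, addr)
    if not os.path.isdir(dev_dir):
        return "missing"
    # A virtual function links back to the physical function that owns it.
    if os.path.lexists(os.path.join(dev_dir, "physfn")):
        return "vf"
    # A physical function with the SR-IOV capability exposes its VF limit.
    if os.path.exists(os.path.join(dev_dir, "sriov_totalvfs")):
        return "pf"
    return "plain"
```

The sysfs_root parameter exists only so the function can be pointed at a test directory instead of the real sysfs tree.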
> 
> Network devices using DPDK-compatible driver
> ============================================
> 0000:00:09.0 'Ethernet Controller X710 for 10GbE SFP+ 1572' drv=igb_uio unused=i40e
> 
> Network devices using kernel driver
> ===================================
> 0000:00:03.0 'Virtio network device 1000' if=ens3 drv=virtio-pci unused=igb_uio *Active*
> 
> Hope that helps.
> 
> Regards,
> Pavel.
> 
> 
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-06  5:14 [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK Pavel Vajarov
  2020-05-06 14:54 ` Stephen Hemminger
@ 2020-05-26 16:50 ` Vincent Li
  2020-05-27  5:11   ` Pavel Vajarov
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Li @ 2020-05-26 16:50 UTC (permalink / raw)
  To: Pavel Vajarov; +Cc: users
On Wed, 6 May 2020, Pavel Vajarov wrote:
> Hi there,
> 
> We are trying to compare the performance of DPDK+FreeBSD networking stack
> vs standard Linux kernel and we have problems finding out why the former is
> slower. The details are below.
> 
> There is a project called F-Stack <https://github.com/F-Stack/f-stack>.
> It glues the networking stack from
> FreeBSD 11.01 over DPDK. We made a setup to test the performance of
> transparent
> TCP proxy based on F-Stack and another one running on Standard Linux
> kernel.
I assume you wrote your own TCP proxy based on the F-Stack library?
> 
> Here are the test results:
> 1. The Linux based proxy was able to handle about 1.7-1.8 Gbps before it
> started to throttle the traffic. No visible CPU usage was observed on core
> 0 during the tests, only core 1, where the application and the IRQs were
> pinned, took the load.
> 2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it
> started to throttle the traffic. No visible CPU usage was observed on core
> 0 during the tests, only core 1, where the application was pinned, took the
> load. In some of the later tests I did some changes to the number of read
> packets in one call from the network card and the number of handled events
> in one call to epoll. With these changes I was able to increase the
> throughput to 900-1000 Mbps but couldn't increase it more.
> 3. We did another test with the DPDK+FreeBSD proxy just to give us some
> more info about the problem. We disabled the TCP proxy functionality and
> let the packets be simply IP forwarded by the FreeBSD stack. In this test
> we reached up to 5Gbps without being able to throttle the traffic. We just
> don't have more traffic to redirect there at the moment. So the bottleneck
> seems to be either in the upper level of the network stack or in the
> application code.
> 
I once tested the F-Stack port of Nginx as a TCP proxy and could achieve
above 6Gbps with iperf. After seeing your email, I set up PCI passthrough
to a KVM VM and ran F-Stack Nginx as a web server under an HTTP load test
(no proxy), and could achieve about 6.5Gbps.
> There is a huawei switch which redirects the traffic to this server. It
> regularly
> sends arping and if the server doesn't respond it stops the redirection.
> So we assumed that when the redirection stops it's because the server
> throttles the traffic and drops packets and can't respond to the arping
> because
> of the packets drop.
I did have a weird ARP issue with F-Stack: I manually added a static ARP
entry for the F-Stack interface for each F-Stack process. I am not sure
whether it is related to your arping problem; see
https://github.com/F-Stack/f-stack/issues/515
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-26 16:50 ` Vincent Li
@ 2020-05-27  5:11   ` Pavel Vajarov
  2020-05-27 16:44     ` Vincent Li
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Vajarov @ 2020-05-27  5:11 UTC (permalink / raw)
  To: Vincent Li; +Cc: users
>
> > Hi there,
> >
> > We are trying to compare the performance of DPDK+FreeBSD networking stack
> > vs standard Linux kernel and we have problems finding out why the former
> > is slower. The details are below.
> >
> > There is a project called F-Stack <https://github.com/F-Stack/f-stack>.
> > It glues the networking stack from
> > FreeBSD 11.01 over DPDK. We made a setup to test the performance of
> > transparent
> > TCP proxy based on F-Stack and another one running on Standard Linux
> > kernel.
>
> I assume you wrote your own TCP proxy based on F-Stack library?
>
Yes, I wrote a transparent TCP proxy based on the F-Stack library for the
tests.
We have our transparent caching proxy running on Linux, and we are now
trying to find ways to improve its performance and reduce its hardware
requirements.
> >
> > Here are the test results:
> > 1. The Linux based proxy was able to handle about 1.7-1.8 Gbps before it
> > started to throttle the traffic. No visible CPU usage was observed on core
> > 0 during the tests, only core 1, where the application and the IRQs were
> > pinned, took the load.
> > 2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it
> > started to throttle the traffic. No visible CPU usage was observed on core
> > 0 during the tests, only core 1, where the application was pinned, took the
> > load. In some of the later tests I did some changes to the number of read
> > packets in one call from the network card and the number of handled events
> > in one call to epoll. With these changes I was able to increase the
> > throughput to 900-1000 Mbps but couldn't increase it more.
> > 3. We did another test with the DPDK+FreeBSD proxy just to give us some
> > more info about the problem. We disabled the TCP proxy functionality and
> > let the packets be simply IP forwarded by the FreeBSD stack. In this test
> > we reached up to 5Gbps without being able to throttle the traffic. We just
> > don't have more traffic to redirect there at the moment. So the bottleneck
> > seems to be either in the upper level of the network stack or in the
> > application code.
> >
>
> I once tested F-Stack ported Nginx and used Nginx TCP proxy, I could
> achieve above 6Gbps with iperf. After seeing your email, I setup PCI
> passthrough to KVM VM and ran F-Stack Nginx as webserver
> with http load test, no proxy, I could  achieve about 6.5Gbps
>
Can I ask how many cores you ran Nginx on?
The results from our tests are from a single core. We are trying to reach
maximum performance on a single core because we know that the F-Stack
solution scales linearly. We tested it on 3 cores and got around 3 Gbps,
which is 3 times the single-core result.
Also, we test with traffic from one internet service provider. We just
redirect a few IP pools to the test machine for the duration of the tests,
see at which point the proxy starts choking the traffic, and then switch the
traffic back.
> > There is a Huawei switch which redirects the traffic to this server. It
> > regularly sends arping and if the server doesn't respond it stops the
> > redirection. So we assumed that when the redirection stops it's because the
> > server throttles the traffic and drops packets and can't respond to the
> > arping because of the packets drop.
>
> I did have some weird issue with ARPing of F-Stack, I manually added
> static ARP for F-Stack interface for each F-Stack process, not sure if it
> is related to your ARPing, see
> https://github.com/F-Stack/f-stack/issues/515
>
Hmm, I missed that. Thanks a lot; it may help with the tests and with the
next stage.
* Re: [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK.
  2020-05-27  5:11   ` Pavel Vajarov
@ 2020-05-27 16:44     ` Vincent Li
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Li @ 2020-05-27 16:44 UTC (permalink / raw)
  To: Pavel Vajarov; +Cc: Vincent Li, users
On Wed, 27 May 2020, Pavel Vajarov wrote:
>       > Hi there,
>       >
>       > We are trying to compare the performance of DPDK+FreeBSD networking stack
>       > vs standard Linux kernel and we have problems finding out why the former is
>       > slower. The details are below.
>       >
>       > There is a project called F-Stack <https://github.com/F-Stack/f-stack>.
>       > It glues the networking stack from
>       > FreeBSD 11.01 over DPDK. We made a setup to test the performance of
>       > transparent
>       > TCP proxy based on F-Stack and another one running on Standard Linux
>       > kernel.
> 
>       I assume you wrote your own TCP proxy based on F-Stack library?
> 
> 
> Yes, I wrote transparent TCP proxy based on the F-Stack library for the tests.
> The thing is that we have our transparent caching proxy running on Linux and
> now we try to find a ways to improve its performance and hardware requirements.
>  
>       >
>       > Here are the test results:
>       > 1. The Linux based proxy was able to handle about 1.7-1.8 Gbps before it
>       > started to throttle the traffic. No visible CPU usage was observed on core
>       > 0 during the tests, only core 1, where the application and the IRQs were
>       > pinned, took the load.
>       > 2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it
>       > started to throttle the traffic. No visible CPU usage was observed on core
>       > 0 during the tests, only core 1, where the application was pinned, took the
>       > load. In some of the later tests I did some changes to the number of read
>       > packets in one call from the network card and the number of handled events
>       > in one call to epoll. With these changes I was able to increase the
>       > throughput to 900-1000 Mbps but couldn't increase it more.
>       > 3. We did another test with the DPDK+FreeBSD proxy just to give us some
>       > more info about the problem. We disabled the TCP proxy functionality and
>       > let the packets be simply IP forwarded by the FreeBSD stack. In this test
>       > we reached up to 5Gbps without being able to throttle the traffic. We just
>       > don't have more traffic to redirect there at the moment. So the bottleneck
>       > seems to be either in the upper level of the network stack or in the
>       > application code.
>       >
> 
>       I once tested F-Stack ported Nginx and used Nginx TCP proxy, I could
>       achieve above 6Gbps with iperf. After seeing your email, I setup PCI
>       passthrough to KVM VM and ran F-Stack Nginx as webserver
>       with http load test, no proxy, I could  achieve about 6.5Gbps
> 
> Can I ask on how many cores you run the Nginx?
I used 4 cores on the VM
 
> The results from our tests are from a single core. We are trying to reach
> max performance on a single core because we know that the F-Stack solution
> has linear scalability. We tested it on 3 cores and got around 3 Gbps, which
> is 3 times the result on a single core.
> Also we test with traffic from one internet service provider. We just
> redirect a few IP pools to the test machine for the duration of the tests and see
> at which point the proxy will start choking the traffic, and then switch the traffic back.
I used the mTCP port of Apache Bench for the load test. The F-Stack and
Apache Bench machines are directly connected with a cable, and running a
capture on either the mTCP or the F-Stack side would affect performance, so
I do not have a capture to show whether there were significant packet drops
at 6.5Gbps.
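
For a rough comparison of the figures quoted in this thread, throughput can be normalized per core. The following is a minimal Python sketch with hypothetical helper names; the inputs are the numbers reported above (about 1 Gbps on one core and about 3 Gbps on 3 cores for the proxy, about 6.5Gbps on 4 cores for the Nginx web server test). The workloads and traffic sources differ, so this is only an order-of-magnitude check, not a like-for-like benchmark.

```python
def per_core_gbps(total_gbps, cores):
    """Average throughput per core for a multi-core run."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_gbps / cores


def scaling_efficiency(single_core_gbps, total_gbps, cores):
    """Ratio of measured throughput to ideal linear scaling (1.0 = linear)."""
    return total_gbps / (single_core_gbps * cores)


# Figures as reported in the thread (rounded by their authors).
proxy_per_core = per_core_gbps(3.0, 3)             # 1.0 Gbps per core
nginx_per_core = per_core_gbps(6.5, 4)             # 1.625 Gbps per core
proxy_scaling = scaling_efficiency(1.0, 3.0, 3)    # 1.0, i.e. linear
```

On these numbers the web server test reaches roughly 1.6 Gbps per core against roughly 1 Gbps per core for the proxy, which is the same order of magnitude.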
> 
>       > There is a huawei switch which redirects the traffic to this server. It
>       > regularly
>       > sends arping and if the server doesn't respond it stops the redirection.
>       > So we assumed that when the redirection stops it's because the server
>       > throttles the traffic and drops packets and can't respond to the arping
>       > because
>       > of the packets drop.
> 
>       I did have some weird issue with ARPing of F-Stack, I manually added
>       static ARP for F-Stack interface for each F-Stack process, not sure if it
>       is related to your ARPing, see https://github.com/F-Stack/f-stack/issues/515
> 
> Hmm, I've missed that. Thanks a lot for it because it may help for the tests and
> for the next stage.
> 
>  
> 
> 
Thread overview: 12+ messages
2020-05-06  5:14 [dpdk-users] Peformance troubleshouting of TCP/IP stack over DPDK Pavel Vajarov
2020-05-06 14:54 ` Stephen Hemminger
2020-05-07 10:47   ` Pavel Vajarov
2020-05-07 14:09     ` dave seddon
2020-05-07 20:31       ` Stephen Hemminger
2020-05-08  5:03         ` Pavel Vajarov
2020-05-20 19:43       ` Vincent Li
2020-05-21  8:09         ` Pavel Vajarov
2020-05-21 16:31           ` Vincent Li
2020-05-26 16:50 ` Vincent Li
2020-05-27  5:11   ` Pavel Vajarov
2020-05-27 16:44     ` Vincent Li