* [dpdk-users] What is TCP read performance by using DPDK?
From: Hao Chen
To: users
Date: 2021-03-23 23:06 UTC

Hi experts,

1. Did you use DPDK to implement a TCP server (read TCP data and then discard it)? If yes, what is the max performance you can achieve on an Intel X710 10Gbps NIC? (I am not using the VPP TCP stack.)

2. My feeling is that DPDK can only be used for layer 3. For layer 4, DPDK can only look at the TCP header. Once DPDK looks into the TCP payload, performance drops drastically.

3. If DPDK can only be used for layer 3, Linux eBPF is another choice:
https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/
(Unimog is the layer 4 load balancer for Cloudflare's edge data centers.)

Thanks and regards,
Hao Chen
* Re: [dpdk-users] What is TCP read performance by using DPDK?
From: Pavel Vazharov
To: Hao Chen; Cc: users
Date: 2021-03-29 8:38 UTC

Hi,

You can look at F-stack <https://github.com/F-Stack/f-stack/> or Seastar <http://seastar.io/> for some performance measurements. The first one uses the networking stack from FreeBSD 11, AFAIK. The second one has its own networking stack. Both of them work over DPDK. We also started with F-stack but later modified it to extract the DPDK code out of it, so that now we use a separate TCP stack per thread with a shared-nothing design.

I can share our performance measurements with real traffic from an ISP, but they are not very relevant because we work in proxy mode, not as a server. However, our experience is roughly the following:
- For TCP (port 80) traffic, the proxy over DPDK was about 1.5 times faster than the same proxy working over the standard Linux kernel. The application layer was the same in both tests - a straightforward TCP proxy of port 80 traffic.
- We observed a much bigger performance gain for UDP traffic, where in our use case the packets are more numerous but smaller. Here we tested with uTP (uTorrent Transport Protocol). We have our own stack for uTP traffic processing which works over UDP sockets, and again we ran the same application code once over standard Linux and once over DPDK. The DPDK version was able to handle, on 2 cores with about 0.7% packet loss, the same amount of traffic as the Linux version handled on 8 cores with the same packet loss. So here we observed about a 4x improvement.

However, take our measurements with a grain of salt. Although the application layer was the same, in the Linux case we do some packet inspection in the kernel to filter out asymmetric TCP traffic, to match uTP traffic, etc. This filtering and matching was done in the DPDK layer of the new proxy, so the code under test was not 100% the same - more like 95% the same. In addition, the UDP measurements were done on two separate machines with the same type of CPUs and NICs, but still they were two separate machines.

Regards,
Pavel.
* Re: [dpdk-users] What is TCP read performance by using DPDK?
From: Hao Chen
To: Pavel Vazharov; Cc: users
Date: 2021-04-15 5:59 UTC

Hi Pavel,

Thanks for your help.

1. You wrote "we work in proxy mode". Does it mean your code just looks at the IP header and TCP header without handling the TCP payload? In our use case, we need to decrypt TLS traffic (handle the full TCP payload).

2. You wrote "much bigger performance gain for UDP traffic where the packets are more but smaller in our use case". Does it mean the UDP payload size is NOT 1400 bytes (MTU size)? Is it as small as 64 bytes, for example? I looked at https://www.bittorrent.org/beps/bep_0029.html for the uTP protocol. Do you handle the uTP payload, or just "relay" it like a proxy?

3. Again, thanks for your response.

Thanks
* Re: [dpdk-users] What is TCP read performance by using DPDK?
From: Pavel Vazharov
To: Hao Chen; Cc: users
Date: 2021-04-15 6:57 UTC

Hi,

"Does it mean your code just looks at the IP header and TCP header without handling the TCP payload?"
The proxy works in the application layer. I mean, it works with regular BSD sockets. As I said, we use a modified version of F-stack (https://github.com/F-Stack/f-stack) for this. Basically our version is very close to the original libuinet (https://github.com/pkelsey/libuinet) but based on a newer version of the FreeBSD networking stack (FreeBSD 11). Here is a rough description of how it works:
1. Every thread of our application reads packets in bursts from a single RX queue using the DPDK API.
2. These packets are then passed/injected into the FreeBSD/F-stack networking stack. We use a separate networking stack per thread.
3. The networking stack processes the packets, queueing them in the receive buffers of the TCP sockets. These are regular sockets.
4. Every application thread also regularly calls an epoll_wait API provided by the F-stack library. It's just a wrapper over the kevent API provided by FreeBSD.
5. The application gets the read/write events from epoll_wait and reads/writes to the corresponding sockets. Again, this is done exactly like in a regular Linux application where you read/write data from/to sockets.
6. Our test proxy application used sockets in pairs, and all data read from a given TCP socket was written to the corresponding TCP socket in the other direction.
7. The data written to a given socket is put in the send buffers of that socket and eventually sent out via the given TX queue using the DPDK API. This happens via a callback that's provided to F-stack. The callback is called for every single packet that needs to be sent out by F-stack, and our application implements this callback using the DPDK functionality. In our design the F-stack/FreeBSD stack doesn't know about DPDK; it can work with a different packet processing framework.

"Does it mean the UDP payload size is NOT 1400 bytes (MTU size)? Is it as small as 64 bytes, for example?"
My personal observation is that for the same amount of traffic, uTP generates many more packets per second than the corresponding HTTP traffic running over TCP. These are the two tests that we did. I can't provide numbers about this at the moment, but there are usually lots of packets smaller than the MTU size. I think they come from things like the protocol's internal ACK packets, which seem to be sent more frequently than in TCP. Also the request, cancel, have, etc. messages from the BitTorrent protocol are most of the time sent in smaller packets.

"Do you handle the uTP payload, or just 'relay' it like a proxy?"
Our proxies always work with sockets. We have application business logic built over the socket layer. For the test case we just proxied the data between pairs of uTP sockets, in the same way we did for the TCP proxy above. We have an implementation of the uTP protocol which provides a socket API similar to the BSD socket API, with read/write/shutdown/close/etc. functions. As you may have read, the uTP protocol is, kind of, a simplified version of TCP, but more suitable for the needs of BitTorrent traffic. So it is a reliable protocol, and this means there is a need for socket buffers. Our implementation is built over the UDP sockets provided by F-stack. The data is read from the UDP sockets and put into the buffers of the corresponding uTP socket. If contiguous data is collected in the buffers, the implementation fires a notification to the application layer. The write direction works the opposite way: the data from the application is first written to the buffers of the uTP socket and then later sent via the internal UDP socket from F-stack.

So, to summarize the above: we handle the TCP/UDP payload using the regular BSD socket API provided by the F-stack library and our uTP stack library. For the test we just relayed the data between a few thousand pairs of sockets. Currently we do much more complex manipulation of this data, but that is still work in progress and the final performance is not yet tested.

Hope the above explanations help.
Pavel.
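[Editor's note: the socket-pair relay pattern described in steps 4-6 above can be modeled outside DPDK/F-stack. Below is a small Python sketch using the standard selectors module and socketpair() in place of F-stack's epoll wrapper and real TCP connections; all function names here are illustrative, not part of the F-stack API.]

```python
# Minimal model of the proxy pattern from steps 4-6: sockets are handled
# in pairs, and everything read from one side is written to its peer.
# Python's selectors module stands in for F-stack's epoll_wait wrapper;
# socketpair() stands in for real client/upstream TCP connections.
import selectors
import socket

def relay_once(sel, pairs):
    """Run one epoll_wait-style iteration: read from ready sockets and
    forward the data to the paired socket in the other direction."""
    forwarded = 0
    for key, _ in sel.select(timeout=0.1):
        src = key.fileobj
        data = src.recv(4096)
        if data:
            pairs[src].sendall(data)   # write to the peer socket
            forwarded += len(data)
    return forwarded

def demo():
    # a_out <-> a_in models the client side, b_out <-> b_in the upstream side
    a_out, a_in = socket.socketpair()
    b_out, b_in = socket.socketpair()
    sel = selectors.DefaultSelector()
    # The proxy owns a_in and b_in and relays between them.
    pairs = {a_in: b_in, b_in: a_in}
    for s in pairs:
        s.setblocking(False)
        sel.register(s, selectors.EVENT_READ)
    a_out.sendall(b"hello")            # "client" sends a request
    relay_once(sel, pairs)             # proxy forwards it upstream
    reply = b_out.recv(4096)           # "upstream" receives it
    for s in (a_out, a_in, b_out, b_in):
        s.close()
    sel.close()
    return reply

if __name__ == "__main__":
    print(demo())
```

In the real design described above, the event source is F-stack's kevent-backed epoll wrapper and the writes end up in socket send buffers drained by the DPDK TX callback, but the read-ready/forward-to-peer control flow is the same.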
* Re: [dpdk-users] What is TCP read performance by using DPDK?
From: Hao Chen
To: Pavel Vazharov; Cc: users
Date: 2021-04-15 16:03 UTC

Hello Pavel,

Thanks for your detailed explanation. Please bear with my verbose questions.

Based on your explanation, it looks like your application (running on layer 7) uses at least 3 threads:
(1) The first thread does the DPDK burst read (rte_eth_rx_burst()). After reading data from layer 2, it puts the packets into a queue.
(2) The second thread (layer 2) reads data from the queue and then uses F-stack to handle the TCP data, putting the (layer 4) data into the TCP socket buffers.
(3) The third thread uses epoll_wait() to read (layer 7) data from the TCP socket buffers and "forwards" it to the outgoing TCP socket for rte_eth_tx_burst().

Is my understanding right?

Thanks
* Re: [dpdk-users] What is TCP read performance by using DPDK?
From: Pavel Vazharov
To: Hao Chen; Cc: users
Date: 2021-04-16 6:00 UTC

Hi Hao,

The current design of the application, very roughly, is the following:

1. There is one main thread which pumps packets out of the NIC queues using rte_eth_rx_burst(), as you said. In the future we may need several main threads to be able to scale the application; each of them would work on a separate group of RX queues. The main thread distributes the received packets to N other threads using single-producer/single-consumer rings provided by DPDK (rte_ring).

2. Each of these N other threads runs a separate F-stack instance. As I said, we use a networking stack per thread and a shared-nothing design. Let's call them worker threads.

3. Each worker thread has its own SPSC ring for the incoming packets and uses a separate NIC queue to send the outgoing packets using rte_eth_tx_burst. The main loop of such a worker thread looks roughly like this (pseudo code):

    while (not stopped) {
        if (X_milliseconds_have_passed)
            call_fstack_tick_functionality();
        send_queued_fstack_packets(); // using rte_eth_tx_burst
        dequeue_incoming_packets_from_spsc_ring();
        enqueue_the_incoming_packets_to_the_fstack();
        if (Y_milliseconds_have_passed)
            process_fstack_socket_events_using_epoll();
    }

You may note the following things in the above code:
- The packets are sent (rte_eth_tx_burst) in the same thread where the socket events are processed. The outgoing packets are also sent if we queue enough of them while processing socket write events, but that would complicate the explanation here.
- The timer events and the socket events are not processed on every iteration of the loop. The milliseconds come from a config file, and the elapsed time is measured using rte_rdtsc.
- The loop is very similar to the one present in F-stack itself - https://github.com/F-Stack/f-stack/blob/dev/lib/ff_dpdk_if.c#L1817. It's just that in our case this loop is decoupled from F-stack, because we removed the DPDK code from F-stack in order to use the latter as a separate library and run a separate networking stack per thread.

4. The number of worker threads is configurable via the application config file, and the application sets up the NIC with the same number of RX/TX queues as worker threads. This way the main thread pumps packets out of N RX queues and each worker thread enqueues packets to its own TX queue, i.e. there is no sharing. So the application may run with a single RX/TX queue, and then it'll have one main thread and one worker thread; or it may run with 10 RX/TX queues, and then it'll have 1 main thread and 10 worker threads. It depends on the amount of traffic we expect to handle, the NIC capabilities, etc.

Regards,
Pavel.
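[Editor's note: the fan-out design described above - one main RX thread feeding per-worker single-producer/single-consumer rings, with each worker owning its own stack and TX path - can be sketched in Python. queue.SimpleQueue stands in for DPDK's SPSC rte_ring and the burst lists for rte_eth_rx_burst() results; all names are illustrative.]

```python
# Toy model of the described design: one main thread pulls "bursts" from
# a fake NIC and distributes each packet to one of N worker threads over
# a per-worker queue (standing in for DPDK's SPSC rte_ring). Each worker
# owns its queue, so nothing is shared between workers.
import queue
import threading

N_WORKERS = 4
STOP = object()  # sentinel telling a worker to exit its loop

def rx_bursts():
    """Fake rte_eth_rx_burst(): yield bursts of (flow_id, payload) packets."""
    for burst in range(8):
        yield [(pkt % N_WORKERS, f"pkt-{burst}-{pkt}") for pkt in range(32)]

def worker(ring, handled):
    """Worker loop: dequeue packets from its own ring and 'process' them,
    like a per-thread F-stack instance draining its SPSC ring."""
    while True:
        pkt = ring.get()
        if pkt is STOP:
            return
        handled.append(pkt)

def run():
    rings = [queue.SimpleQueue() for _ in range(N_WORKERS)]
    handled = [[] for _ in range(N_WORKERS)]
    threads = [threading.Thread(target=worker, args=(rings[i], handled[i]))
               for i in range(N_WORKERS)]
    for t in threads:
        t.start()
    # Main thread: steer each packet to a worker by flow id, so all packets
    # of one flow always reach the same per-thread stack.
    for burst in rx_bursts():
        for flow_id, payload in burst:
            rings[flow_id].put((flow_id, payload))
    for ring in rings:
        ring.put(STOP)
    for t in threads:
        t.join()
    return [len(h) for h in handled]

if __name__ == "__main__":
    print(run())  # each worker handled its own share of the packets
```

The real application additionally ticks the stack and polls socket events on timers inside each worker loop, as the pseudocode above shows; this sketch only models the distribution path.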