* [dpdk-dev] Appropriate DPDK data structures for TCP sockets
@ 2015-02-22 0:38 Matthew Hall
2015-02-23 0:02 ` Stephen Hemminger
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Hall @ 2015-02-22 0:38 UTC (permalink / raw)
To: <dev@dpdk.org>
Hello fellow stack hackers :) ,
I'm working on a simple server-side implementation of TCP on DPDK.
For this to work I need a good data structure to store some sockets.
The lookup key is like this:
struct ss_flow_key_s {
uint8_t sip[IPV6_ALEN];
uint8_t dip[IPV6_ALEN];
uint16_t sport;
uint16_t dport;
uint8_t protocol;
} __attribute__((packed));
typedef struct ss_flow_key_s ss_flow_key_t;
The socket itself is like this:
enum ss_tcp_state_e {
SS_TCP_CLOSED = 0, SS_TCP_LISTEN = 1, SS_TCP_SYN_TX = 2,
SS_TCP_SYN_RX = 3, SS_TCP_OPEN = 4, SS_TCP_UNKNOWN = -1,
};
typedef enum ss_tcp_state_e ss_tcp_state_t;
// RFC 793, RFC 1122
struct ss_tcp_socket_s {
ss_flow_key_t key;
uint32_t id;
rte_spinlock_t lock;
ss_tcp_state_t state;
uint32_t ticks_last;
uint16_t rx_buf_offset;
uint16_t mss;
uint64_t rx_failures;
uint8_t rx_data[L4_TCP_WINDOW_SIZE * 2];
} __rte_cache_aligned;
So far I was using rte_hash, but it's single writer multi reader, which is eventually going to need some more complicated locking and probably run kind of slow. Also, I need some timer functions to delete dead sockets and so forth, and rte_hash doesn't have any iteration API.
So then I was trying to figure out if I needed to use a linked list for the iteration or if there is some other API I should use instead like rte_table_*. However the documentation of rte_table is kind of confusing so I wasn't sure if that was the right choice either.
Any advice?
Thanks,
Matthew.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Appropriate DPDK data structures for TCP sockets
2015-02-22 0:38 [dpdk-dev] Appropriate DPDK data structures for TCP sockets Matthew Hall
@ 2015-02-23 0:02 ` Stephen Hemminger
2015-02-23 4:50 ` Matthew Hall
0 siblings, 1 reply; 7+ messages in thread
From: Stephen Hemminger @ 2015-02-23 0:02 UTC (permalink / raw)
To: Matthew Hall; +Cc: <dev@dpdk.org>
On Sat, 21 Feb 2015 16:38:45 -0800
Matthew Hall <mhall@mhcomputing.net> wrote:
> So far I was using rte_hash, but it's single writer multi reader, which is eventually going to need some more complicated locking and probably run kind of slow. Also, I need some timer functions to delete dead sockets and so forth, and rte_hash doesn't have any iteration API.
Use userspace RCU? or BSD RB_TREE
The existing rte_hash is too limited for many use cases.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Appropriate DPDK data structures for TCP sockets
2015-02-23 0:02 ` Stephen Hemminger
@ 2015-02-23 4:50 ` Matthew Hall
2015-02-23 14:48 ` Matt Laswell
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Hall @ 2015-02-23 4:50 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: <dev@dpdk.org>
On Feb 22, 2015, at 4:02 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> Use userspace RCU? or BSD RB_TREE
Thanks Stephen,
I think the RB_TREE stuff is single threaded mostly.
But user-space RCU looks quite good indeed, I didn't know somebody ported it out of the kernel. I'll check it out.
Matthew.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Appropriate DPDK data structures for TCP sockets
2015-02-23 4:50 ` Matthew Hall
@ 2015-02-23 14:48 ` Matt Laswell
2015-02-23 21:16 ` Matthew Hall
0 siblings, 1 reply; 7+ messages in thread
From: Matt Laswell @ 2015-02-23 14:48 UTC (permalink / raw)
To: Matthew Hall; +Cc: <dev@dpdk.org>
Hey Matthew,
I've mostly worked on stackless systems over the last few years, but I have
done a fair bit of work on high performance, highly scalable connection
tracking data structures. In that spirit, here are a few counterintuitive
insights I've gained over the years. Perhaps they'll be useful to you.
Apologies in advance for likely being a bit long-winded.
First, you really need to take cache performance into account when you're
choosing a data structure. Something like a balanced tree can seem awfully
appealing at first blush, either on its own or as a chaining mechanism for
a hash table. But the problem with trees is that there really isn't much
locality of reference in your memory use - every single step in your
descent ends up being a cache miss. This hurts you twice: once that you
end up stalled waiting for the next node in the tree to load from main
memory, and again when you have to reload whatever you pushed out of cache
to get it.
It's often better if, instead of a tree, you do linear search across arrays
of hash values. It's easy to size the array so that it is exactly one
cache line long, and you can generally do linear search of the whole thing
in less time than it takes to do a single cache line fill. If you find a
match, you can do full verification against the full tuple as needed.
Second, rather than synchronizing (perhaps with locks, perhaps with
lockless data structures), it's often beneficial to create multiple
threads, each of which holds a fraction of your connection tracking data.
Every connection belongs to a single one of these threads, selected perhaps
by hash or RSS value, and all packets from the connection go through that
single thread. This approach has a couple of advantages. First,
obviously, no slowdowns for synchronization. But, second, I've found that
when you are spreading packets from a single connection across many compute
elements, you're inevitably going to start putting packets out of order.
In many applications, this ultimately leads to some additional processing
to put things back in order, which gives away the performance gains you
achieved. Of course, this approach brings its own set of complexities, and
challenges for your application, and doesn't always spread the work as
efficiently across all of your cores. But it might be worth considering.
Third, it's very worthwhile to have a cache for the most recently accessed
connection. First, because network traffic is bursty, and you'll
frequently see multiple packets from the same connection in succession.
Second, because it can make life easier for your application code. If you
have multiple places that need to access connection data, you don't have to
worry so much about the cost of repeated searches. Again, this may or may
not matter for your particular application. But for ones I've worked on,
it's been a win.
Anyway, as predicted, this post has gone far too long for a Monday
morning. Regardless, I hope you found it useful. Let me know if you have
questions or comments.
--
Matt Laswell
infinite io, inc.
laswell@infiniteio.com
On Sun, Feb 22, 2015 at 10:50 PM, Matthew Hall <mhall@mhcomputing.net>
wrote:
>
> On Feb 22, 2015, at 4:02 PM, Stephen Hemminger <stephen@networkplumber.org>
> wrote:
> > Use userspace RCU? or BSD RB_TREE
>
> Thanks Stephen,
>
> I think the RB_TREE stuff is single threaded mostly.
>
> But user-space RCU looks quite good indeed, I didn't know somebody ported
> it out of the kernel. I'll check it out.
>
> Matthew.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Appropriate DPDK data structures for TCP sockets
2015-02-23 14:48 ` Matt Laswell
@ 2015-02-23 21:16 ` Matthew Hall
2015-02-23 21:51 ` Avi Kivity
0 siblings, 1 reply; 7+ messages in thread
From: Matthew Hall @ 2015-02-23 21:16 UTC (permalink / raw)
To: Matt Laswell; +Cc: <dev@dpdk.org>
On Mon, Feb 23, 2015 at 08:48:57AM -0600, Matt Laswell wrote:
> Apologies in advance for likely being a bit long-winded.
Long winded is great, helps me get context.
> First, you really need to take cache performance into account when you're
> choosing a data structure. Something like a balanced tree can seem awfully
> appealing at first blush
Agreed. I did some amount of DPDK stuff before but without TCP. This is why I
was figuring a packet-hash is better than a tree.
> Second, rather than synchronizing (perhaps with locks, perhaps with
> lockless data structures), it's often beneficial to create multiple
> threads, each of which holds a fraction of your connection tracking data.
Yes, I REALLY REALLY REALLY wanted to do RSS. But the virtio-net and other
VM's don't support RSS, unlike the classic PCIe NIC's. In order to get the
community to use my app I have to give them a "batteries included"
environment, where the system can still work even with no RSS.
> Third, it's very worthwhile to have a cache for the most recently accessed
> connection. First, because network traffic is bursty, and you'll
> frequently see multiple packets from the same connection in succession.
> Second, because it can make life easier for your application code. If you
> have multiple places that need to access connection data, you don't have to
> worry so much about the cost of repeated searches. Again, this may or may
> not matter for your particular application. But for ones I've worked on,
> it's been a win.
Yes, this sounds like a really good idea. One advantage in my product, I am
only doing TCP Syslog, so I don't have an arbitrary zillion connections like
FW or IPS would want. I could cap it at something like 8192 or 16384 and be
good enough for some time until a better solution is worked out.
I could make some capped array or linked list of the X most recent ones for
cheap access. It's just socket pointers so it doesn't hardly cost anything to
copy a couple pointers into a cache and quickly invalidate when the connection
closes.
> Anyway, as predicted, this post has gone far too long for a Monday
> morning. Regardless, I hope you found it useful.
This was great. Thank you!
Matthew.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Appropriate DPDK data structures for TCP sockets
2015-02-23 21:16 ` Matthew Hall
@ 2015-02-23 21:51 ` Avi Kivity
2015-03-13 6:41 ` Matthew Hall
0 siblings, 1 reply; 7+ messages in thread
From: Avi Kivity @ 2015-02-23 21:51 UTC (permalink / raw)
To: Matthew Hall, Matt Laswell; +Cc: <dev@dpdk.org>
On 02/23/2015 11:16 PM, Matthew Hall wrote:
> On Mon, Feb 23, 2015 at 08:48:57AM -0600, Matt Laswell wrote:
>> Apologies in advance for likely being a bit long-winded.
> Long winded is great, helps me get context.
>
>> First, you really need to take cache performance into account when you're
>> choosing a data structure. Something like a balanced tree can seem awfully
>> appealing at first blush
> Agreed. I did some amount of DPDK stuff before but without TCP. This is why I
> was figuring a packet-hash is better than a tree.
>
>> Second, rather than synchronizing (perhaps with locks, perhaps with
>> lockless data structures), it's often beneficial to create multiple
>> threads, each of which holds a fraction of your connection tracking data.
> Yes, I REALLY REALLY REALLY wanted to do RSS. But the virtio-net and other
> VM's don't support RSS, unlike the classic PCIe NIC's. In order to get the
> community to use my app I have to give them a "batteries included"
> environment, where the system can still work even with no RSS.
For an example of a tcp stack on top of dpdk please see seastar [1]. It
supports hardware RSS, software RSS, or a combination (if the number of
hardware queues is smaller than the number of cores).
>> Third, it's very worthwhile to have a cache for the most recently accessed
>> connection. First, because network traffic is bursty, and you'll
>> frequently see multiple packets from the same connection in succession.
>> Second, because it can make life easier for your application code. If you
>> have multiple places that need to access connection data, you don't have to
>> worry so much about the cost of repeated searches. Again, this may or may
>> not matter for your particular application. But for ones I've worked on,
>> it's been a win.
> Yes, this sounds like a really good idea. One advantage in my product, I am
> only doing TCP Syslog, so I don't have an arbitrary zillion connections like
> FW or IPS would want. I could cap it at something like 8192 or 16384 and be
> good enough for some time until a better solution is worked out.
>
> I could make some capped array or linked list of the X most recent ones for
> cheap access. It's just socket pointers so it doesn't hardly cost anything to
> copy a couple pointers into a cache and quickly invalidate when the connection
> closes.
A simple per-core hash table is sufficient in our experience. Yes, you
will take a cache miss, but it's not the end of the world.
[1] https://github.com/cloudius-systems/seastar
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [dpdk-dev] Appropriate DPDK data structures for TCP sockets
2015-02-23 21:51 ` Avi Kivity
@ 2015-03-13 6:41 ` Matthew Hall
0 siblings, 0 replies; 7+ messages in thread
From: Matthew Hall @ 2015-03-13 6:41 UTC (permalink / raw)
To: Avi Kivity; +Cc: <dev@dpdk.org>
On Mon, Feb 23, 2015 at 11:51:46PM +0200, Avi Kivity wrote:
> https://github.com/cloudius-systems/seastar
Hi Avi and others,
My code unintentionally ended up looking somewhat like a C version of your
seastar C++ code, even though I didn't really look at yours too much when I
coded mine as it was using a lot of hardcore C++ features I really don't have
any clue about. :)
Someday maybe we can all do a bake-off of tests of DPDK TCPs and Host TCPs and
see what kind of stability, features, and performance we can get.
I didn't use anything too crazy or high performance just yet... just rte_hash
and some spinlocks... but I'm keeping all the collective advice in mind to
figure out how I'm going to make all this stuff work right soon.
Matthew.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-03-13 6:41 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-22 0:38 [dpdk-dev] Appropriate DPDK data structures for TCP sockets Matthew Hall
2015-02-23 0:02 ` Stephen Hemminger
2015-02-23 4:50 ` Matthew Hall
2015-02-23 14:48 ` Matt Laswell
2015-02-23 21:16 ` Matthew Hall
2015-02-23 21:51 ` Avi Kivity
2015-03-13 6:41 ` Matthew Hall
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).