* [dpdk-dev] rte_lpm with larger nexthops or another method?
@ 2015-06-22 2:29 Matthew Hall
2015-06-22 10:11 ` Vladimir Medvedkin
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-22 2:29 UTC (permalink / raw)
To: <dev@dpdk.org>
Hello,
I have spent days looking all over the internet at a bunch of different radix tree implementations, to see if I could figure out a way to implement my own tree, just to work around the really low 255 CIDR block limit in librte_lpm. Unfortunately, every single one I could find falls into one of two annoying categories:
1) bloated with a lot of irrelevant kernel code I don't care about (especially the Linux version, but also the BSD one, which also makes the weird assumption that every address object stores its length in byte 0 of the address struct). These are hard to convert into something that plays nice with raw packet data.
2) seemingly very simple code, which breaks horribly if you try to add IPv6 support (such as the radix tree from the University of Michigan / LLVM compiler benchmark suite, and the one from the old unmaintained mrt daemon, which includes a bizarre custom reference-counted memory manager that is very convoluted). These are easy to set up, but cause a lot of weird segfaults which I am having a difficult time debugging.
So it seems like I am going nowhere with this approach. Instead, I'd like to know: what would I need to do to add this support to my local copy of librte_lpm? Let's assume, for the sake of this discussion, that I don't care one iota about any performance cost, and that I am happy to prefetch two cache lines instead of just one (which I recall from a past thread is why librte_lpm has such a low next-hop limit in the first place).
Failing that, does anybody have a known good userspace version of any of these sorts of items:
1) Hash Based FIB (forwarding information base),
2) Tree Based FIB,
3) Patricia trie (which does not break horribly on IPv6 or make bad assumptions about data format besides uint8_t* and length),
4) Crit-Bit tree
5) any other good way of taking IPv4 and IPv6 and finding the longest prefix match against a table of pre-loaded CIDR blocks?
I am really pulling out my hair trying to find a way to do something which doesn't seem like it should have to be this difficult. I must be missing a more obvious way to handle this.
Thanks,
Matthew
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-22 2:29 [dpdk-dev] rte_lpm with larger nexthops or another method? Matthew Hall
@ 2015-06-22 10:11 ` Vladimir Medvedkin
2015-06-22 17:53 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Medvedkin @ 2015-06-22 10:11 UTC (permalink / raw)
To: Matthew Hall; +Cc: <dev@dpdk.org>
Hi Matthew,
I just recently thought about a next_hop extension. For IPv4 we can do something like:
struct rte_lpm_tbl24_entry {
    /* Stores next hop or group index (i.e. gindex) into tbl8. */
    union {
        uint32_t next_hop    :24;
        uint32_t tbl8_gindex :24;
    } __attribute__((__packed__));
    /* Using the remaining 8 bits to store 3 values. */
    uint32_t valid     :1; /**< Validation flag. */
    uint32_t ext_entry :1; /**< External entry. */
    uint32_t depth     :6; /**< Rule depth. */
};
so we have 24 bits for next_hop.
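Since the whole point of the old 8-bit next_hop was keeping each tbl24 entry inside a single 32-bit word, it is worth confirming that the widened layout still fits; packed bit-field layout is compiler-dependent, so here is a quick standalone check (a sketch only, not part of librte_lpm):

/* Standalone sketch: check that the proposed 24-bit next_hop entry still
 * fits in 4 bytes on your compiler.  It mirrors the struct above under a
 * different name and is not part of librte_lpm itself. */
#include <stdint.h>
#include <stdio.h>

struct tbl24_entry_proposed {
    union {
        uint32_t next_hop    :24;
        uint32_t tbl8_gindex :24;
    } __attribute__((__packed__));
    uint32_t valid     :1;
    uint32_t ext_entry :1;
    uint32_t depth     :6;
};

int main(void)
{
    /* With GCC on x86 this is expected to print 4; anything larger means
     * the entry no longer fits in one 32-bit word and the table grows. */
    printf("sizeof(struct tbl24_entry_proposed) = %zu\n",
           sizeof(struct tbl24_entry_proposed));
    return 0;
}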
2015-06-22 5:29 GMT+03:00 Matthew Hall <mhall@mhcomputing.net>:
> Hello,
>
> I have gone out on the internet for days looking at a bunch of different
> radix tree implementations to see if I could figure a way to implement my
> own tree, just to work around the really low 255 CIDR block limitation in
> librte_lpm. Unfortunately every single one I could find falls into one of
> these two annoying categories:
>
> 1) bloated with a lot of irrelevant kernel code I don't care about
> (especially the Linux version but also the BSD one, which also makes the
> weird assumption that every address object stores its length in byte 0 of the
> address struct). These are hard to convert into something that plays nice
> with raw packet data.
>
> 2) very seemingly simple code, which breaks horribly if you try to add
> IPv6 support (such as the radix tree from University of Michigan / LLVM
> compiler benchmark suite, and the one from the old unmaintained mrt daemon,
> which includes a bizarre custom reference-counted memory manager that is
> very convoluted). These are easy to set up, but cause a lot of weird
> segfaults which I am having a difficult time to try to debug.
>
> So it seems like I am going nowhere with this approach. Instead, I'd like
> to know, what would I need to do to add this support to my local copy of
> librte_lpm? Let's assume for the sake of this discussion, that I don't care
> one iota about any performance cost, and I am happy if I need to prefetch
> two cachelines instead of just one (which I recall from a past thread is
> why librte_lpm has such a low nexthop limit to start with).
>
> Failing that, does anybody have a known good userspace version of any of
> these sort of items:
>
> 1) Hash Based FIB (forwarding information base),
> 2) Tree Based FIB,
> 3) Patricia trie (which does not break horribly on IPv6 or make bad
> assumptions about data format besides uint8_t* and length),
> 4) Crit-Bit tree
> 5) any other good way of taking IPv4 and IPv6 and finding the longest
> prefix match against a table of pre-loaded CIDR blocks?
>
> I am really pulling out my hair trying to find a way to do something which
> doesn't seem like it should have to be this difficult. I must be missing
> a more obvious way to handle this.
>
> Thanks,
> Matthew
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-22 10:11 ` Vladimir Medvedkin
@ 2015-06-22 17:53 ` Matthew Hall
2015-06-23 3:51 ` Stephen Hemminger
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-22 17:53 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: <dev@dpdk.org>
That's a lot better indeed! 16 million CIDR blocks would be a huge improvement.
Do you happen to know what is also possible for rte_lpm6?
Matthew.
On Mon, Jun 22, 2015 at 01:11:14PM +0300, Vladimir Medvedkin wrote:
> Hi Matthew,
>
> I just recently thought about a next_hop extension. For IPv4 we can do
> something like:
> struct rte_lpm_tbl24_entry {
>     /* Stores next hop or group index (i.e. gindex) into tbl8. */
>     union {
>         uint32_t next_hop    :24;
>         uint32_t tbl8_gindex :24;
>     } __attribute__((__packed__));
>     /* Using the remaining 8 bits to store 3 values. */
>     uint32_t valid     :1; /**< Validation flag. */
>     uint32_t ext_entry :1; /**< External entry. */
>     uint32_t depth     :6; /**< Rule depth. */
> };
> so we have 24 bits for next_hop.
>
> 2015-06-22 5:29 GMT+03:00 Matthew Hall <mhall@mhcomputing.net>:
>
> > Hello,
> >
> > I have gone out on the internet for days looking at a bunch of different
> > radix tree implementations to see if I could figure a way to implement my
> > own tree, just to work around the really low 255 CIDR block limitation in
> > librte_lpm. Unfortunately every single one I could find falls into one of
> > these two annoying categories:
> >
> > 1) bloated with a lot of irrelevant kernel code I don't care about
> > (especially the Linux version but also the BSD one, which also makes the
> > weird assumption that every address object stores its length in byte 0 of the
> > address struct). These are hard to convert into something that plays nice
> > with raw packet data.
> >
> > 2) very seemingly simple code, which breaks horribly if you try to add
> > IPv6 support (such as the radix tree from University of Michigan / LLVM
> > compiler benchmark suite, and the one from the old unmaintained mrt daemon,
> > which includes a bizarre custom reference-counted memory manager that is
> > very convoluted). These are easy to set up, but cause a lot of weird
> > segfaults which I am having a difficult time to try to debug.
> >
> > So it seems like I am going nowhere with this approach. Instead, I'd like
> > to know, what would I need to do to add this support to my local copy of
> > librte_lpm? Let's assume for the sake of this discussion, that I don't care
> > one iota about any performance cost, and I am happy if I need to prefetch
> > two cachelines instead of just one (which I recall from a past thread is
> > why librte_lpm has such a low nexthop limit to start with).
> >
> > Failing that, does anybody have a known good userspace version of any of
> > these sort of items:
> >
> > 1) Hash Based FIB (forwarding information base),
> > 2) Tree Based FIB,
> > 3) Patricia trie (which does not break horribly on IPv6 or make bad
> > assumptions about data format besides uint8_t* and length),
> > 4) Crit-Bit tree
> > 5) any other good way of taking IPv4 and IPv6 and finding the longest
> > prefix match against a table of pre-loaded CIDR blocks?
> >
> > I am really pulling out my hair trying to find a way to do something which
> > doesn't seem like it should have to be this difficult. I must be missing
> > a more obvious way to handle this.
> >
> > Thanks,
> > Matthew
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-22 17:53 ` Matthew Hall
@ 2015-06-23 3:51 ` Stephen Hemminger
2015-06-23 6:30 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2015-06-23 3:51 UTC (permalink / raw)
To: Matthew Hall; +Cc: <dev@dpdk.org>
In order to make the Vyatta/Brocade router work with the LPM code,
I ended up redoing the layout. It is:
/** Tbl24 entry structure. */
struct rte_lpm_tbl24_entry {
    /* Using a single uint8_t to store the two flags. */
    uint8_t valid     :1; /**< Validation flag. */
    uint8_t ext_entry :1; /**< External entry. */
    uint8_t depth;        /**< Rule depth. */
    /* Stores next hop or group index (i.e. gindex) into tbl8. */
    union {
        uint16_t next_hop;
        uint16_t tbl8_gindex;
    };
};
/** Tbl8 entry structure. */
struct rte_lpm_tbl8_entry {
    uint16_t next_hop;       /**< Next hop. */
    uint8_t  depth;          /**< Rule depth. */
    uint8_t  valid       :1; /**< Validation flag. */
    uint8_t  valid_group :1; /**< Group validation flag. */
};
And also several other scalability improvements (plus IPv6)
and the correct handling of /32.
Unfortunately, this is such a big binary change that I was
reluctant to break any tests or applications using existing code
and therefore never submitted the patches.
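For reference (from memory, and consistent with the field comments quoted elsewhere in this thread; check rte_lpm.h in your tree for the authoritative definition), the then-current upstream tbl24 entry looked roughly like the sketch below. Both re-layouts in this thread grow each entry from 2 to 4 bytes, which matters for the size assertions that come up later.

/* Rough sketch of the upstream 2-byte tbl24 entry at the time, shown for
 * comparison only. */
struct rte_lpm_tbl24_entry_old {
    /* Stores next hop or group index (i.e. gindex) into tbl8. */
    union {
        uint8_t next_hop;
        uint8_t tbl8_gindex;
    };
    /* Using a single uint8_t to store 3 values. */
    uint8_t valid     :1; /**< Validation flag. */
    uint8_t ext_entry :1; /**< External entry. */
    uint8_t depth     :6; /**< Rule depth. */
};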
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-23 3:51 ` Stephen Hemminger
@ 2015-06-23 6:30 ` Matthew Hall
2015-06-23 7:19 ` Vladimir Medvedkin
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-23 6:30 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: <dev@dpdk.org>
On Mon, Jun 22, 2015 at 11:51:02PM -0400, Stephen Hemminger wrote:
> In order to make Vyatta/Brocade router work with LPM code
> I ended up redoing the layout. It is:
>
> And also several other scalability improvements (plus IPv6)
> and the correct handling of /32.
>
> Unfortunately, this is such a big binary change that I was
> reluctant to break any tests or applications using existing code
> and therefore never submitted the patches.
1. What you and Vladimir have done to this code kicks total ass and will be a huge help, so I am very excited to squeeze in some cycles somewhere to test all of this stuff out ASAP.
2. Vladimir's changes were somewhat smaller, but Stephen, yours are larger. If you could place them into a cloned copy of DPDK or a branch somewhere for convenient pickup, I think I could help you make a lot of progress.
I could help test these fixes in a second app besides your own to get some cross-validation, and help make the required cleanups, so we get a bit more external validation before we try to negotiate a safe way to merge them upstream to Bruce, since he is listed as the LPM maintainer.
My DPDK fork is located here, for example, but it could really be anywhere you like that I can access. Even a one-off zip or tarball with the git repo inside would work; I could host it in my fork, or give you access to the fork so you can push it as a second remote, if you are OK with that...
https://github.com/megahall/dpdk_mhall
Matthew.
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-23 6:30 ` Matthew Hall
@ 2015-06-23 7:19 ` Vladimir Medvedkin
2015-06-24 4:13 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Medvedkin @ 2015-06-23 7:19 UTC (permalink / raw)
To: Matthew Hall; +Cc: <dev@dpdk.org>
Hi all,
Matthew, I think the IPv6 LPM code needs fewer changes:
struct rte_lpm6_tbl_entry {
    uint32_t next_hop    :21; /**< Next hop / next table to be checked. */
    uint32_t depth       :8;  /**< Rule depth. */
    /* Flags. */
    uint32_t valid       :1;  /**< Validation flag. */
    uint32_t valid_group :1;  /**< Group validation flag. */
    uint32_t ext_entry   :1;  /**< External entry. */
};
There are already 21 bits for next_hop (only rte_lpm6_rule needs to change).
In Stephen's approach next_hop is given only 16 bits, which is enough for a next-hop index, but not enough to store the AS number that originates the prefix.
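For context, the rule structure mentioned above is private to rte_lpm6.c; assuming it still looks roughly like the sketch below (please verify against your tree), the change is just widening its next_hop field:

/* Sketch only -- assumed layout of the rte_lpm6.c rule entry, with
 * next_hop widened from the original uint8_t. */
struct rte_lpm6_rule {
    uint8_t  ip[16];   /**< Rule IP address (RTE_LPM6_IPV6_ADDR_SIZE). */
    uint32_t next_hop; /**< Rule next hop, widened from uint8_t. */
    uint8_t  depth;    /**< Rule depth. */
};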
Regards,
Vladimir
2015-06-23 9:30 GMT+03:00 Matthew Hall <mhall@mhcomputing.net>:
> On Mon, Jun 22, 2015 at 11:51:02PM -0400, Stephen Hemminger wrote:
> > In order to make Vyatta/Brocade router work with LPM code
> > I ended up redoing the layout. It is:
> >
> > And also several other scalability improvements (plus IPv6)
> > and the correct handling of /32.
> >
> > Unfortunately, this is such a big binary change that I was
> > reluctant to break any tests or applications using existing code
> > and therefore never submitted the patches.
>
> 1. What you and Vladimir have done to this code kicks total ass and will
> be a
> huge help so I am very excited to squeeze in some cycles somewhere to test
> all
> of this stuff out ASAP.
>
> 2. Vladimir's changes were somewhat smaller, but Stephen yours are larger.
> Stephen, if you could place them into a cloned copy of DPDK or a branch
> somewhere for convenient pickup, I think I could help you make a lot of
> progress.
>
> I could help test these fixes in a second app besides your own to get some
> cross validation, and help make the required cleanups, so we could get a
> bit
> more external validation before we try to negotiate a safe way to merge
> them
> upstream to Bruce since he is marked as the LPM maintainer.
>
> My DPDK fork is located here, for example, but it could really be anywhere
> you
> like to put it which I could access. Or even a one-off zip or tarball with
> the
> git repo inside and I could host it in my fork or give you access on the
> fork
> to push it as a second remote if you are OK to do that...
>
> https://github.com/megahall/dpdk_mhall
>
> Matthew.
>
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-23 7:19 ` Vladimir Medvedkin
@ 2015-06-24 4:13 ` Matthew Hall
2015-06-24 4:28 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-24 4:13 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: <dev@dpdk.org>
On Tue, Jun 23, 2015 at 10:19:58AM +0300, Vladimir Medvedkin wrote:
> Hi all,
>
> Matthew, I think the IPv6 LPM code needs fewer changes:
> struct rte_lpm6_tbl_entry {
>     uint32_t next_hop    :21; /**< Next hop / next table to be checked. */
>     uint32_t depth       :8;  /**< Rule depth. */
>
>     /* Flags. */
>     uint32_t valid       :1;  /**< Validation flag. */
>     uint32_t valid_group :1;  /**< Group validation flag. */
>     uint32_t ext_entry   :1;  /**< External entry. */
> };
> There are already 21 bits for next_hop (only rte_lpm6_rule needs to change).
> In Stephen's approach next_hop is given only 16 bits, which is enough for a
> next-hop index, but not enough to store the AS number that originates the
> prefix.
>
> Regards,
> Vladimir
Vladimir,
One thing I was confused about: you published the changes to rte_lpm_tbl24_entry, but you didn't say what you did to change rte_lpm_tbl8_entry, as that one only had an 8-bit next_hop as well. I wanted to be sure I didn't change it incorrectly and break something.
Hopefully Stephen can make his bug fixes available so I can pull all of this together and try to make a patchset for dpdk-next to test it all out. It would be a huge win compared to all the crappy LPM code I found on the internet.
Matthew.
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-24 4:13 ` Matthew Hall
@ 2015-06-24 4:28 ` Matthew Hall
2015-06-24 5:15 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-24 4:28 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: <dev@dpdk.org>
On Tue, Jun 23, 2015 at 09:13:14PM -0700, Matthew Hall wrote:
> Vladimir,
>
> One thing I was confused, you published the changes to rte_lpm_tbl24_entry but
> you didn't say what you did to change rte_lpm_tbl8_entry, as that one only had
> an 8-bit next_hop as well. I wanted to be sure I didn't change it wrong and
> break something.
>
> Hopefully Stephen can make his bug fixes available so I could add all of this
> together and try to make a patchset for dpdk-next to test it all out. Would be
> a huge win compared to all the crappy LPM code I found on the Internet.
>
> Matthew.
Another thing: this part errors out now. Not sure if it's just a performance warning, or if it will break the code when I just comment it out... ;)
RTE_BUILD_BUG_ON(sizeof(struct rte_lpm_tbl24_entry) != 2);
RTE_BUILD_BUG_ON(sizeof(struct rte_lpm_tbl8_entry) != 2);
/vagrant/external/dpdk/lib/librte_lpm/rte_lpm.c:162:2: error: array size is negative
RTE_BUILD_BUG_ON(sizeof(struct rte_lpm_tbl24_entry) != 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/vagrant/external/dpdk/build/include/rte_common.h:178:21: note: expanded from macro
'RTE_BUILD_BUG_ON'
((void)sizeof(char[1 - 2*!!(condition)])); \
^~~~~~~~~~~~~~~~~~~
/vagrant/external/dpdk/lib/librte_lpm/rte_lpm.c:163:2: error: array size is negative
RTE_BUILD_BUG_ON(sizeof(struct rte_lpm_tbl8_entry) != 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/vagrant/external/dpdk/build/include/rte_common.h:178:21: note: expanded from macro
'RTE_BUILD_BUG_ON'
((void)sizeof(char[1 - 2*!!(condition)])); \
^~~~~~~~~~~~~~~~~~~
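If those RTE_BUILD_BUG_ON lines are just compile-time size assertions, then rather than commenting them out it should be enough to update them to the new entry size -- a sketch, assuming the widened entries really do come out to 4 bytes on the target compiler:

/* Sketch: updated size assertions, assuming the widened 4-byte entries. */
RTE_BUILD_BUG_ON(sizeof(struct rte_lpm_tbl24_entry) != 4);
RTE_BUILD_BUG_ON(sizeof(struct rte_lpm_tbl8_entry) != 4);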
Matthew.
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-24 4:28 ` Matthew Hall
@ 2015-06-24 5:15 ` Matthew Hall
2015-06-24 7:04 ` Vladimir Medvedkin
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-24 5:15 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: <dev@dpdk.org>
OK, I went and made a whole ton of patches to LPM and the tests and examples, and now the selftest errors out... but I think maybe I don't have an adequate amount of hugepages. How much hugepage memory did people have when they ran the selftest successfully before?
I just keep seeing this over and over...
RTE>>lpm_autotest
Error at line 293:
ERROR: LPM Test tests[i]: FAIL
LPM: LPM memory allocation failed
Matthew.
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-24 5:15 ` Matthew Hall
@ 2015-06-24 7:04 ` Vladimir Medvedkin
2015-06-24 17:56 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Medvedkin @ 2015-06-24 7:04 UTC (permalink / raw)
To: Matthew Hall; +Cc: <dev@dpdk.org>
Hi Matthew,
I published changes to rte_lpm_tbl24_entry only, because it was just an idea. :)
So rte_lpm_tbl8_entry should look like:
struct rte_lpm_tbl8_entry {
    uint32_t next_hop    :24; /**< Next hop. */
    uint32_t valid       :1;  /**< Validation flag. */
    uint32_t valid_group :1;  /**< Group validation flag. */
    uint32_t depth       :6;  /**< Rule depth. */
};
...and
struct rte_lpm_rule {
    uint32_t ip;       /**< Rule IP address. */
    uint32_t next_hop; /**< Rule next hop. */
};
...and the various defines and checks should be modified too.
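For example (a sketch from memory of the then-current prototypes -- please verify against rte_lpm.h in your tree), the public entry points that carry a next hop would need to be widened to match:

/* Sketch: widened next_hop in the public API.  The old prototypes, as I
 * recall them, took uint8_t next_hop / uint8_t *next_hop. */
struct rte_lpm;

int rte_lpm_add(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
                uint32_t next_hop);
int rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop);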
2015-06-24 8:15 GMT+03:00 Matthew Hall <mhall@mhcomputing.net>:
> OK, I went and made a whole ton of patches to LPM and the tests and
> examples,
> now the selftest errors out... but I think maybe I don't have an adequate
> amount of hugepages. How much hugepage memory did people have when they
> did the selftest successfully before?
>
> I just keep seeing this over and over...
>
> RTE>>lpm_autotest
> Error at line 293:
> ERROR: LPM Test tests[i]: FAIL
> LPM: LPM memory allocation failed
>
> Matthew.
>
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-24 7:04 ` Vladimir Medvedkin
@ 2015-06-24 17:56 ` Matthew Hall
2015-06-26 7:01 ` Matthew Hall
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Hall @ 2015-06-24 17:56 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: <dev@dpdk.org>
On Wed, Jun 24, 2015 at 10:04:53AM +0300, Vladimir Medvedkin wrote:
> I published changes to rte_lpm_tbl24_entry only because it was just an idea
> :)
Understood. Just wanted to be sure I understood it correctly before converting it into C code. :)
> different defines and checks should be modified too.
Yes, definitely... I modified them last night and got something compilable. Now
just hoping somebody who knows more about librte_lpm will reply and explain
how to get the self tests to run so I can be sure that the changes don't just
compile but also actually run.
Matthew.
* Re: [dpdk-dev] rte_lpm with larger nexthops or another method?
2015-06-24 17:56 ` Matthew Hall
@ 2015-06-26 7:01 ` Matthew Hall
0 siblings, 0 replies; 12+ messages in thread
From: Matthew Hall @ 2015-06-26 7:01 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: <dev@dpdk.org>
Hi guys,
So I just finally fixed all the weird unit test failures for the basic mode. But rte_lpm_lookupx4 will take quite a bit longer, because it uses a lot of really weird SSE intrinsics I have never used before in my life.
However, pretty soon I should have some kind of results on how it works. Sadly, it appears to be only half as fast as the old version in the benchmarks, but I suspect this won't matter in my specific app; we shall see...
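In case it helps anyone trying the same thing: while the vector path is being reworked, a plain scalar stand-in that just calls the (widened) single lookup four times is a reasonable sketch of what lookupx4 has to produce; the function name and the uint32_t-based signature below are made up for illustration.

/* Sketch: scalar stand-in for rte_lpm_lookupx4 while the SSE path is being
 * reworked.  Assumes the widened lookup discussed in this thread; defv is
 * written for addresses that have no matching prefix. */
static inline void
lpm_lookupx4_scalar(struct rte_lpm *lpm, const uint32_t ip[4],
                    uint32_t hop[4], uint32_t defv)
{
    int i;

    for (i = 0; i < 4; i++) {
        if (rte_lpm_lookup(lpm, ip[i], &hop[i]) != 0)
            hop[i] = defv;
    }
}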
Matthew.
Thread overview: 12+ messages
2015-06-22 2:29 [dpdk-dev] rte_lpm with larger nexthops or another method? Matthew Hall
2015-06-22 10:11 ` Vladimir Medvedkin
2015-06-22 17:53 ` Matthew Hall
2015-06-23 3:51 ` Stephen Hemminger
2015-06-23 6:30 ` Matthew Hall
2015-06-23 7:19 ` Vladimir Medvedkin
2015-06-24 4:13 ` Matthew Hall
2015-06-24 4:28 ` Matthew Hall
2015-06-24 5:15 ` Matthew Hall
2015-06-24 7:04 ` Vladimir Medvedkin
2015-06-24 17:56 ` Matthew Hall
2015-06-26 7:01 ` Matthew Hall