* rte_fib network order bug
@ 2024-11-12 9:31 Robin Jarry
2024-11-13 10:42 ` Medvedkin, Vladimir
0 siblings, 1 reply; 13+ messages in thread
From: Robin Jarry @ 2024-11-12 9:31 UTC (permalink / raw)
To: Vladimir Medvedkin, dev
Hi Vladimir,
I started playing with the new RTE_FIB_F_NETWORK_ORDER flag and I found
that it does not work at all.
rte_fib is based on rte_rib to perform all slow path lookups and
modifications. Unfortunately, rte_rib does not handle network order
addresses. This causes the added routes to be incorrectly masked and
thus inserted at the wrong place in the dir24 structure.
Here is the config I am using:
static struct rte_fib_conf fib_conf = {
    .type = RTE_FIB_DIR24_8,
    .default_nh = 0,
    .max_routes = 65536,
    .rib_ext_sz = 0,
    .dir24_8 = {
        .nh_sz = RTE_FIB_DIR24_8_8B,
        .num_tbl8 = 1 << 15,
    },
    .flags = RTE_FIB_F_NETWORK_ORDER,
};
And here is a short gdb session demonstrating the issue:
rte_fib_add (fib=0x16ed79740, ip=67305985, depth=24 '\030', next_hop=6438576192) at ../subprojects/dpdk/lib/fib/rte_fib.c:126
126 if ((fib == NULL) || (fib->modify == NULL) ||
(gdb)
129 return fib->modify(fib, ip, depth, next_hop, RTE_FIB_ADD);
(gdb) p (char*)inet_ntoa(ip)
$9 = 0x7ffff72a471c "1.2.3.4"
(gdb) p depth
$10 = 24 '\030'
(gdb) p fib->flags
$11 = 1 (
(gdb) s
dir24_8_modify (fib=0x16ed79740, ip=67305985, depth=24 '\030', next_hop=6438576192, op=0)
at ../subprojects/dpdk/lib/fib/dir24_8.c:452
452 struct rte_rib_node *tmp = NULL;
(gdb) n
455 int ret = 0;
(gdb)
458 if ((fib == NULL) || (depth > RTE_FIB_MAXDEPTH))
(gdb)
461 dp = rte_fib_get_dp(fib);
(gdb)
462 rib = rte_fib_get_rib(fib);
(gdb)
465 if (next_hop > get_max_nh(dp->nh_sz))
(gdb)
468 ip &= rte_rib_depth_to_mask(depth);
(gdb) p (char*)inet_ntoa(ip)
$12 = 0x7ffff72a471c "1.2.3.4"
(gdb) p depth
$13 = 24 '\030'
(gdb) n
470 node = rte_rib_lookup_exact(rib, ip, depth);
(gdb) p (char*)inet_ntoa(ip)
$14 = 0x7ffff72a471c "0.2.3.4"
(gdb) p (char*)inet_ntoa(rte_rib_depth_to_mask(depth))
$15 = 0x7ffff72a471c "0.255.255.255"
As you can see, the generated mask is invalid. It should have been
255.255.255.0.
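For reference, here is a minimal standalone sketch (plain C, not the DPDK
code itself) of the mismatch on a little-endian CPU: masking the
network-order address with the host-order mask clears the wrong byte.

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    int main(void)
    {
        uint32_t ip_be = 0x04030201; /* "1.2.3.4" in network order on LE */
        uint32_t mask = 0xffffff00;  /* host-order /24 mask, what
                                      * rte_rib_depth_to_mask(24) returns */
        struct in_addr a;

        a.s_addr = ip_be & mask;               /* -> "0.2.3.4" (the bug) */
        printf("%s\n", inet_ntoa(a));

        a.s_addr = htonl(ntohl(ip_be) & mask); /* -> "1.2.3.0" (expected) */
        printf("%s\n", inet_ntoa(a));
        return 0;
    }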
rte_rib would need to have similar network order support and should
inherit it from rte_fib.
Let me know if you need more information.
* Re: rte_fib network order bug
2024-11-12 9:31 rte_fib network order bug Robin Jarry
@ 2024-11-13 10:42 ` Medvedkin, Vladimir
2024-11-13 13:27 ` Robin Jarry
0 siblings, 1 reply; 13+ messages in thread
From: Medvedkin, Vladimir @ 2024-11-13 10:42 UTC (permalink / raw)
To: Robin Jarry, dev
Hi Robin,
It should not. Here is what the documentation says regarding this flag:
/** If set, fib lookup is expecting IPv4 address in network byte order */
#define RTE_FIB_F_NETWORK_ORDER 1
As stated above, lookups expect addresses to be in BE byte order. The
control plane API expects IPv4 prefix addresses to be in CPU byte order.
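In other words, with the flag set the intended usage is roughly the
following (a sketch only, error handling omitted, assuming the fib was
created with RTE_FIB_F_NETWORK_ORDER):

    #include <rte_fib.h>
    #include <rte_ip.h>
    #include <rte_byteorder.h>

    static void example(struct rte_fib *fib)
    {
        /* control plane: prefix in CPU byte order, as today */
        rte_fib_add(fib, RTE_IPV4(10, 0, 0, 0), 24, 42);

        /* data plane: lookups take addresses in network byte order */
        uint32_t ips[1] = { rte_cpu_to_be_32(RTE_IPV4(10, 0, 0, 1)) };
        uint64_t hops[1];
        rte_fib_lookup_bulk(fib, ips, hops, 1); /* hops[0] == 42 */
    }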
On 12/11/2024 09:31, Robin Jarry wrote:
> Hi Vladimir,
>
> I started playing with the new RTE_FIB_F_NETWORK_ORDER flag and I
> found that it does not work at all.
>
> rte_fib is based on rte_rib to perform all slow path lookups and
> modifications. Unfortunately, rte_rib does not handle network order
> addresses. This causes the added routes to be incorrectly masked and
> thus inserted at the wrong place in the dir24 structure.
>
> Here is the config I am using:
>
> static struct rte_fib_conf fib_conf = {
>     .type = RTE_FIB_DIR24_8,
>     .default_nh = 0,
>     .max_routes = 65536,
>     .rib_ext_sz = 0,
>     .dir24_8 = {
>         .nh_sz = RTE_FIB_DIR24_8_8B,
>         .num_tbl8 = 1 << 15,
>     },
>     .flags = RTE_FIB_F_NETWORK_ORDER,
> };
>
> And here is a short gdb session demonstrating the issue:
>
> rte_fib_add (fib=0x16ed79740, ip=67305985, depth=24 '\030',
> next_hop=6438576192) at ../subprojects/dpdk/lib/fib/rte_fib.c:126
> 126 if ((fib == NULL) || (fib->modify == NULL) ||
> (gdb) 129 return fib->modify(fib, ip, depth, next_hop,
> RTE_FIB_ADD);
> (gdb) p (char*)inet_ntoa(ip)
> $9 = 0x7ffff72a471c "1.2.3.4"
> (gdb) p depth
> $10 = 24 '\030'
> (gdb) p fib->flags $11 = 1 (
> (gdb) s
> dir24_8_modify (fib=0x16ed79740, ip=67305985, depth=24 '\030',
> next_hop=6438576192, op=0)
> at ../subprojects/dpdk/lib/fib/dir24_8.c:452
> 452 struct rte_rib_node *tmp = NULL;
> (gdb) n
> 455 int ret = 0;
> (gdb) 458 if ((fib == NULL) || (depth > RTE_FIB_MAXDEPTH))
> (gdb) 461 dp = rte_fib_get_dp(fib);
> (gdb) 462 rib = rte_fib_get_rib(fib);
> (gdb) 465 if (next_hop > get_max_nh(dp->nh_sz))
> (gdb) 468 ip &= rte_rib_depth_to_mask(depth);
> (gdb) p (char*)inet_ntoa(ip)
> $12 = 0x7ffff72a471c "1.2.3.4"
> (gdb) p depth
> $13 = 24 '\030'
> (gdb) n
> 470 node = rte_rib_lookup_exact(rib, ip, depth);
> (gdb) p (char*)inet_ntoa(ip)
> $14 = 0x7ffff72a471c "0.2.3.4"
> (gdb) p (char*)inet_ntoa(rte_rib_depth_to_mask(depth))
> $15 = 0x7ffff72a471c "0.255.255.255"
>
> As you can see, the generated mask is invalid. It should have been
> 255.255.255.0.
>
> rte_rib would need to have similar network order support and should
> inherit it from rte_fib.
>
> Let me know if you need more information.
>
--
Regards,
Vladimir
* Re: rte_fib network order bug
2024-11-13 10:42 ` Medvedkin, Vladimir
@ 2024-11-13 13:27 ` Robin Jarry
2024-11-13 19:39 ` Medvedkin, Vladimir
0 siblings, 1 reply; 13+ messages in thread
From: Robin Jarry @ 2024-11-13 13:27 UTC (permalink / raw)
To: Medvedkin, Vladimir, dev
Medvedkin, Vladimir, Nov 13, 2024 at 11:42:
> Hi Robin,
>
> It should not. Here is documentation says regarding this flag:
>
> /** If set, fib lookup is expecting IPv4 address in network byte order */
> #define RTE_FIB_F_NETWORK_ORDER 1
>
> As stated above lookups will be performed while expecting addresses to
> be in BE byte order. Control plane API expects IPv4 prefix address to be
> in CPU byte order.
I had not understood that it was *only* the lookups that were network
order.
The original reason why a RTE_FIB_F_NETWORK_ORDER flag was suggested
some time ago is that inet_pton() always returns network order
addresses. It makes it much more natural to keep everything in network
order instead of having to swap things around.
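For example, with the current API something like this is needed on the
control plane side (a rough sketch, declarations and error handling
omitted):

    struct in_addr a;
    inet_pton(AF_INET, "10.0.0.0", &a);
    /* the control plane API wants host order, so a swap is required */
    rte_fib_add(fib, rte_be_to_cpu_32(a.s_addr), 24, next_hop);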
Now, having only the lookup functions require network order addresses,
while all the other functions in the API require host order addresses, is
even more confusing.
In my opinion, when RTE_FIB_F_NETWORK_ORDER is set, *all* functions of
the fib API should take network order addresses. That obviously means
that rib should also have a similar flag.
Is that possible without a massive rework? If not, then I think we
should revert the patch that adds it or at least discourage its use in
its current state.
What do you think?
* RE: rte_fib network order bug
2024-11-13 13:27 ` Robin Jarry
@ 2024-11-13 19:39 ` Medvedkin, Vladimir
2024-11-14 7:43 ` Morten Brørup
0 siblings, 1 reply; 13+ messages in thread
From: Medvedkin, Vladimir @ 2024-11-13 19:39 UTC (permalink / raw)
To: Robin Jarry, dev
I think the control plane API should work with prefix addresses in CPU byte order. At least our RTE_IPV4 macro works this way. Also, a prefix is an address + a prefix length (not a mask), so it is more natural if the address is in CPU byte order.
Also, I think the byte swap should be done at the interface where the byte order changes, and this boundary lies outside the FIB library. However, I added this feature not only because it was asked for, but also to improve performance in some cases, such as using the AVX512 byte swap in the vector path for users who do not want to bother with manually doing the byte swap on the fast path.
Why do you think this would discourage users?
-----Original Message-----
From: Robin Jarry <rjarry@redhat.com>
Sent: Wednesday, November 13, 2024 1:27 PM
To: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>; dev@dpdk.org
Subject: Re: rte_fib network order bug
Medvedkin, Vladimir, Nov 13, 2024 at 11:42:
> Hi Robin,
>
> It should not. Here is documentation says regarding this flag:
>
> /** If set, fib lookup is expecting IPv4 address in network byte order */
> #define RTE_FIB_F_NETWORK_ORDER 1
>
> As stated above lookups will be performed while expecting addresses to
> be in BE byte order. Control plane API expects IPv4 prefix address to
> be in CPU byte order.
I had not understood that it was *only* the lookups that were network order.
The original reason why a RTE_FIB_F_NETWORK_ORDER flag was suggested some time ago is that inet_pton() always returns network order addresses. It makes it much more natural to keep everything in network order instead of having to swap things around.
Now, only having the lookup functions requiring network order addresses but all the other functions in the API requiring host order addresses is even more confusing.
In my opinion, when RTE_FIB_F_NETWORK_ORDER is set, *all* functions of the fib API should take network order addresses. That obviously mean that rib should also have a similar flag.
Is that possible without a massive rework? If not, then I think we should revert the patch that adds it or at least discourage its use in its current state.
What do you think?
* RE: rte_fib network order bug
2024-11-13 19:39 ` Medvedkin, Vladimir
@ 2024-11-14 7:43 ` Morten Brørup
2024-11-14 10:18 ` Robin Jarry
0 siblings, 1 reply; 13+ messages in thread
From: Morten Brørup @ 2024-11-14 7:43 UTC (permalink / raw)
To: Medvedkin, Vladimir, Robin Jarry, dev
> -----Original Message-----
> From: Medvedkin, Vladimir [mailto:vladimir.medvedkin@intel.com]
> Sent: Wednesday, 13 November 2024 20.39
> To: Robin Jarry; dev@dpdk.org
> Subject: RE: rte_fib network order bug
>
> I think control plane API should work with prefix addresses in CPU byte
> order. At least our RTE_IPV4 macro works this way. Also, prefix is an
> address + prefix length (not the mask), so it is more natural if
> address is in cpu byte order.
> Also, I think byte swap should be done on the interface where byte
> order changes, and this boundary lies outside the FIB library. However,
> I've added this feature not only because it was asked, but also trying
> to improve performance in some cases, such as using AVX512 byte swap in
> vector path for users who don't want to bother about manually do
> byteswap on the fast path.
>
> Why do you think this would discourage users?
>
Joining the discussion with a couple of comments.
1. When I saw the byte order flag the first time, it was not clear to me either that it only affected lookups - I too thought it covered the entire API of the library. This needs to be emphasized in the description of the flag. And the flag's name should contain LOOKUP, e.g.:
/** If set, FIB lookup is expecting IPv4 address in network byte order. Note: Only lookup! */
#define RTE_FIB_F_LOOKUP_NETWORK_ORDER 1
2. Control plane API should use CPU byte order. I consider inet_pton() irrelevant in this context.
Adding network byte order lookup for fast path optimization makes good sense, and adding it to the RIB library would be nice too.
If it were an address table (not a longest prefix match table, but a hash or similar), the learn()/update() function could be fast path, and thus support network byte order too; but it's not, so add() is control plane.
>
> -----Original Message-----
> From: Robin Jarry <rjarry@redhat.com>
> Sent: Wednesday, November 13, 2024 1:27 PM
> To: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>; dev@dpdk.org
> Subject: Re: rte_fib network order bug
>
> Medvedkin, Vladimir, Nov 13, 2024 at 11:42:
> > Hi Robin,
> >
> > It should not. Here is documentation says regarding this flag:
> >
> > /** If set, fib lookup is expecting IPv4 address in network byte order */
> > #define RTE_FIB_F_NETWORK_ORDER 1
> >
> > As stated above lookups will be performed while expecting addresses to
> > be in BE byte order. Control plane API expects IPv4 prefix address to
> > be in CPU byte order.
>
> I had not understood that it was *only* the lookups that were network
> order.
>
> The original reason why a RTE_FIB_F_NETWORK_ORDER flag was suggested
> some time ago is that inet_pton() always returns network order
> addresses. It makes it much more natural to keep everything in network
> order instead of having to swap things around.
>
> Now, only having the lookup functions requiring network order addresses
> but all the other functions in the API requiring host order addresses
> is even more confusing.
>
> In my opinion, when RTE_FIB_F_NETWORK_ORDER is set, *all* functions of
> the fib API should take network order addresses. That obviously mean
> that rib should also have a similar flag.
>
> Is that possible without a massive rework? If not, then I think we
> should revert the patch that adds it or at least discourage its use in
> its current state.
>
> What do you think?
* Re: rte_fib network order bug
2024-11-14 7:43 ` Morten Brørup
@ 2024-11-14 10:18 ` Robin Jarry
2024-11-14 14:35 ` Morten Brørup
0 siblings, 1 reply; 13+ messages in thread
From: Robin Jarry @ 2024-11-14 10:18 UTC (permalink / raw)
To: Morten Brørup, Medvedkin, Vladimir, dev
Hi folks,
Morten Brørup, Nov 14, 2024 at 08:43:
> Medvedkin, Vladimir:
>> I think control plane API should work with prefix addresses in CPU
>> byte order. At least our RTE_IPV4 macro works this way. Also, prefix
>> is an address + prefix length (not the mask), so it is more natural
>> if address is in cpu byte order.
This may turn into a religious debate, but in my opinion, an IPv4 address
is *not* an integer. It should be treated as an opaque value. RTE_IPV4
is only useful to define addresses in unit tests.
I do not know of any IPv4 stack implementation that deals with
*host order* addresses. Here are a couple of examples where all
addresses are stored in network order in the control plane:
https://elixir.bootlin.com/linux/v6.11.6/source/net/ipv4/fib_frontend.c#L1069
https://github.com/freebsd/freebsd-src/blob/release/14.1.0/sys/net/route/route_ctl.c#L692
https://git.fd.io/vpp/tree/src/vnet/fib/fib_table.c?h=v24.10#n237
>>
>> Also, I think byte swap should be done on the interface where byte
>> order changes, and this boundary lies outside the FIB library.
>> However, I've added this feature not only because it was asked, but
>> also trying to improve performance in some cases, such as using
>> AVX512 byte swap in vector path for users who don't want to bother
>> about manually do byteswap on the fast path.
>>
>> Why do you think this would discourage users?
>
> Joining the discussing with a couple of comments.
>
> 1. When I saw the byte order flag the first time, it was not clear to
> me either that it only affected lookups - I too thought it covered
> the entire API of the library. This needs to be emphasized in the
> description of the flag. And the flag's name should contain LOOKUP,
> e.g.:
>
> /** If set, FIB lookup is expecting IPv4 address in network byte order. Note: Only lookup! */
> #define RTE_FIB_F_LOOKUP_NETWORK_ORDER 1
>
> 2. Control plane API should use CPU byte order. I consider inet_pton()
> irrelevant in this context. Adding network byte order lookup for
> fast path optimization makes good sense, and adding it to the RIB
> library would be nice too.
>
> If it was an address table (not longest prefix table, but a hash or
> similar), the learn()/update() function could be fast path, and
> thus support network byte order too; but it's not, so add() is
> control plane.
Why would the control plane use a different representation of addresses
than the data plane? Also, for consistency with IPv6, I really think
that *all* addresses should be handled in their network form.
Cheers.
* RE: rte_fib network order bug
2024-11-14 10:18 ` Robin Jarry
@ 2024-11-14 14:35 ` Morten Brørup
2024-11-15 13:01 ` Robin Jarry
0 siblings, 1 reply; 13+ messages in thread
From: Morten Brørup @ 2024-11-14 14:35 UTC (permalink / raw)
To: Robin Jarry, Medvedkin, Vladimir, dev
> From: Robin Jarry [mailto:rjarry@redhat.com]
> Sent: Thursday, 14 November 2024 11.19
>
> Hi folks,
>
> Morten Brørup, Nov 14, 2024 at 08:43:
> > Medvedkin, Vladimir:
> >> I think control plane API should work with prefix addresses in CPU
> >> byte order. At least our RTE_IPV4 macro works this way. Also, prefix
> >> is an address + prefix length (not the mask), so it is more natural
> >> if address is in cpu byte order.
>
> This may get into a religion debate, but in my opinion, an IPv4 address
> is *not* an integer. It should be treated as an opaque value.
In theory, yes. But it is easier/faster to perform range comparisons (First <= IP <= Last) when converting to a host-endian integer for IPv4.
E.g. our appliance allows configuring IP ranges, not only IP subnets.
Thinking more about it, it might just be an old habit that is hard to shake off. Treating IPv4 addresses as byte arrays would allow a simple memcmp() for range comparisons, like IPv6 addresses.
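To illustrate, a minimal sketch, assuming a hypothetical 4-byte array
representation held in network order:

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    /* For equal-length big-endian byte arrays, lexicographic order equals
     * numeric order, so memcmp() is enough for First <= IP <= Last. */
    static bool ipv4_in_range(const uint8_t ip[4],
                              const uint8_t first[4], const uint8_t last[4])
    {
        return memcmp(ip, first, 4) >= 0 && memcmp(ip, last, 4) <= 0;
    }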
> RTE_IPV4
> is only useful to define addresses in unit tests.
There are plenty of special IP addresses and subnets, where a shortcut macro makes the address more readable in the code.
>
> I do not know of any IPv4 stack implementation that deals with
> *host order* addresses. Here are a couple of examples where all
> addresses are stored in network order in the control plane:
>
> https://elixir.bootlin.com/linux/v6.11.6/source/net/ipv4/fib_frontend.c#L1069
>
> https://github.com/freebsd/freebsd-src/blob/release/14.1.0/sys/net/route/route_ctl.c#L692
>
> https://git.fd.io/vpp/tree/src/vnet/fib/fib_table.c?h=v24.10#n237
>
> >>
> >> Also, I think byte swap should be done on the interface where byte
> >> order changes, and this boundary lies outside the FIB library.
> >> However, I've added this feature not only because it was asked, but
> >> also trying to improve performance in some cases, such as using
> >> AVX512 byte swap in vector path for users who don't want to bother
> >> about manually do byteswap on the fast path.
> >>
> >> Why do you think this would discourage users?
> >
> > Joining the discussing with a couple of comments.
> >
> > 1. When I saw the byte order flag the first time, it was not clear to
> > me either that it only affected lookups - I too thought it covered
> > the entire API of the library. This needs to be emphasized in the
> > description of the flag. And the flag's name should contain
> LOOKUP,
> > e.g.:
> >
> > /** If set, FIB lookup is expecting IPv4 address in network byte order. Note: Only lookup! */
> > #define RTE_FIB_F_LOOKUP_NETWORK_ORDER 1
> >
> > 2. Control plane API should use CPU byte order. I consider
> inet_pton()
> > irrelevant in this context. Adding network byte order lookup for
> > fast path optimization makes good sense, and adding it to the RIB
> > library would be nice too.
> >
> > If it was an address table (not longest prefix table, but a hash
> or
> > similar), the learn()/update() function could be fast path, and
> > thus support network byte order too; but it's not, so add() is
> > control plane.
>
> Why would control plane use a different representation of addresses
> compared to data plane?
Excellent question.
Old habit? Growing up using big endian CPUs, we have come to think of IPv4 addresses as 32 bit numbers, so we keep treating them as such. With this old way of thinking, the only reason to use network endian in the fast path with little endian CPUs is for performance reasons (to save the byte swap) - if not, we would still prefer using host endian in the fast path too.
> Also for consistency with IPv6, I really think
> that *all* addresses should be dealt in their network form.
Food for thought!
>
> Cheers.
* Re: rte_fib network order bug
2024-11-14 14:35 ` Morten Brørup
@ 2024-11-15 13:01 ` Robin Jarry
2024-11-15 13:52 ` Morten Brørup
0 siblings, 1 reply; 13+ messages in thread
From: Robin Jarry @ 2024-11-15 13:01 UTC (permalink / raw)
To: Morten Brørup, Medvedkin, Vladimir, dev
Morten Brørup, Nov 14, 2024 at 15:35:
>> RTE_IPV4 is only useful to define addresses in unit tests.
>
> There are plenty of special IP addresses and subnets, where a shortcut
> macro makes the address easier readable in the code.
OK, let me reformulate. I didn't mean to say that RTE_IPV4 is useless.
But it will always generate addresses in *host order*, which means they
cannot be used in IPv4 headers without passing them through htonl().
This is weird in my opinion.
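For example, filling a header field today requires an explicit swap
(a sketch, assuming hdr is a struct rte_ipv4_hdr pointer):

    hdr->dst_addr = rte_cpu_to_be_32(RTE_IPV4(192, 168, 0, 1));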
>> Why would control plane use a different representation of addresses
>> compared to data plane?
>
> Excellent question.
> Old habit? Growing up using big endian CPUs, we have come to think of
> IPv4 addresses as 32 bit numbers, so we keep treating them as such.
> With this old way of thinking, the only reason to use network endian
> in the fast path with little endian CPUs is for performance reasons
> (to save the byte swap) - if not, we would still prefer using host
> endian in the fast path too.
I understand the implementation reasons why you would prefer working
with host order integers. But the APIs that deal with IPv4 addresses
should not reflect implementation details.
>> Also for consistency with IPv6, I really think
>> that *all* addresses should be dealt in their network form.
>
> Food for thought!
Vladimir, could we at least consider adding a real network order mode
for the rib and fib libraries? So that we can have consistent APIs
between IPv4 and IPv6?
On that same topic, I wonder if it would make sense to change the API
parameters to use an opaque rte_ipv4_addr_t type instead of a native
uint32_t to avoid any confusion.
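Something along these lines, purely hypothetical and only for
illustration (no such type exists in DPDK today), mirroring what was
done for IPv6:

    /* hypothetical type, for illustration only */
    typedef struct rte_ipv4_addr {
        uint8_t a[4]; /* always in network (wire) order */
    } rte_ipv4_addr_t;

    /* a control plane prototype could then look like this
     * (hypothetical signature, not the current API): */
    int rte_fib_add(struct rte_fib *fib, rte_ipv4_addr_t ip, uint8_t depth,
                    uint64_t next_hop);

That would also make the expected byte order explicit in the type itself.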
Thanks!
* RE: rte_fib network order bug
2024-11-15 13:01 ` Robin Jarry
@ 2024-11-15 13:52 ` Morten Brørup
2024-11-15 14:07 ` Bruce Richardson
2024-11-15 14:28 ` Robin Jarry
0 siblings, 2 replies; 13+ messages in thread
From: Morten Brørup @ 2024-11-15 13:52 UTC (permalink / raw)
To: Robin Jarry, Medvedkin, Vladimir, dev
> From: Robin Jarry [mailto:rjarry@redhat.com]
> Sent: Friday, 15 November 2024 14.02
>
> Morten Brørup, Nov 14, 2024 at 15:35:
> >> RTE_IPV4 is only useful to define addresses in unit tests.
> >
> > There are plenty of special IP addresses and subnets, where a
> shortcut
> > macro makes the address easier readable in the code.
>
> OK, let me reformulate. I didn't mean to say that RTE_IPV4 is useless.
> But it will always generate addresses in *host order*. Which means they
> cannot be used in IPv4 headers without passing them through htonl().
> This is weird in my opinion.
Robin, you've totally won me over on this endian discussion. :-)
Especially the IPv6 comparison makes it clear why IPv4 should also be in network byte order.
API/ABI stability is a pain... we're stuck with host endian IPv4 addresses; e.g. for the RTE_IPV4() macro, which I now agree produces the wrong endian value (on little endian CPUs).
>
> >> Why would control plane use a different representation of addresses
> >> compared to data plane?
> >
> > Excellent question.
> > Old habit? Growing up using big endian CPUs, we have come to think of
> > IPv4 addresses as 32 bit numbers, so we keep treating them as such.
> > With this old way of thinking, the only reason to use network endian
> > in the fast path with little endian CPUs is for performance reasons
> > (to save the byte swap) - if not, we would still prefer using host
> > endian in the fast path too.
>
> I understand the implementation reasons why you would prefer working
> with host order integers. But the APIs that deal with IPv4 addresses
> should not reflect implementation details.
They were probably designed based on the same way of thinking I was used to (until you convinced me I was wrong).
>
> >> Also for consistency with IPv6, I really think
> >> that *all* addresses should be dealt in their network form.
> >
> > Food for thought!
>
> Vladimir, could we at least consider adding a real network order mode
> for the rib and fib libraries? So that we can have consistent APIs
> between IPv4 and IPv6?
And/or rename RTE_FIB_F_NETWORK_ORDER to RTE_FIB_F_NETWORK_ORDER_LOOKUP or similar. This is important if a real network order mode is added (now or later)!
>
> On that same topic, I wonder if it would make sense to change the API
> parameters to use an opaque rte_ipv4_addr_t type instead of a native
> uint32_t to avoid any confusion.
It could be considered an IPv4 address type (like the IPv6 address type) (which should be in network endian), which it is not, so I don't like this idea.
What the API really should offer is a choice (or a union) of uint32_t and rte_be32_t, but that's not possible, so also using uint32_t for big endian values seems like a viable compromise.
Another alternative, using void* for the IPv4 address array, seems overkill to me, since compilers don't warn about mixing uint32_t with rte_be32_t values (like mixing signed and unsigned emits warnings).
>
> Thanks!
* Re: rte_fib network order bug
2024-11-15 13:52 ` Morten Brørup
@ 2024-11-15 14:07 ` Bruce Richardson
2024-11-15 14:28 ` Robin Jarry
1 sibling, 0 replies; 13+ messages in thread
From: Bruce Richardson @ 2024-11-15 14:07 UTC (permalink / raw)
To: Morten Brørup; +Cc: Robin Jarry, Medvedkin, Vladimir, dev
On Fri, Nov 15, 2024 at 02:52:57PM +0100, Morten Brørup wrote:
> > From: Robin Jarry [mailto:rjarry@redhat.com]
> > Sent: Friday, 15 November 2024 14.02
> >
> > Morten Brørup, Nov 14, 2024 at 15:35:
<snip>
> >
> > On that same topic, I wonder if it would make sense to change the API
> > parameters to use an opaque rte_ipv4_addr_t type instead of a native
> > uint32_t to avoid any confusion.
>
> It could be considered an IPv4 address type (like the IPv6 address type) (which should be in network endian), which it is not, so I don't like this idea.
Can you clarify your objection to this idea? For me, the idea of having
IPv4 addresses as 4-byte arrays seems to offer a lot of advantages over
treating them as a single 32-bit value. We don't need to worry about packing
or alignment of the values, and everything would always be treated in
network byte order. The main downside I see is compatibility - we'd need a
whole new set of definitions and functions in libs to make the change.
/Bruce
* Re: rte_fib network order bug
2024-11-15 13:52 ` Morten Brørup
2024-11-15 14:07 ` Bruce Richardson
@ 2024-11-15 14:28 ` Robin Jarry
2024-11-15 16:20 ` Stephen Hemminger
1 sibling, 1 reply; 13+ messages in thread
From: Robin Jarry @ 2024-11-15 14:28 UTC (permalink / raw)
To: Morten Brørup, Medvedkin, Vladimir, dev
Morten Brørup, Nov 15, 2024 at 14:52:
> Robin, you've totally won me over on this endian discussion. :-)
> Especially the IPv6 comparison make it clear why IPv4 should also be
> network byte order.
>
> API/ABI stability is a pain... we're stuck with host endian IPv4
> addresses; e.g. for the RTE_IPV4() macro, which I now agree produces
> the wrong endian value (on little endian CPUs).
At least for 24.11 it is too late. But maybe we could make it right for
the next LTS?
>> Vladimir, could we at least consider adding a real network order mode
>> for the rib and fib libraries? So that we can have consistent APIs
>> between IPv4 and IPv6?
>
> And/or rename RTE_FIB_F_NETWORK_ORDER to
> RTE_FIB_F_NETWORK_ORDER_LOOKUP or similar. This is important if real
> network order mode is added (now or later)!
Maybe we could revert that patch and defer a complete change of the
rib/fib APIs to only expose network order addresses? It would be an ABI
breakage but if properly announced in advance, it should be possible.
Thinking about it some more. Having a flag for such a drastic change in
behaviour does not seem right.
>> On that same topic, I wonder if it would make sense to change the API
>> parameters to use an opaque rte_ipv4_addr_t type instead of a native
>> uint32_t to avoid any confusion.
>
> It could be considered an IPv4 address type (like the IPv6 address
> type) (which should be in network endian), which it is not, so I don't
> like this idea.
>
> What the API really should offer is a choice (or a union) of uint32_t
> and rte_be32_t, but that's not possible, so also using uint32_t for
> big endian values seems like a viable compromise.
>
> Another alternative, using void* for the IPv4 address array, seems
> overkill to me, since compilers don't warn about mixing uint32_t with
> rte_be32_t values (like mixing signed and unsigned emits warnings).
If what I proposed above is possible, then all these APIs could be using
rte_be32_t values (or even better, an rte_ipv4_addr_t alias for
consistency with IPv6). That would make everything much simpler.
Thoughts?
* Re: rte_fib network order bug
2024-11-15 14:28 ` Robin Jarry
@ 2024-11-15 16:20 ` Stephen Hemminger
2024-11-17 15:04 ` Vladimir Medvedkin
0 siblings, 1 reply; 13+ messages in thread
From: Stephen Hemminger @ 2024-11-15 16:20 UTC (permalink / raw)
To: Robin Jarry; +Cc: Morten Brørup, Medvedkin, Vladimir, dev
On Fri, 15 Nov 2024 15:28:33 +0100
"Robin Jarry" <rjarry@redhat.com> wrote:
> Morten Brørup, Nov 15, 2024 at 14:52:
> > Robin, you've totally won me over on this endian discussion. :-)
> > Especially the IPv6 comparison make it clear why IPv4 should also be
> > network byte order.
> >
> > API/ABI stability is a pain... we're stuck with host endian IPv4
> > addresses; e.g. for the RTE_IPV4() macro, which I now agree produces
> > the wrong endian value (on little endian CPUs).
>
> At least for 24.11 it is too late. But maybe we could make it right for
> the next LTS?
>
> >> Vladimir, could we at least consider adding a real network order mode
> >> for the rib and fib libraries? So that we can have consistent APIs
> >> between IPv4 and IPv6?
> >
> > And/or rename RTE_FIB_F_NETWORK_ORDER to
> > RTE_FIB_F_NETWORK_ORDER_LOOKUP or similar. This is important if real
> > network order mode is added (now or later)!
>
> Maybe we could revert that patch and defer a complete change of the
> rib/fib APIs to only expose network order addresses? It would be an ABI
> breakage but if properly announced in advance, it should be possible.
>
> Thinking about it some more. Having a flag for such a drastic change in
> behaviour does not seem right.
It was a mistake for DPDK to define its own data structures for IP addresses.
It would have been much better to stick with the legacy types that BSD, Linux
(and Windows) use in their APIs: 'struct in_addr' and 'struct in6_addr'.
Reinvention did not help users.
* Re: rte_fib network order bug
2024-11-15 16:20 ` Stephen Hemminger
@ 2024-11-17 15:04 ` Vladimir Medvedkin
0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Medvedkin @ 2024-11-17 15:04 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Robin Jarry, Morten Brørup, Medvedkin, Vladimir, dev
Hi all,
[Robin] > I had not understood that it was *only* the lookups that were
network order
[Morten] >When I saw the byte order flag the first time, it was not clear
to me either that it only affected lookups - I too thought it covered the
entire API of the library. This needs to be emphasized in the description
of the flag. And the flag's name should contain LOOKUP
[Morten] > And/or rename RTE_FIB_F_NETWORK_ORDER to
RTE_FIB_F_NETWORK_ORDER_LOOKUP or similar.
There is a clear comment for this flag saying that it affects lookups.
Repeating the statement with an exclamation mark seems too much. Moreover,
this flag was at first named "RTE_FIB_FLAG_LOOKUP_BE", and renaming it was
suggested here:
https://inbox.dpdk.org/dev/D4SWPKOPRD5Z.87YIET3Y4AW@redhat.com/
[Morten] >Control plane API should use CPU byte order ... adding it
(support for network byte order) to the RIB library would be nice too.
I'm not sure I understood you correctly here; RIB is a control plane
library.
[Robin] > an IPv4 address is *not* an integer. It should be treated as an
opaque value.
I don't agree here. An IPv4 address is 32 bits of information, and CPUs can
usually treat 32 bits of information as an integer, which is really useful.
[Morten] > Treating IPv4 addresses as byte arrays would allow simple
memcmp() for range comparison
How is that possible in the general case? For example, say I need to test IP
addresses against the range 1.1.1.7 - 10.20.30.37.
[Robin] >Also for consistency with IPv6, I really think that *all*
addresses should be dealt in their network form.
There is no such problem as a byte order mismatch for IPv6 addresses, since
they cannot be treated by modern CPUs as a native integer type.
[Robin] >But it (RTE_IPV4) will always generate addresses in *host order*.
Which means they cannot be used in IPv4 headers without passing them
through htonl().
RTE_IPV4 is not limited to setting IPv4 header values.
[Robin] >Maybe we could revert that patch and defer a complete change of
the rib/fib APIs to only expose network order addresses?
I don't agree with that. Don't limit yourself to just manipulating network
headers.
[Robin] >Thinking about it some more. Having a flag for such a drastic
change in behaviour does not seem right.
This flag is optional. I don't see any problems with that.
In general, here we just have different perspectives on the problem. I can
see and understand your point.
My considerations are:
- The vast majority of longest prefix match algorithms work with
addresses in host byte order (binary trees, multibit tries, DXR - with
hash-based lookup being the only exception)
- If you do a byte swap two or more times, you are probably doing
something wrong in terms of computation
So, feel free to submit patches adding this feature to the control plane
API, but let's consider:
- the default behaviour should remain the same. Why? At least because for my
use cases I'd like to have "data representation" (the byte swap) outside of
the library. Not to mention the ABI/API breakage
- IPv4 should stay as uint32_t. C doesn't know about such a thing as byte
order; it only knows about size and signedness. rte_be32_t is just a hint for
us humans :)
On Fri, 15 Nov 2024 at 17:00, Stephen Hemminger <stephen@networkplumber.org>
wrote:
> On Fri, 15 Nov 2024 15:28:33 +0100
> "Robin Jarry" <rjarry@redhat.com> wrote:
>
> > Morten Brørup, Nov 15, 2024 at 14:52:
> > > Robin, you've totally won me over on this endian discussion. :-)
> > > Especially the IPv6 comparison make it clear why IPv4 should also be
> > > network byte order.
> > >
> > > API/ABI stability is a pain... we're stuck with host endian IPv4
> > > addresses; e.g. for the RTE_IPV4() macro, which I now agree produces
> > > the wrong endian value (on little endian CPUs).
> >
> > At least for 24.11 it is too late. But maybe we could make it right for
> > the next LTS?
> >
> > >> Vladimir, could we at least consider adding a real network order mode
> > >> for the rib and fib libraries? So that we can have consistent APIs
> > >> between IPv4 and IPv6?
> > >
> > > And/or rename RTE_FIB_F_NETWORK_ORDER to
> > > RTE_FIB_F_NETWORK_ORDER_LOOKUP or similar. This is important if real
> > > network order mode is added (now or later)!
> >
> > Maybe we could revert that patch and defer a complete change of the
> > rib/fib APIs to only expose network order addresses? It would be an ABI
> > breakage but if properly announced in advance, it should be possible.
> >
> > Thinking about it some more. Having a flag for such a drastic change in
> > behaviour does not seem right.
>
> It was a mistake for DPDK to define its own data structures for IP
> addresses.
> Would have been much better to stick with the legacy what BSD, Linux (and
> Windows)
> uses in API. 'struct in_addr' and 'struct in6_addr'
>
> Reinvention did not help users.
>
--
Regards,
Vladimir
end of thread, other threads:[~2024-11-17 15:04 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-12 9:31 rte_fib network order bug Robin Jarry
2024-11-13 10:42 ` Medvedkin, Vladimir
2024-11-13 13:27 ` Robin Jarry
2024-11-13 19:39 ` Medvedkin, Vladimir
2024-11-14 7:43 ` Morten Brørup
2024-11-14 10:18 ` Robin Jarry
2024-11-14 14:35 ` Morten Brørup
2024-11-15 13:01 ` Robin Jarry
2024-11-15 13:52 ` Morten Brørup
2024-11-15 14:07 ` Bruce Richardson
2024-11-15 14:28 ` Robin Jarry
2024-11-15 16:20 ` Stephen Hemminger
2024-11-17 15:04 ` Vladimir Medvedkin