From: Bruce Richardson <bruce.richardson@intel.com>
To: Vladimir Medvedkin <medvedkinv@gmail.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 1/2] Add RIB library
Date: Mon, 26 Mar 2018 10:50:42 +0100 [thread overview]
Message-ID: <20180326095042.GA17660@bricha3-MOBL.ger.corp.intel.com> (raw)
In-Reply-To: <CANDrEHmGxP8G__WGSw0NVqFdHv3C8iv1Vt0H7n_86F8SpTMfZg@mail.gmail.com>
On Sun, Mar 25, 2018 at 09:17:20PM +0300, Vladimir Medvedkin wrote:
> Hi,
>
> 2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:
>
> > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > > RIB is an alternative to current LPM library.
> > > It solves the following problems
> > > - Increases the speed of control plane operations against lpm such as
> > > adding/deleting routes
> > > - Adds abstraction from dataplane algorithms, so it is possible to add
> > > different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
> > > in addition to current dir24_8
> > > - It is possible to keep user defined application specific additional
> > > information in struct rte_rib_node which represents route entry.
> > > It can be next hop/set of next hops (i.e. active and feasible),
> > > pointers to link rte_rib_node based on some criteria (i.e. next_hop),
> > > plenty of additional control plane information.
> > > - For dir24_8 implementation it is possible to remove
> > rte_lpm_tbl_entry.depth
> > > field that helps to save 6 bits.
> > > - Also new dir24_8 implementation supports different next_hop sizes
> > > (1/2/4/8 bytes per next hop)
> > > - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary
> > operator.
> > > Instead it returns special default value if there is no route.
> > >
> > > Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> > > ---
> > > config/common_base | 6 +
> > > doc/api/doxy-api.conf | 1 +
> > > lib/Makefile | 2 +
> > > lib/librte_rib/Makefile | 22 ++
> > > lib/librte_rib/rte_dir24_8.c | 482 ++++++++++++++++++++++++++++++
> > +++
> > > lib/librte_rib/rte_dir24_8.h | 116 ++++++++
> > > lib/librte_rib/rte_rib.c | 526 ++++++++++++++++++++++++++++++
> > +++++++
> > > lib/librte_rib/rte_rib.h | 322 +++++++++++++++++++++++
> > > lib/librte_rib/rte_rib_version.map | 18 ++
> > > mk/rte.app.mk | 1 +
> > > 10 files changed, 1496 insertions(+)
> > > create mode 100644 lib/librte_rib/Makefile
> > > create mode 100644 lib/librte_rib/rte_dir24_8.c
> > > create mode 100644 lib/librte_rib/rte_dir24_8.h
> > > create mode 100644 lib/librte_rib/rte_rib.c
> > > create mode 100644 lib/librte_rib/rte_rib.h
> > > create mode 100644 lib/librte_rib/rte_rib_version.map
> > >
> >
> > First pass review comments. For now just reviewed the main public header
> > file rte_rib.h. Later reviews will cover the other files as best I can.
> >
> > /Bruce
> >
> > <snip>
> > > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > > new file mode 100644
> > > index 0000000..6eac8fb
> > > --- /dev/null
> > > +++ b/lib/librte_rib/rte_rib.h
> > > @@ -0,0 +1,322 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > > + */
> > > +
> > > +#ifndef _RTE_RIB_H_
> > > +#define _RTE_RIB_H_
> > > +
> > > +/**
> > > + * @file
> > > + * Compressed trie implementation for Longest Prefix Match
> > > + */
> > > +
> > > +/** @internal Macro to enable/disable run-time checks. */
> > > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do { \
> > > + if (cond) \
> > > + return retval; \
> > > +} while (0)
> > > +#else
> > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > > +#endif
> >
> > use RTE_ASSERT?
> >
> it was done just like it was done in the LPM lib. But if you think it
> should be RTE_ASSERT so be it.
>
>
> >
> > > +
> > > +#define RTE_RIB_VALID_NODE 1
> >
> > should there be an INVALID_NODE macro?
> >
> No
>
>
> >
> > > +#define RTE_RIB_GET_NXT_ALL 0
> > > +#define RTE_RIB_GET_NXT_COVER 1
> > > +
> > > +#define RTE_RIB_INVALID_ROUTE 0
> > > +#define RTE_RIB_VALID_ROUTE 1
> > > +
> > > +/** Max number of characters in RIB name. */
> > > +#define RTE_RIB_NAMESIZE 64
> > > +
> > > +/** Maximum depth value possible for IPv4 RIB. */
> > > +#define RTE_RIB_MAXDEPTH 32
> >
> > I think we should have IPv4 in the name here. Will it not be extended to
> > support IPv6 in future?
> >
> I think there should be a separate implementation of the library for ipv6
>
I can understand the need for a separate LPM implementation, but should
they both not be under the same rib library?
>
> >
> > > +
> > > +/**
> > > + * Macro to check if prefix1 {key1/depth1}
> > > + * is covered by prefix2 {key2/depth2}
> > > + */
> > > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
> > \
> > > + ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
> > > + && (depth1 > depth2))
> > Neat check!
> >
> > Any particular reason for using UINT64_MAX here rather than UINT32_MAX?
>
> in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
> UINT32_MAX because shift count will be masked to 5 bits.
>
> I think you can avoid the casting and have a slightly shorter mask by
> > changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> > "~(UINT32_MAX >> depth2)"
> > I'd also suggest for readability putting the second check first, and,
> > for maintainability, using an inline function rather than a macro.
> >
> Agree, it looks clearer
>
>
> > > +
> > > +/** @internal Macro to get next node in tree*/
> > > +#define RTE_RIB_GET_NXT_NODE(node, key)
> > \
> > > + ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > > +/** @internal Macro to check if node is right child*/
> > > +#define RTE_RIB_IS_RIGHT_NODE(node) (node->parent->right == node)
> >
> > Again, consider inline fns rather than macros.
> >
> Ok
>
> For the latter macro, rather than doing additional pointer derefs to
> > parent, can you also get if it's a right node by using:
> > "(node->key & (1 << (32 - node->depth)))"?
> >
> No, it is not possible. Decision whether node be left or right is made
> using parent and child common depth.
> Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common depth
> will be /8 and 10.128.0.0/24 will be right child.
>
>
> > > +
> > > +
> > > +struct rte_rib_node {
> > > + struct rte_rib_node *left;
> > > + struct rte_rib_node *right;
> > > + struct rte_rib_node *parent;
> > > + uint32_t key;
> > > + uint8_t depth;
> > > + uint8_t flag;
> > > + uint64_t nh;
> > > + uint64_t ext[0];
> > > +};
> > > +
> > > +struct rte_rib;
> > > +
> > > +/** Type of FIB struct*/
> > > +enum rte_rib_type {
> > > + RTE_RIB_DIR24_8_1B,
> > > + RTE_RIB_DIR24_8_2B,
> > > + RTE_RIB_DIR24_8_4B,
> > > + RTE_RIB_DIR24_8_8B,
> > > + RTE_RIB_TYPE_MAX
> > > +};
> >
> > If the plan is to support multiple underlying fib types and algorithms
> > under the rib library, would it not be better to separate out the
> > algorithm part from the data storage part? So have the type just be
> > DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
> >
> Yes, we were talk about it in IRC, agree. Now I pass next hop size in
> union rte_rib_fib_conf inside rte_rib_conf
>
>
> >
> > > +
> > > +enum rte_rib_op {
> > > + RTE_RIB_ADD,
> > > + RTE_RIB_DEL
> > > +};
> > > +
> > > +/** RIB nodes allocation type */
> > > +enum rte_rib_alloc_type {
> > > + RTE_RIB_MALLOC,
> > > + RTE_RIB_MEMPOOL,
> > > + RTE_RIB_ALLOC_MAX
> > > +};
> >
> > Not sure you need this any more. Malloc allocations and mempool
> > allocations are now pretty much the same thing.
> >
> Actually I think to remove malloc. On performance tests with
> adding/deleting huge amount of routes malloc is slower. Maybe because of
> fragmentation.
> What do you think?
>
Yes, definitely mempool allocations are the way to go!
>
> > > +
> > > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
> > > + uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> >
> > Do you anticipate more ops in future than just add and delete? If not,
> > why not just split this function into two and drop the op struct.
> >
> It is difficult question. I'm not ready to make decision at the moment.
>
>
> >
> > > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
> > > + uint64_t *next_hops, const unsigned n);
> > > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib
> > *rib);
> > > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > > + struct rte_rib_node *node);
> > > +
> > > +struct rte_rib {
> > > + char name[RTE_RIB_NAMESIZE];
> > > + /*pointer to rib trie*/
> > > + struct rte_rib_node *trie;
> > > + /*pointer to dataplane struct*/
> > > + void *fib;
> > > + /*prefix modification*/
> > > + rte_rib_modify_fn_t modify;
> > > + /* Bulk lookup fn*/
> > > + rte_rib_tree_lookup_fn_t lookup;
> > > + /*alloc trie element*/
> > > + rte_rib_alloc_node_fn_t alloc_node;
> > > + /*free trie element*/
> > > + rte_rib_free_node_fn_t free_node;
> > > + struct rte_mempool *node_pool;
> > > + uint32_t cur_nodes;
> > > + uint32_t cur_routes;
> > > + int max_nodes;
> > > + int node_sz;
> > > + enum rte_rib_type type;
> > > + enum rte_rib_alloc_type alloc_type;
> > > +};
> > > +
> > > +/** RIB configuration structure */
> > > +struct rte_rib_conf {
> > > + enum rte_rib_type type;
> > > + enum rte_rib_alloc_type alloc_type;
> > > + int max_nodes;
> > > + size_t node_sz;
> > > + uint64_t def_nh;
> > > +};
> > > +
> > > +/**
> > > + * Lookup an IP into the RIB structure
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param key
> > > + * IP to be looked up in the RIB
> > > + * @return
> > > + * pointer to struct rte_rib_node on success,
> > > + * NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > > +
> > > +/**
> > > + * Lookup less specific route into the RIB structure
> > > + *
> > > + * @param ent
> > > + * Pointer to struct rte_rib_node that represents target route
> > > + * @return
> > > + * pointer to struct rte_rib_node that represents
> > > + * less specific route on success,
> > > + * NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > > +
> > > +/**
> > > + * Lookup prefix into the RIB structure
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param key
> > > + * net to be looked up in the RIB
> > > + * @param depth
> > > + * prefix length
> > > + * @return
> > > + * pointer to struct rte_rib_node on success,
> > > + * NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t
> > depth);
> >
> > Can you explain the difference between this and regular lookup, and how
> > they would be used. I don't think the names convey the differences
> > sufficiently, and so we should look to rename one or both to be clearer.
> >
> Regular lookup (rte_rib_tree_lookup) will lookup for most specific node for
> passed key.
> rte_rib_tree_lookup_exact will lookup node contained key and depth equal to
> passed in args. It used to find exact route.
>
So if there is no node exactly matching the parameters, it the lookup_exact
returns failure? E.g. if you request a /24 node, it won't return a /8 node
that would cover the /24?
>
> >
> > > +
> > > +/**
> > > + * Retrieve next more specific prefix from the RIB
> > s/more/most/
> >
>
> > > + * that is covered by key/depth supernet
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param key
> > > + * net address of supernet prefix that covers returned more specific
> > prefixes
> > > + * @param depth
> > > + * supernet prefix length
> > > + * @param cur
> > > + * pointer to the last returned prefix to get next prefix
> > > + * or
> > > + * NULL to get first more specific prefix
> > > + * @param flag
> > > + * -RTE_RIB_GET_NXT_ALL
> > > + * get all prefixes from subtrie
> >
> > By all prefixes do you mean more specific, i.e. the final prefix?
> >
> What do you mean the final prefix?
>
The most specific one, or the longest prefix.
>
> > > + * -RTE_RIB_GET_NXT_COVER
> > > + * get only first more specific prefix even if it have more specifics
> > > + * @return
> > > + * pointer to the next more specific prefix
> > > + * or
> > > + * NULL if there is no prefixes left
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
> > > + struct rte_rib_node *cur, int flag);
> > > +
> > > +/**
> > > + * Remove prefix from the RIB
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param key
> > > + * net to be removed from the RIB
> > > + * @param depth
> > > + * prefix length
> > > + */
> > > +void
> > > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > > +
> > > +/**
> > > + * Insert prefix into the RIB
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param key
> > > + * net to be inserted to the RIB
> > > + * @param depth
> > > + * prefix length
> > > + * @return
> > > + * pointer to new rte_rib_node on success
> > > + * NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > > +
> > > +/**
> > > + * Create RIB
> > > + *
> > > + * @param name
> > > + * RIB name
> > > + * @param socket_id
> > > + * NUMA socket ID for RIB table memory allocation
> > > + * @param conf
> > > + * Structure containing the configuration
> > > + * @return
> > > + * Handle to RIB object on success
> > > + * NULL otherwise with rte_errno set to an appropriate values.
> > > + */
> > > +struct rte_rib *
> > > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> > *conf);
> > > +
> > > +/**
> > > + * Find an existing RIB object and return a pointer to it.
> > > + *
> > > + * @param name
> > > + * Name of the rib object as passed to rte_rib_create()
> > > + * @return
> > > + * Pointer to rib object or NULL if object not found with rte_errno
> > > + * set appropriately. Possible rte_errno values include:
> > > + * - ENOENT - required entry not available to return.
> > > + */
> > > +struct rte_rib *
> > > +rte_rib_find_existing(const char *name);
> > > +
> > > +/**
> > > + * Free an RIB object.
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @return
> > > + * None
> > > + */
> > > +void
> > > +rte_rib_free(struct rte_rib *rib);
> > > +
> > > +/**
> > > + * Add a rule to the RIB.
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param ip
> > > + * IP of the rule to be added to the RIB
> > > + * @param depth
> > > + * Depth of the rule to be added to the RIB
> > > + * @param next_hop
> > > + * Next hop of the rule to be added to the RIB
> > > + * @return
> > > + * 0 on success, negative value otherwise
> > > + */
> > > +int
> > > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t
> > next_hop);
> > > +
> > > +/**
> > > + * Delete a rule from the RIB.
> > > + *
> > > + * @param rib
> > > + * RIB object handle
> > > + * @param ip
> > > + * IP of the rule to be deleted from the RIB
> > > + * @param depth
> > > + * Depth of the rule to be deleted from the RIB
> > > + * @return
> > > + * 0 on success, negative value otherwise
> > > + */
> > > +int
> > > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > > +
> > > +/**
> > > + * Lookup multiple IP addresses in an FIB. This may be implemented as a
> > > + * macro, so the address of the function should not be used.
> > > + *
> > > + * @param RIB
> > > + * RIB object handle
> > > + * @param ips
> > > + * Array of IPs to be looked up in the FIB
> > > + * @param next_hops
> > > + * Next hop of the most specific rule found for IP.
> > > + * This is an array of eight byte values.
> > > + * If the lookup for the given IP failed, then corresponding element
> > would
> > > + * contain default value, see description of then next parameter.
> > > + * @param n
> > > + * Number of elements in ips (and next_hops) array to lookup. This
> > should be a
> > > + * compile time constant, and divisible by 8 for best performance.
> > > + * @param defv
> > > + * Default value to populate into corresponding element of hop[]
> > array,
> > > + * if lookup would fail.
> > > + * @return
> > > + * -EINVAL for incorrect arguments, otherwise 0
> > > + */
> > > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n) \
> > > + rib->lookup(rib->fib, ips, next_hops, n)
> >
> > My main thought here is whether this needs to be a function at all?
> > Given that it takes a full burst of addresses in a single go, how much
> > performance would actually be lost by making this a regular function in
> > the C file?
> > IF we do convert this to a regular function, then a lot of the structure
> > definitions above - most importantly, the rib structure itself - can
> > probably be moved to a private header file and not exposed to
> > applications at all. This will make ABI compatibility a *lot* easier, as
> > the structures can be changed without affecting the public ABI.
> >
> I didn't quite understand what you mean.
>
Sorry, by "needs to be a function" in first line read "needs to be a
macro". Basically, the point is to not inline anything that doesn't need
it. If a function works on a burst of packets, it probably will be fine
being a regular function than a macro or inline function.
/Bruce
next prev parent reply other threads:[~2018-03-26 9:50 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-21 21:44 [dpdk-dev] [PATCH v2 0/2] lib/rib: Add Routing Information Base library Medvedkin Vladimir
2018-02-21 21:44 ` [dpdk-dev] [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
2018-03-14 11:09 ` Bruce Richardson
2018-03-14 12:05 ` Richardson, Bruce
2018-03-25 18:17 ` Vladimir Medvedkin
2018-03-26 9:50 ` Bruce Richardson [this message]
2018-03-29 19:59 ` Vladimir Medvedkin
2018-03-29 10:27 ` Bruce Richardson
2018-03-29 20:11 ` Vladimir Medvedkin
2018-03-29 20:41 ` Bruce Richardson
2018-02-21 21:44 ` [dpdk-dev] [PATCH v2 2/2] Add autotests for " Medvedkin Vladimir
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180326095042.GA17660@bricha3-MOBL.ger.corp.intel.com \
--to=bruce.richardson@intel.com \
--cc=dev@dpdk.org \
--cc=medvedkinv@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).