From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by dpdk.org (Postfix) with ESMTP id 2542B2C58 for ; Mon, 26 Mar 2018 11:50:47 +0200 (CEST) X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Mar 2018 02:50:45 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.48,364,1517904000"; d="scan'208";a="214691845" Received: from bricha3-mobl.ger.corp.intel.com ([10.237.221.52]) by fmsmga006.fm.intel.com with SMTP; 26 Mar 2018 02:50:43 -0700 Received: by (sSMTP sendmail emulation); Mon, 26 Mar 2018 10:50:42 +0100 Date: Mon, 26 Mar 2018 10:50:42 +0100 From: Bruce Richardson To: Vladimir Medvedkin Cc: dev@dpdk.org Message-ID: <20180326095042.GA17660@bricha3-MOBL.ger.corp.intel.com> References: <1519249495-16594-1-git-send-email-medvedkinv@gmail.com> <1519249495-16594-2-git-send-email-medvedkinv@gmail.com> <20180314110918.GA11720@bricha3-MOBL.ger.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Organization: Intel Research and Development Ireland Ltd. User-Agent: Mutt/1.9.4 (2018-02-28) Subject: Re: [dpdk-dev] [PATCH v2 1/2] Add RIB library X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Mar 2018 09:50:48 -0000 On Sun, Mar 25, 2018 at 09:17:20PM +0300, Vladimir Medvedkin wrote: > Hi, > > 2018-03-14 14:09 GMT+03:00 Bruce Richardson : > > > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote: > > > RIB is an alternative to current LPM library. > > > It solves the following problems > > > - Increases the speed of control plane operations against lpm such as > > > adding/deleting routes > > > - Adds abstraction from dataplane algorithms, so it is possible to add > > > different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc > > > in addition to current dir24_8 > > > - It is possible to keep user defined application specific additional > > > information in struct rte_rib_node which represents route entry. > > > It can be next hop/set of next hops (i.e. active and feasible), > > > pointers to link rte_rib_node based on some criteria (i.e. next_hop), > > > plenty of additional control plane information. > > > - For dir24_8 implementation it is possible to remove > > rte_lpm_tbl_entry.depth > > > field that helps to save 6 bits. > > > - Also new dir24_8 implementation supports different next_hop sizes > > > (1/2/4/8 bytes per next hop) > > > - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary > > operator. > > > Instead it returns special default value if there is no route. > > > > > > Signed-off-by: Medvedkin Vladimir > > > --- > > > config/common_base | 6 + > > > doc/api/doxy-api.conf | 1 + > > > lib/Makefile | 2 + > > > lib/librte_rib/Makefile | 22 ++ > > > lib/librte_rib/rte_dir24_8.c | 482 ++++++++++++++++++++++++++++++ > > +++ > > > lib/librte_rib/rte_dir24_8.h | 116 ++++++++ > > > lib/librte_rib/rte_rib.c | 526 ++++++++++++++++++++++++++++++ > > +++++++ > > > lib/librte_rib/rte_rib.h | 322 +++++++++++++++++++++++ > > > lib/librte_rib/rte_rib_version.map | 18 ++ > > > mk/rte.app.mk | 1 + > > > 10 files changed, 1496 insertions(+) > > > create mode 100644 lib/librte_rib/Makefile > > > create mode 100644 lib/librte_rib/rte_dir24_8.c > > > create mode 100644 lib/librte_rib/rte_dir24_8.h > > > create mode 100644 lib/librte_rib/rte_rib.c > > > create mode 100644 lib/librte_rib/rte_rib.h > > > create mode 100644 lib/librte_rib/rte_rib_version.map > > > > > > > First pass review comments. For now just reviewed the main public header > > file rte_rib.h. Later reviews will cover the other files as best I can. > > > > /Bruce > > > > > > > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h > > > new file mode 100644 > > > index 0000000..6eac8fb > > > --- /dev/null > > > +++ b/lib/librte_rib/rte_rib.h > > > @@ -0,0 +1,322 @@ > > > +/* SPDX-License-Identifier: BSD-3-Clause > > > + * Copyright(c) 2018 Vladimir Medvedkin > > > + */ > > > + > > > +#ifndef _RTE_RIB_H_ > > > +#define _RTE_RIB_H_ > > > + > > > +/** > > > + * @file > > > + * Compressed trie implementation for Longest Prefix Match > > > + */ > > > + > > > +/** @internal Macro to enable/disable run-time checks. */ > > > +#if defined(RTE_LIBRTE_RIB_DEBUG) > > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do { \ > > > + if (cond) \ > > > + return retval; \ > > > +} while (0) > > > +#else > > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) > > > +#endif > > > > use RTE_ASSERT? > > > it was done just like it was done in the LPM lib. But if you think it > should be RTE_ASSERT so be it. > > > > > > > + > > > +#define RTE_RIB_VALID_NODE 1 > > > > should there be an INVALID_NODE macro? > > > No > > > > > > > +#define RTE_RIB_GET_NXT_ALL 0 > > > +#define RTE_RIB_GET_NXT_COVER 1 > > > + > > > +#define RTE_RIB_INVALID_ROUTE 0 > > > +#define RTE_RIB_VALID_ROUTE 1 > > > + > > > +/** Max number of characters in RIB name. */ > > > +#define RTE_RIB_NAMESIZE 64 > > > + > > > +/** Maximum depth value possible for IPv4 RIB. */ > > > +#define RTE_RIB_MAXDEPTH 32 > > > > I think we should have IPv4 in the name here. Will it not be extended to > > support IPv6 in future? > > > I think there should be a separate implementation of the library for ipv6 > I can understand the need for a separate LPM implementation, but should they both not be under the same rib library? > > > > > > + > > > +/** > > > + * Macro to check if prefix1 {key1/depth1} > > > + * is covered by prefix2 {key2/depth2} > > > + */ > > > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2) > > \ > > > + ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\ > > > + && (depth1 > depth2)) > > Neat check! > > > > Any particular reason for using UINT64_MAX here rather than UINT32_MAX? > > in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain > UINT32_MAX because shift count will be masked to 5 bits. > > I think you can avoid the casting and have a slightly shorter mask by > > changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to > > "~(UINT32_MAX >> depth2)" > > I'd also suggest for readability putting the second check first, and, > > for maintainability, using an inline function rather than a macro. > > > Agree, it looks clearer > > > > > + > > > +/** @internal Macro to get next node in tree*/ > > > +#define RTE_RIB_GET_NXT_NODE(node, key) > > \ > > > + ((key & (1 << (31 - node->depth))) ? node->right : node->left) > > > +/** @internal Macro to check if node is right child*/ > > > +#define RTE_RIB_IS_RIGHT_NODE(node) (node->parent->right == node) > > > > Again, consider inline fns rather than macros. > > > Ok > > For the latter macro, rather than doing additional pointer derefs to > > parent, can you also get if it's a right node by using: > > "(node->key & (1 << (32 - node->depth)))"? > > > No, it is not possible. Decision whether node be left or right is made > using parent and child common depth. > Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common depth > will be /8 and 10.128.0.0/24 will be right child. > > > > > + > > > + > > > +struct rte_rib_node { > > > + struct rte_rib_node *left; > > > + struct rte_rib_node *right; > > > + struct rte_rib_node *parent; > > > + uint32_t key; > > > + uint8_t depth; > > > + uint8_t flag; > > > + uint64_t nh; > > > + uint64_t ext[0]; > > > +}; > > > + > > > +struct rte_rib; > > > + > > > +/** Type of FIB struct*/ > > > +enum rte_rib_type { > > > + RTE_RIB_DIR24_8_1B, > > > + RTE_RIB_DIR24_8_2B, > > > + RTE_RIB_DIR24_8_4B, > > > + RTE_RIB_DIR24_8_8B, > > > + RTE_RIB_TYPE_MAX > > > +}; > > > > If the plan is to support multiple underlying fib types and algorithms > > under the rib library, would it not be better to separate out the > > algorithm part from the data storage part? So have the type just be > > DIR_24_8, and have the 1, 2, 4 or 8 specified separately. > > > Yes, we were talk about it in IRC, agree. Now I pass next hop size in > union rte_rib_fib_conf inside rte_rib_conf > > > > > > > + > > > +enum rte_rib_op { > > > + RTE_RIB_ADD, > > > + RTE_RIB_DEL > > > +}; > > > + > > > +/** RIB nodes allocation type */ > > > +enum rte_rib_alloc_type { > > > + RTE_RIB_MALLOC, > > > + RTE_RIB_MEMPOOL, > > > + RTE_RIB_ALLOC_MAX > > > +}; > > > > Not sure you need this any more. Malloc allocations and mempool > > allocations are now pretty much the same thing. > > > Actually I think to remove malloc. On performance tests with > adding/deleting huge amount of routes malloc is slower. Maybe because of > fragmentation. > What do you think? > Yes, definitely mempool allocations are the way to go! > > > > + > > > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key, > > > + uint8_t depth, uint64_t next_hop, enum rte_rib_op op); > > > > Do you anticipate more ops in future than just add and delete? If not, > > why not just split this function into two and drop the op struct. > > > It is difficult question. I'm not ready to make decision at the moment. > > > > > > > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips, > > > + uint64_t *next_hops, const unsigned n); > > > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib > > *rib); > > > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib, > > > + struct rte_rib_node *node); > > > + > > > +struct rte_rib { > > > + char name[RTE_RIB_NAMESIZE]; > > > + /*pointer to rib trie*/ > > > + struct rte_rib_node *trie; > > > + /*pointer to dataplane struct*/ > > > + void *fib; > > > + /*prefix modification*/ > > > + rte_rib_modify_fn_t modify; > > > + /* Bulk lookup fn*/ > > > + rte_rib_tree_lookup_fn_t lookup; > > > + /*alloc trie element*/ > > > + rte_rib_alloc_node_fn_t alloc_node; > > > + /*free trie element*/ > > > + rte_rib_free_node_fn_t free_node; > > > + struct rte_mempool *node_pool; > > > + uint32_t cur_nodes; > > > + uint32_t cur_routes; > > > + int max_nodes; > > > + int node_sz; > > > + enum rte_rib_type type; > > > + enum rte_rib_alloc_type alloc_type; > > > +}; > > > + > > > +/** RIB configuration structure */ > > > +struct rte_rib_conf { > > > + enum rte_rib_type type; > > > + enum rte_rib_alloc_type alloc_type; > > > + int max_nodes; > > > + size_t node_sz; > > > + uint64_t def_nh; > > > +}; > > > + > > > +/** > > > + * Lookup an IP into the RIB structure > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param key > > > + * IP to be looked up in the RIB > > > + * @return > > > + * pointer to struct rte_rib_node on success, > > > + * NULL otherwise > > > + */ > > > +struct rte_rib_node * > > > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key); > > > + > > > +/** > > > + * Lookup less specific route into the RIB structure > > > + * > > > + * @param ent > > > + * Pointer to struct rte_rib_node that represents target route > > > + * @return > > > + * pointer to struct rte_rib_node that represents > > > + * less specific route on success, > > > + * NULL otherwise > > > + */ > > > +struct rte_rib_node * > > > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent); > > > + > > > +/** > > > + * Lookup prefix into the RIB structure > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param key > > > + * net to be looked up in the RIB > > > + * @param depth > > > + * prefix length > > > + * @return > > > + * pointer to struct rte_rib_node on success, > > > + * NULL otherwise > > > + */ > > > +struct rte_rib_node * > > > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t > > depth); > > > > Can you explain the difference between this and regular lookup, and how > > they would be used. I don't think the names convey the differences > > sufficiently, and so we should look to rename one or both to be clearer. > > > Regular lookup (rte_rib_tree_lookup) will lookup for most specific node for > passed key. > rte_rib_tree_lookup_exact will lookup node contained key and depth equal to > passed in args. It used to find exact route. > So if there is no node exactly matching the parameters, it the lookup_exact returns failure? E.g. if you request a /24 node, it won't return a /8 node that would cover the /24? > > > > > > + > > > +/** > > > + * Retrieve next more specific prefix from the RIB > > s/more/most/ > > > > > > + * that is covered by key/depth supernet > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param key > > > + * net address of supernet prefix that covers returned more specific > > prefixes > > > + * @param depth > > > + * supernet prefix length > > > + * @param cur > > > + * pointer to the last returned prefix to get next prefix > > > + * or > > > + * NULL to get first more specific prefix > > > + * @param flag > > > + * -RTE_RIB_GET_NXT_ALL > > > + * get all prefixes from subtrie > > > > By all prefixes do you mean more specific, i.e. the final prefix? > > > What do you mean the final prefix? > The most specific one, or the longest prefix. > > > > + * -RTE_RIB_GET_NXT_COVER > > > + * get only first more specific prefix even if it have more specifics > > > + * @return > > > + * pointer to the next more specific prefix > > > + * or > > > + * NULL if there is no prefixes left > > > + */ > > > +struct rte_rib_node * > > > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth, > > > + struct rte_rib_node *cur, int flag); > > > + > > > +/** > > > + * Remove prefix from the RIB > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param key > > > + * net to be removed from the RIB > > > + * @param depth > > > + * prefix length > > > + */ > > > +void > > > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth); > > > + > > > +/** > > > + * Insert prefix into the RIB > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param key > > > + * net to be inserted to the RIB > > > + * @param depth > > > + * prefix length > > > + * @return > > > + * pointer to new rte_rib_node on success > > > + * NULL otherwise > > > + */ > > > +struct rte_rib_node * > > > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth); > > > + > > > +/** > > > + * Create RIB > > > + * > > > + * @param name > > > + * RIB name > > > + * @param socket_id > > > + * NUMA socket ID for RIB table memory allocation > > > + * @param conf > > > + * Structure containing the configuration > > > + * @return > > > + * Handle to RIB object on success > > > + * NULL otherwise with rte_errno set to an appropriate values. > > > + */ > > > +struct rte_rib * > > > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf > > *conf); > > > + > > > +/** > > > + * Find an existing RIB object and return a pointer to it. > > > + * > > > + * @param name > > > + * Name of the rib object as passed to rte_rib_create() > > > + * @return > > > + * Pointer to rib object or NULL if object not found with rte_errno > > > + * set appropriately. Possible rte_errno values include: > > > + * - ENOENT - required entry not available to return. > > > + */ > > > +struct rte_rib * > > > +rte_rib_find_existing(const char *name); > > > + > > > +/** > > > + * Free an RIB object. > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @return > > > + * None > > > + */ > > > +void > > > +rte_rib_free(struct rte_rib *rib); > > > + > > > +/** > > > + * Add a rule to the RIB. > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param ip > > > + * IP of the rule to be added to the RIB > > > + * @param depth > > > + * Depth of the rule to be added to the RIB > > > + * @param next_hop > > > + * Next hop of the rule to be added to the RIB > > > + * @return > > > + * 0 on success, negative value otherwise > > > + */ > > > +int > > > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t > > next_hop); > > > + > > > +/** > > > + * Delete a rule from the RIB. > > > + * > > > + * @param rib > > > + * RIB object handle > > > + * @param ip > > > + * IP of the rule to be deleted from the RIB > > > + * @param depth > > > + * Depth of the rule to be deleted from the RIB > > > + * @return > > > + * 0 on success, negative value otherwise > > > + */ > > > +int > > > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth); > > > + > > > +/** > > > + * Lookup multiple IP addresses in an FIB. This may be implemented as a > > > + * macro, so the address of the function should not be used. > > > + * > > > + * @param RIB > > > + * RIB object handle > > > + * @param ips > > > + * Array of IPs to be looked up in the FIB > > > + * @param next_hops > > > + * Next hop of the most specific rule found for IP. > > > + * This is an array of eight byte values. > > > + * If the lookup for the given IP failed, then corresponding element > > would > > > + * contain default value, see description of then next parameter. > > > + * @param n > > > + * Number of elements in ips (and next_hops) array to lookup. This > > should be a > > > + * compile time constant, and divisible by 8 for best performance. > > > + * @param defv > > > + * Default value to populate into corresponding element of hop[] > > array, > > > + * if lookup would fail. > > > + * @return > > > + * -EINVAL for incorrect arguments, otherwise 0 > > > + */ > > > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n) \ > > > + rib->lookup(rib->fib, ips, next_hops, n) > > > > My main thought here is whether this needs to be a function at all? > > Given that it takes a full burst of addresses in a single go, how much > > performance would actually be lost by making this a regular function in > > the C file? > > IF we do convert this to a regular function, then a lot of the structure > > definitions above - most importantly, the rib structure itself - can > > probably be moved to a private header file and not exposed to > > applications at all. This will make ABI compatibility a *lot* easier, as > > the structures can be changed without affecting the public ABI. > > > I didn't quite understand what you mean. > Sorry, by "needs to be a function" in first line read "needs to be a macro". Basically, the point is to not inline anything that doesn't need it. If a function works on a burst of packets, it probably will be fine being a regular function than a macro or inline function. /Bruce