Re: [dpdk-dev] [PATCH v2 1/2] Add RIB library

DPDK patches and discussions
 help / color / mirror / Atom feed

From: Bruce Richardson <bruce.richardson@intel.com>
To: Vladimir Medvedkin <medvedkinv@gmail.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 1/2] Add RIB library
Date: Mon, 26 Mar 2018 10:50:42 +0100	[thread overview]
Message-ID: <20180326095042.GA17660@bricha3-MOBL.ger.corp.intel.com> (raw)
In-Reply-To: <CANDrEHmGxP8G__WGSw0NVqFdHv3C8iv1Vt0H7n_86F8SpTMfZg@mail.gmail.com>

On Sun, Mar 25, 2018 at 09:17:20PM +0300, Vladimir Medvedkin wrote:
> Hi,
> 
> 2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:
> 
> > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > > RIB is an alternative to current LPM library.
> > > It solves the following problems
> > >  - Increases the speed of control plane operations against lpm such as
> > >    adding/deleting routes
> > >  - Adds abstraction from dataplane algorithms, so it is possible to add
> > >    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
> > >    in addition to current dir24_8
> > >  - It is possible to keep user defined application specific additional
> > >    information in struct rte_rib_node which represents route entry.
> > >    It can be next hop/set of next hops (i.e. active and feasible),
> > >    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
> > >    plenty of additional control plane information.
> > >  - For dir24_8 implementation it is possible to remove
> > rte_lpm_tbl_entry.depth
> > >    field that helps to save 6 bits.
> > >  - Also new dir24_8 implementation supports different next_hop sizes
> > >    (1/2/4/8 bytes per next hop)
> > >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary
> > operator.
> > >    Instead it returns special default value if there is no route.
> > >
> > > Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> > > ---
> > >  config/common_base                 |   6 +
> > >  doc/api/doxy-api.conf              |   1 +
> > >  lib/Makefile                       |   2 +
> > >  lib/librte_rib/Makefile            |  22 ++
> > >  lib/librte_rib/rte_dir24_8.c       | 482 ++++++++++++++++++++++++++++++
> > +++
> > >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> > >  lib/librte_rib/rte_rib.c           | 526 ++++++++++++++++++++++++++++++
> > +++++++
> > >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> > >  lib/librte_rib/rte_rib_version.map |  18 ++
> > >  mk/rte.app.mk                      |   1 +
> > >  10 files changed, 1496 insertions(+)
> > >  create mode 100644 lib/librte_rib/Makefile
> > >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> > >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> > >  create mode 100644 lib/librte_rib/rte_rib.c
> > >  create mode 100644 lib/librte_rib/rte_rib.h
> > >  create mode 100644 lib/librte_rib/rte_rib_version.map
> > >
> >
> > First pass review comments. For now just reviewed the main public header
> > file rte_rib.h. Later reviews will cover the other files as best I can.
> >
> > /Bruce
> >
> > <snip>
> > > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > > new file mode 100644
> > > index 0000000..6eac8fb
> > > --- /dev/null
> > > +++ b/lib/librte_rib/rte_rib.h
> > > @@ -0,0 +1,322 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > > + */
> > > +
> > > +#ifndef _RTE_RIB_H_
> > > +#define _RTE_RIB_H_
> > > +
> > > +/**
> > > + * @file
> > > + * Compressed trie implementation for Longest Prefix Match
> > > + */
> > > +
> > > +/** @internal Macro to enable/disable run-time checks. */
> > > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {    \
> > > +     if (cond)                                       \
> > > +             return retval;                          \
> > > +} while (0)
> > > +#else
> > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > > +#endif
> >
> > use RTE_ASSERT?
> >
> it was done just like it was done in the LPM lib. But if you think it
> should be RTE_ASSERT so be it.
> 
> 
> >
> > > +
> > > +#define RTE_RIB_VALID_NODE   1
> >
> > should there be an INVALID_NODE macro?
> >
> No
> 
> 
> >
> > > +#define RTE_RIB_GET_NXT_ALL  0
> > > +#define RTE_RIB_GET_NXT_COVER        1
> > > +
> > > +#define RTE_RIB_INVALID_ROUTE        0
> > > +#define RTE_RIB_VALID_ROUTE  1
> > > +
> > > +/** Max number of characters in RIB name. */
> > > +#define RTE_RIB_NAMESIZE     64
> > > +
> > > +/** Maximum depth value possible for IPv4 RIB. */
> > > +#define RTE_RIB_MAXDEPTH     32
> >
> > I think we should have IPv4 in the name here. Will it not be extended to
> > support IPv6 in future?
> >
> I think there should be a separate implementation of the library for ipv6
> 
I can understand the need for a separate LPM implementation, but should
they both not be under the same rib library?

> 
> >
> > > +
> > > +/**
> > > + * Macro to check if prefix1 {key1/depth1}
> > > + * is covered by prefix2 {key2/depth2}
> > > + */
> > > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
> >      \
> > > +     ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
> > > +             && (depth1 > depth2))
> > Neat check!
> >
> > Any particular reason for using UINT64_MAX here rather than UINT32_MAX?
> 
> in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
> UINT32_MAX because shift count will be masked to 5 bits.
> 
> I think you can avoid the casting and have a slightly shorter mask by
> > changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> > "~(UINT32_MAX >> depth2)"
> > I'd also suggest for readability putting the second check first, and,
> > for maintainability, using an inline function rather than a macro.
> >
>  Agree, it looks clearer
> 
> 
> > > +
> > > +/** @internal Macro to get next node in tree*/
> > > +#define RTE_RIB_GET_NXT_NODE(node, key)
> >       \
> > > +     ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > > +/** @internal Macro to check if node is right child*/
> > > +#define RTE_RIB_IS_RIGHT_NODE(node)  (node->parent->right == node)
> >
> > Again, consider inline fns rather than macros.
> >
> Ok
> 
> For the latter macro, rather than doing additional pointer derefs to
> > parent, can you also get if it's a right node by using:
> > "(node->key & (1 << (32 - node->depth)))"?
> >
> No, it is not possible. Decision whether node be left or right is made
> using parent and child common depth.
> Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common depth
> will be /8 and 10.128.0.0/24 will be right child.
> 
> 
> > > +
> > > +
> > > +struct rte_rib_node {
> > > +     struct rte_rib_node *left;
> > > +     struct rte_rib_node *right;
> > > +     struct rte_rib_node *parent;
> > > +     uint32_t        key;
> > > +     uint8_t         depth;
> > > +     uint8_t         flag;
> > > +     uint64_t        nh;
> > > +     uint64_t        ext[0];
> > > +};
> > > +
> > > +struct rte_rib;
> > > +
> > > +/** Type of FIB struct*/
> > > +enum rte_rib_type {
> > > +     RTE_RIB_DIR24_8_1B,
> > > +     RTE_RIB_DIR24_8_2B,
> > > +     RTE_RIB_DIR24_8_4B,
> > > +     RTE_RIB_DIR24_8_8B,
> > > +     RTE_RIB_TYPE_MAX
> > > +};
> >
> > If the plan is to support multiple underlying fib types and algorithms
> > under the rib library, would it not be better to separate out the
> > algorithm part from the data storage part? So have the type just be
> > DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
> >
> Yes, we were talk about it in IRC, agree. Now I pass next hop size in
> union rte_rib_fib_conf inside rte_rib_conf
> 
> 
> >
> > > +
> > > +enum rte_rib_op {
> > > +     RTE_RIB_ADD,
> > > +     RTE_RIB_DEL
> > > +};
> > > +
> > > +/** RIB nodes allocation type */
> > > +enum rte_rib_alloc_type {
> > > +     RTE_RIB_MALLOC,
> > > +     RTE_RIB_MEMPOOL,
> > > +     RTE_RIB_ALLOC_MAX
> > > +};
> >
> > Not sure you need this any more. Malloc allocations and mempool
> > allocations are now pretty much the same thing.
> >
> Actually I think to remove malloc. On performance tests with
> adding/deleting huge amount of routes malloc is slower. Maybe because of
> fragmentation.
> What do you think?
> 
Yes, definitely mempool allocations are the way to go!

> 
> > > +
> > > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
> > > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> >
> > Do you anticipate more ops in future than just add and delete? If not,
> > why not just split this function into two and drop the op struct.
> >
> It is difficult question. I'm not ready to make decision at the moment.
> 
> 
> >
> > > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
> > > +     uint64_t *next_hops, const unsigned n);
> > > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib
> > *rib);
> > > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > > +     struct rte_rib_node *node);
> > > +
> > > +struct rte_rib {
> > > +     char name[RTE_RIB_NAMESIZE];
> > > +     /*pointer to rib trie*/
> > > +     struct rte_rib_node     *trie;
> > > +     /*pointer to dataplane struct*/
> > > +     void    *fib;
> > > +     /*prefix modification*/
> > > +     rte_rib_modify_fn_t     modify;
> > > +     /* Bulk lookup fn*/
> > > +     rte_rib_tree_lookup_fn_t        lookup;
> > > +     /*alloc trie element*/
> > > +     rte_rib_alloc_node_fn_t alloc_node;
> > > +     /*free trie element*/
> > > +     rte_rib_free_node_fn_t  free_node;
> > > +     struct rte_mempool      *node_pool;
> > > +     uint32_t                cur_nodes;
> > > +     uint32_t                cur_routes;
> > > +     int                     max_nodes;
> > > +     int                     node_sz;
> > > +     enum rte_rib_type       type;
> > > +     enum rte_rib_alloc_type alloc_type;
> > > +};
> > > +
> > > +/** RIB configuration structure */
> > > +struct rte_rib_conf {
> > > +     enum rte_rib_type       type;
> > > +     enum rte_rib_alloc_type alloc_type;
> > > +     int     max_nodes;
> > > +     size_t  node_sz;
> > > +     uint64_t def_nh;
> > > +};
> > > +
> > > +/**
> > > + * Lookup an IP into the RIB structure
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  IP to be looked up in the RIB
> > > + * @return
> > > + *  pointer to struct rte_rib_node on success,
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > > +
> > > +/**
> > > + * Lookup less specific route into the RIB structure
> > > + *
> > > + * @param ent
> > > + *  Pointer to struct rte_rib_node that represents target route
> > > + * @return
> > > + *  pointer to struct rte_rib_node that represents
> > > + *  less specific route on success,
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > > +
> > > +/**
> > > + * Lookup prefix into the RIB structure
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net to be looked up in the RIB
> > > + * @param depth
> > > + *  prefix length
> > > + * @return
> > > + *  pointer to struct rte_rib_node on success,
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t
> > depth);
> >
> > Can you explain the difference between this and regular lookup, and how
> > they would be used. I don't think the names convey the differences
> > sufficiently, and so we should look to rename one or both to be clearer.
> >
> Regular lookup (rte_rib_tree_lookup) will lookup for most specific node for
> passed key.
> rte_rib_tree_lookup_exact will lookup node contained key and depth equal to
> passed in args. It used to find exact route.
> 
So if there is no node exactly matching the parameters, it the lookup_exact
returns failure? E.g. if you request a /24 node, it won't return a /8 node
that would cover the /24?

> 
> >
> > > +
> > > +/**
> > > + * Retrieve next more specific prefix from the RIB
> > s/more/most/
> >
> 
> > > + * that is covered by key/depth supernet
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net address of supernet prefix that covers returned more specific
> > prefixes
> > > + * @param depth
> > > + *  supernet prefix length
> > > + * @param cur
> > > + *   pointer to the last returned prefix to get next prefix
> > > + *   or
> > > + *   NULL to get first more specific prefix
> > > + * @param flag
> > > + *  -RTE_RIB_GET_NXT_ALL
> > > + *   get all prefixes from subtrie
> >
> > By all prefixes do you mean more specific, i.e. the final prefix?
> >
> What do you mean the final prefix?
> 
The most specific one, or the longest prefix.

> 
> > > + *  -RTE_RIB_GET_NXT_COVER
> > > + *   get only first more specific prefix even if it have more specifics
> > > + * @return
> > > + *  pointer to the next more specific prefix
> > > + *  or
> > > + *  NULL if there is no prefixes left
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
> > > +     struct rte_rib_node *cur, int flag);
> > > +
> > > +/**
> > > + * Remove prefix from the RIB
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net to be removed from the RIB
> > > + * @param depth
> > > + *  prefix length
> > > + */
> > > +void
> > > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > > +
> > > +/**
> > > + * Insert prefix into the RIB
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net to be inserted to the RIB
> > > + * @param depth
> > > + *  prefix length
> > > + * @return
> > > + *  pointer to new rte_rib_node on success
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > > +
> > > +/**
> > > + * Create RIB
> > > + *
> > > + * @param name
> > > + *  RIB name
> > > + * @param socket_id
> > > + *  NUMA socket ID for RIB table memory allocation
> > > + * @param conf
> > > + *  Structure containing the configuration
> > > + * @return
> > > + *  Handle to RIB object on success
> > > + *  NULL otherwise with rte_errno set to an appropriate values.
> > > + */
> > > +struct rte_rib *
> > > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> > *conf);
> > > +
> > > +/**
> > > + * Find an existing RIB object and return a pointer to it.
> > > + *
> > > + * @param name
> > > + *  Name of the rib object as passed to rte_rib_create()
> > > + * @return
> > > + *  Pointer to rib object or NULL if object not found with rte_errno
> > > + *  set appropriately. Possible rte_errno values include:
> > > + *   - ENOENT - required entry not available to return.
> > > + */
> > > +struct rte_rib *
> > > +rte_rib_find_existing(const char *name);
> > > +
> > > +/**
> > > + * Free an RIB object.
> > > + *
> > > + * @param rib
> > > + *   RIB object handle
> > > + * @return
> > > + *   None
> > > + */
> > > +void
> > > +rte_rib_free(struct rte_rib *rib);
> > > +
> > > +/**
> > > + * Add a rule to the RIB.
> > > + *
> > > + * @param rib
> > > + *   RIB object handle
> > > + * @param ip
> > > + *   IP of the rule to be added to the RIB
> > > + * @param depth
> > > + *   Depth of the rule to be added to the RIB
> > > + * @param next_hop
> > > + *   Next hop of the rule to be added to the RIB
> > > + * @return
> > > + *   0 on success, negative value otherwise
> > > + */
> > > +int
> > > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t
> > next_hop);
> > > +
> > > +/**
> > > + * Delete a rule from the RIB.
> > > + *
> > > + * @param rib
> > > + *   RIB object handle
> > > + * @param ip
> > > + *   IP of the rule to be deleted from the RIB
> > > + * @param depth
> > > + *   Depth of the rule to be deleted from the RIB
> > > + * @return
> > > + *   0 on success, negative value otherwise
> > > + */
> > > +int
> > > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > > +
> > > +/**
> > > + * Lookup multiple IP addresses in an FIB. This may be implemented as a
> > > + * macro, so the address of the function should not be used.
> > > + *
> > > + * @param RIB
> > > + *   RIB object handle
> > > + * @param ips
> > > + *   Array of IPs to be looked up in the FIB
> > > + * @param next_hops
> > > + *   Next hop of the most specific rule found for IP.
> > > + *   This is an array of eight byte values.
> > > + *   If the lookup for the given IP failed, then corresponding element
> > would
> > > + *   contain default value, see description of then next parameter.
> > > + * @param n
> > > + *   Number of elements in ips (and next_hops) array to lookup. This
> > should be a
> > > + *   compile time constant, and divisible by 8 for best performance.
> > > + * @param defv
> > > + *   Default value to populate into corresponding element of hop[]
> > array,
> > > + *   if lookup would fail.
> > > + *  @return
> > > + *   -EINVAL for incorrect arguments, otherwise 0
> > > + */
> > > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)      \
> > > +     rib->lookup(rib->fib, ips, next_hops, n)
> >
> > My main thought here is whether this needs to be a function at all?
> > Given that it takes a full burst of addresses in a single go, how much
> > performance would actually be lost by making this a regular function in
> > the C file?
> > IF we do convert this to a regular function, then a lot of the structure
> > definitions above - most importantly, the rib structure itself - can
> > probably be moved to a private header file and not exposed to
> > applications at all. This will make ABI compatibility a *lot* easier, as
> > the structures can be changed without affecting the public ABI.
> >
> I didn't quite understand what you mean.
> 
Sorry, by "needs to be a function" in first line read "needs to be a
macro". Basically, the point is to not inline anything that doesn't need
it. If a function works on a burst of packets, it probably will be fine
being a regular function than a macro or inline function.

/Bruce

next prev parent reply	other threads:[~2018-03-26  9:50 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-21 21:44 [dpdk-dev] [PATCH v2 0/2] lib/rib: Add Routing Information Base library Medvedkin Vladimir
2018-02-21 21:44 ` [dpdk-dev] [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
2018-03-14 11:09   ` Bruce Richardson
2018-03-14 12:05     ` Richardson, Bruce
2018-03-25 18:17     ` Vladimir Medvedkin
2018-03-26  9:50       ` Bruce Richardson [this message]
2018-03-29 19:59         ` Vladimir Medvedkin
2018-03-29 10:27   ` Bruce Richardson
2018-03-29 20:11     ` Vladimir Medvedkin
2018-03-29 20:41       ` Bruce Richardson
2018-02-21 21:44 ` [dpdk-dev] [PATCH v2 2/2] Add autotests for " Medvedkin Vladimir

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180326095042.GA17660@bricha3-MOBL.ger.corp.intel.com \
    --to=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=medvedkinv@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).