From: Yongseok Koh <yskoh@mellanox.com>
To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Cc: Shahaf Shuler <shahafs@mellanox.com>,
Nelio Laranjeiro <nelio.laranjeiro@6wind.com>,
dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 2/6] net/mlx5: add framework for switch flow rules
Date: Thu, 12 Jul 2018 11:25:27 -0700 [thread overview]
Message-ID: <20180712182526.GA73570@yongseok-MBP.local> (raw)
In-Reply-To: <20180712104646.GT5211@6wind.com>
On Thu, Jul 12, 2018 at 12:46:46PM +0200, Adrien Mazarguil wrote:
> On Wed, Jul 11, 2018 at 05:59:18PM -0700, Yongseok Koh wrote:
> > On Wed, Jun 27, 2018 at 08:08:12PM +0200, Adrien Mazarguil wrote:
> > > Because mlx5 switch flow rules are configured through Netlink (TC
> > > interface) and have little in common with Verbs, this patch adds a separate
> > > parser function to handle them.
> > >
> > > - mlx5_nl_flow_transpose() converts a rte_flow rule to its TC equivalent
> > > and stores the result in a buffer.
> > >
> > > - mlx5_nl_flow_brand() gives a unique handle to a flow rule buffer.
> > >
> > > - mlx5_nl_flow_create() instantiates a flow rule on the device based on
> > > such a buffer.
> > >
> > > - mlx5_nl_flow_destroy() performs the reverse operation.
> > >
> > > These functions are called by the existing implementation when encountering
> > > flow rules which must be offloaded to the switch (currently relying on the
> > > transfer attribute).
> > >
> > > Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
> <snip>
> > > diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
> > > index 9241855be..93b245991 100644
> > > --- a/drivers/net/mlx5/mlx5_flow.c
> > > +++ b/drivers/net/mlx5/mlx5_flow.c
> > > @@ -4,6 +4,7 @@
> > > */
> > >
> > > #include <sys/queue.h>
> > > +#include <stdalign.h>
> > > #include <stdint.h>
> > > #include <string.h>
> > >
> > > @@ -271,6 +272,7 @@ struct rte_flow {
> > > /**< Store tunnel packet type data to store in Rx queue. */
> > > uint8_t key[40]; /**< RSS hash key. */
> > > uint16_t (*queue)[]; /**< Destination queues to redirect traffic to. */
> > > + void *nl_flow; /**< Netlink flow buffer if relevant. */
> > > };
> > >
> > > static const struct rte_flow_ops mlx5_flow_ops = {
> > > @@ -2403,6 +2405,106 @@ mlx5_flow_actions(struct rte_eth_dev *dev,
> > > }
> > >
> > > /**
> > > + * Validate flow rule and fill flow structure accordingly.
> > > + *
> > > + * @param dev
> > > + * Pointer to Ethernet device.
> > > + * @param[out] flow
> > > + * Pointer to flow structure.
> > > + * @param flow_size
> > > + * Size of allocated space for @p flow.
> > > + * @param[in] attr
> > > + * Flow rule attributes.
> > > + * @param[in] pattern
> > > + * Pattern specification (list terminated by the END pattern item).
> > > + * @param[in] actions
> > > + * Associated actions (list terminated by the END action).
> > > + * @param[out] error
> > > + * Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + * A positive value representing the size of the flow object in bytes
> > > + * regardless of @p flow_size on success, a negative errno value otherwise
> > > + * and rte_errno is set.
> > > + */
> > > +static int
> > > +mlx5_flow_merge_switch(struct rte_eth_dev *dev,
> > > + struct rte_flow *flow,
> > > + size_t flow_size,
> > > + const struct rte_flow_attr *attr,
> > > + const struct rte_flow_item pattern[],
> > > + const struct rte_flow_action actions[],
> > > + struct rte_flow_error *error)
> > > +{
> > > + struct priv *priv = dev->data->dev_private;
> > > + unsigned int n = mlx5_domain_to_port_id(priv->domain_id, NULL, 0);
> > > + uint16_t port_list[!n + n];
> > > + struct mlx5_nl_flow_ptoi ptoi[!n + n + 1];
> > > + size_t off = RTE_ALIGN_CEIL(sizeof(*flow), alignof(max_align_t));
> > > + unsigned int i;
> > > + unsigned int own = 0;
> > > + int ret;
> > > +
> > > + /* At least one port is needed when no switch domain is present. */
> > > + if (!n) {
> > > + n = 1;
> > > + port_list[0] = dev->data->port_id;
> > > + } else {
> > > + n = mlx5_domain_to_port_id(priv->domain_id, port_list, n);
> > > + if (n > RTE_DIM(port_list))
> > > + n = RTE_DIM(port_list);
> > > + }
> > > + for (i = 0; i != n; ++i) {
> > > + struct rte_eth_dev_info dev_info;
> > > +
> > > + rte_eth_dev_info_get(port_list[i], &dev_info);
> > > + if (port_list[i] == dev->data->port_id)
> > > + own = i;
> > > + ptoi[i].port_id = port_list[i];
> > > + ptoi[i].ifindex = dev_info.if_index;
> > > + }
> > > + /* Ensure first entry of ptoi[] is the current device. */
> > > + if (own) {
> > > + ptoi[n] = ptoi[0];
> > > + ptoi[0] = ptoi[own];
> > > + ptoi[own] = ptoi[n];
> > > + }
> > > + /* An entry with zero ifindex terminates ptoi[]. */
> > > + ptoi[n].port_id = 0;
> > > + ptoi[n].ifindex = 0;
> > > + if (flow_size < off)
> > > + flow_size = 0;
> > > + ret = mlx5_nl_flow_transpose((uint8_t *)flow + off,
> > > + flow_size ? flow_size - off : 0,
> > > + ptoi, attr, pattern, actions, error);
> > > + if (ret < 0)
> > > + return ret;
> >
> > So, there's an assumption that the buffer allocated outside of this API is
> > enough to include all the messages in mlx5_nl_flow_transpose(), right? If
> > flow_size isn't enough, buf_tmp will be used and _transpose() doesn't return
> > error but required size. Sounds confusing, may need to make a change or to have
> > clearer documentation.
>
> Well, isn't it already documented? Besides these are the usual snprintf()
> semantics used everywhere in these files, I think this was a major topic of
> discussion with Nelio on the flow rework series :)
>
> buf_tmp[] is internal to mlx5_nl_flow_transpose() and used as a fallback to
> complete a pass when input buffer is not large enough (including
> zero-sized). Having a valid buffer is a constraint imposed by libmnl,
> because we badly want to know how much space will be needed assuming the
> flow rule was successfully processed.
>
> Without libmnl, the helpers it provides would have been written in a way
> that doesn't require buf_tmp[]. However libmnl is just too convenient to
> pass up, hence this compromise.
>
> (just to remind onlookers, we want to allocate the minimum amount of memory
> we possibly can for resources needed by each flow rule, and do so through a
> single allocation, end goal being to support millions of flow rules while
> wasting as little memory as possible.)
>
> <snip>
> > > diff --git a/drivers/net/mlx5/mlx5_nl_flow.c b/drivers/net/mlx5/mlx5_nl_flow.c
> > > index 7a8683b03..1fc62fb0a 100644
> > > --- a/drivers/net/mlx5/mlx5_nl_flow.c
> > > +++ b/drivers/net/mlx5/mlx5_nl_flow.c
> > > @@ -5,7 +5,9 @@
> > >
> > > #include <errno.h>
> > > #include <libmnl/libmnl.h>
> > > +#include <linux/if_ether.h>
> > > #include <linux/netlink.h>
> > > +#include <linux/pkt_cls.h>
> > > #include <linux/pkt_sched.h>
> > > #include <linux/rtnetlink.h>
> > > #include <stdalign.h>
> > > @@ -14,11 +16,248 @@
> > > #include <stdlib.h>
> > > #include <sys/socket.h>
> > >
> > > +#include <rte_byteorder.h>
> > > #include <rte_errno.h>
> > > #include <rte_flow.h>
> > >
> > > #include "mlx5.h"
> > >
> > > +/** Parser state definitions for mlx5_nl_flow_trans[]. */
> > > +enum mlx5_nl_flow_trans {
> > > + INVALID,
> > > + BACK,
> > > + ATTR,
> > > + PATTERN,
> > > + ITEM_VOID,
> > > + ACTIONS,
> > > + ACTION_VOID,
> > > + END,
> > > +};
> > > +
> > > +#define TRANS(...) (const enum mlx5_nl_flow_trans []){ __VA_ARGS__, INVALID, }
> > > +
> > > +#define PATTERN_COMMON \
> > > + ITEM_VOID, ACTIONS
> > > +#define ACTIONS_COMMON \
> > > + ACTION_VOID, END
> > > +
> > > +/** Parser state transitions used by mlx5_nl_flow_transpose(). */
> > > +static const enum mlx5_nl_flow_trans *const mlx5_nl_flow_trans[] = {
> > > + [INVALID] = NULL,
> > > + [BACK] = NULL,
> > > + [ATTR] = TRANS(PATTERN),
> > > + [PATTERN] = TRANS(PATTERN_COMMON),
> > > + [ITEM_VOID] = TRANS(BACK),
> > > + [ACTIONS] = TRANS(ACTIONS_COMMON),
> > > + [ACTION_VOID] = TRANS(BACK),
> > > + [END] = NULL,
> > > +};
> > > +
> > > +/**
> > > + * Transpose flow rule description to rtnetlink message.
> > > + *
> > > + * This function transposes a flow rule description to a traffic control
> > > + * (TC) filter creation message ready to be sent over Netlink.
> > > + *
> > > + * Target interface is specified as the first entry of the @p ptoi table.
> > > + * Subsequent entries enable this function to resolve other DPDK port IDs
> > > + * found in the flow rule.
> > > + *
> > > + * @param[out] buf
> > > + * Output message buffer. May be NULL when @p size is 0.
> > > + * @param size
> > > + * Size of @p buf. Message may be truncated if not large enough.
> > > + * @param[in] ptoi
> > > + * DPDK port ID to network interface index translation table. This table
> > > + * is terminated by an entry with a zero ifindex value.
> > > + * @param[in] attr
> > > + * Flow rule attributes.
> > > + * @param[in] pattern
> > > + * Pattern specification.
> > > + * @param[in] actions
> > > + * Associated actions.
> > > + * @param[out] error
> > > + * Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + * A positive value representing the exact size of the message in bytes
> > > + * regardless of the @p size parameter on success, a negative errno value
> > > + * otherwise and rte_errno is set.
> > > + */
> > > +int
> > > +mlx5_nl_flow_transpose(void *buf,
> > > + size_t size,
> > > + const struct mlx5_nl_flow_ptoi *ptoi,
> > > + const struct rte_flow_attr *attr,
> > > + const struct rte_flow_item *pattern,
> > > + const struct rte_flow_action *actions,
> > > + struct rte_flow_error *error)
> > > +{
> > > + alignas(struct nlmsghdr)
> > > + uint8_t buf_tmp[MNL_SOCKET_BUFFER_SIZE];
> > > + const struct rte_flow_item *item;
> > > + const struct rte_flow_action *action;
> > > + unsigned int n;
> > > + struct nlattr *na_flower;
> > > + struct nlattr *na_flower_act;
> > > + const enum mlx5_nl_flow_trans *trans;
> > > + const enum mlx5_nl_flow_trans *back;
> > > +
> > > + if (!size)
> > > + goto error_nobufs;
> > > +init:
> > > + item = pattern;
> > > + action = actions;
> > > + n = 0;
> > > + na_flower = NULL;
> > > + na_flower_act = NULL;
> > > + trans = TRANS(ATTR);
> > > + back = trans;
> > > +trans:
> > > + switch (trans[n++]) {
> > > + struct nlmsghdr *nlh;
> > > + struct tcmsg *tcm;
> > > +
> > > + case INVALID:
> > > + if (item->type)
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ITEM,
> > > + item, "unsupported pattern item combination");
> > > + else if (action->type)
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ACTION,
> > > + action, "unsupported action combination");
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > + "flow rule lacks some kind of fate action");
> > > + case BACK:
> > > + trans = back;
> > > + n = 0;
> > > + goto trans;
> > > + case ATTR:
> > > + /*
> > > + * Supported attributes: no groups, some priorities and
> > > + * ingress only. Don't care about transfer as it is the
> > > + * caller's problem.
> > > + */
> > > + if (attr->group)
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP,
> > > + RTE_FLOW_ERROR_TYPE_ATTR_GROUP,
> > > + attr, "groups are not supported");
> > > + if (attr->priority > 0xfffe)
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP,
> > > + RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY,
> > > + attr, "lowest priority level is 0xfffe");
> > > + if (!attr->ingress)
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP,
> > > + RTE_FLOW_ERROR_TYPE_ATTR_INGRESS,
> > > + attr, "only ingress is supported");
> > > + if (attr->egress)
> > > + return rte_flow_error_set
> > > + (error, ENOTSUP,
> > > + RTE_FLOW_ERROR_TYPE_ATTR_INGRESS,
> > > + attr, "egress is not supported");
> > > + if (size < mnl_nlmsg_size(sizeof(*tcm)))
> > > + goto error_nobufs;
> > > + nlh = mnl_nlmsg_put_header(buf);
> > > + nlh->nlmsg_type = 0;
> > > + nlh->nlmsg_flags = 0;
> > > + nlh->nlmsg_seq = 0;
> > > + tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > > + tcm->tcm_family = AF_UNSPEC;
> > > + tcm->tcm_ifindex = ptoi[0].ifindex;
> > > + /*
> > > + * Let kernel pick a handle by default. A predictable handle
> > > + * can be set by the caller on the resulting buffer through
> > > + * mlx5_nl_flow_brand().
> > > + */
> > > + tcm->tcm_handle = 0;
> > > + tcm->tcm_parent = TC_H_MAKE(TC_H_INGRESS, TC_H_MIN_INGRESS);
> > > + /*
> > > + * Priority cannot be zero to prevent the kernel from
> > > + * picking one automatically.
> > > + */
> > > + tcm->tcm_info = TC_H_MAKE((attr->priority + 1) << 16,
> > > + RTE_BE16(ETH_P_ALL));
> > > + break;
> > > + case PATTERN:
> > > + if (!mnl_attr_put_strz_check(buf, size, TCA_KIND, "flower"))
> > > + goto error_nobufs;
> > > + na_flower = mnl_attr_nest_start_check(buf, size, TCA_OPTIONS);
> > > + if (!na_flower)
> > > + goto error_nobufs;
> > > + if (!mnl_attr_put_u32_check(buf, size, TCA_FLOWER_FLAGS,
> > > + TCA_CLS_FLAGS_SKIP_SW))
> > > + goto error_nobufs;
> > > + break;
> > > + case ITEM_VOID:
> > > + if (item->type != RTE_FLOW_ITEM_TYPE_VOID)
> > > + goto trans;
> > > + ++item;
> > > + break;
> > > + case ACTIONS:
> > > + if (item->type != RTE_FLOW_ITEM_TYPE_END)
> > > + goto trans;
> > > + assert(na_flower);
> > > + assert(!na_flower_act);
> > > + na_flower_act =
> > > + mnl_attr_nest_start_check(buf, size, TCA_FLOWER_ACT);
> > > + if (!na_flower_act)
> > > + goto error_nobufs;
> > > + break;
> > > + case ACTION_VOID:
> > > + if (action->type != RTE_FLOW_ACTION_TYPE_VOID)
> > > + goto trans;
> > > + ++action;
> > > + break;
> > > + case END:
> > > + if (item->type != RTE_FLOW_ITEM_TYPE_END ||
> > > + action->type != RTE_FLOW_ACTION_TYPE_END)
> > > + goto trans;
> > > + if (na_flower_act)
> > > + mnl_attr_nest_end(buf, na_flower_act);
> > > + if (na_flower)
> > > + mnl_attr_nest_end(buf, na_flower);
> > > + nlh = buf;
> > > + return nlh->nlmsg_len;
> > > + }
> > > + back = trans;
> > > + trans = mlx5_nl_flow_trans[trans[n - 1]];
> > > + n = 0;
> > > + goto trans;
> > > +error_nobufs:
> > > + if (buf != buf_tmp) {
> > > + buf = buf_tmp;
> > > + size = sizeof(buf_tmp);
> > > + goto init;
> > > + }
> >
> > Continuing my comment above.
> > This part is unclear. It looks to me that this func does:
> >
> > 1) if size is zero, consider it as a testing call to know the amount of memory
> > required.
>
> Yeah, in fact this one is a shortcut to speed up this specific scenario as
> it happens all the time in the two-pass use case. You can lump it with 2).
>
> > 2) if size isn't zero but not enough, it stops writing to buf and start over to
> > return the amount of memory required instead of returning error.
> > 3) if size isn't zero and enough, it fills in buf.
> >
> > Do I correctly understand?
>
> Yes. Another minor note for 2), the returned buffer is also filled up to the
> point of failure (mimics snprintf()).
>
> Perhaps the following snippet can better summarize the envisioned approach:
>
> int ret = snprintf(NULL, 0, "something", ...);
>
> if (ret < 0) {
> goto court;
> } else {
> char buf[ret];
>
> snprintf(buf, sizeof(buf), "something", ...); /* Guaranteed. */
> [...]
> }
I know you and Nelio mimicked snprintf() but as _merge() isn't a public API for
users but an internal API. I didn't think it should necessarily be like that. I
hoped to have it used either for testing (knowing size) or real translation - 1)
and 3). And no possibility for 2), then 2) would've been handled by assert().
I beleive this could've made the code simpler.
However, as I already acked Nelio's patchset, agreed on the idea and Nelio
already documented the behavior,
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Thanks
next prev parent reply other threads:[~2018-07-12 18:25 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-06-27 18:08 [dpdk-dev] [PATCH 0/6] net/mlx5: add support " Adrien Mazarguil
2018-06-27 18:08 ` [dpdk-dev] [PATCH 1/6] net/mlx5: lay groundwork for switch offloads Adrien Mazarguil
2018-07-12 0:17 ` Yongseok Koh
2018-07-12 10:46 ` Adrien Mazarguil
2018-07-12 17:33 ` Yongseok Koh
2018-06-27 18:08 ` [dpdk-dev] [PATCH 2/6] net/mlx5: add framework for switch flow rules Adrien Mazarguil
2018-07-12 0:59 ` Yongseok Koh
2018-07-12 10:46 ` Adrien Mazarguil
2018-07-12 18:25 ` Yongseok Koh [this message]
2018-06-27 18:08 ` [dpdk-dev] [PATCH 3/6] net/mlx5: add fate actions to " Adrien Mazarguil
2018-07-12 1:00 ` Yongseok Koh
2018-06-27 18:08 ` [dpdk-dev] [PATCH 4/6] net/mlx5: add L2-L4 pattern items " Adrien Mazarguil
2018-07-12 1:02 ` Yongseok Koh
2018-06-27 18:08 ` [dpdk-dev] [PATCH 5/6] net/mlx5: add VLAN item and actions " Adrien Mazarguil
2018-07-12 1:10 ` Yongseok Koh
2018-07-12 10:47 ` Adrien Mazarguil
2018-07-12 18:49 ` Yongseok Koh
2018-06-27 18:08 ` [dpdk-dev] [PATCH 6/6] net/mlx5: add port ID pattern item " Adrien Mazarguil
2018-07-12 1:13 ` Yongseok Koh
2018-06-28 9:05 ` [dpdk-dev] [PATCH 0/6] net/mlx5: add support for " Nélio Laranjeiro
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 " Adrien Mazarguil
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 1/6] net/mlx5: lay groundwork for switch offloads Adrien Mazarguil
2018-07-14 1:29 ` Yongseok Koh
2018-07-23 21:40 ` Ferruh Yigit
2018-07-24 0:50 ` Stephen Hemminger
2018-07-24 4:35 ` Shahaf Shuler
2018-07-24 19:33 ` Stephen Hemminger
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 2/6] net/mlx5: add framework for switch flow rules Adrien Mazarguil
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 3/6] net/mlx5: add fate actions to " Adrien Mazarguil
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 4/6] net/mlx5: add L2-L4 pattern items " Adrien Mazarguil
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 5/6] net/mlx5: add VLAN item and actions " Adrien Mazarguil
2018-07-13 9:40 ` [dpdk-dev] [PATCH v2 6/6] net/mlx5: add port ID pattern item " Adrien Mazarguil
2018-07-22 11:21 ` [dpdk-dev] [PATCH v2 0/6] net/mlx5: add support for " Shahaf Shuler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180712182526.GA73570@yongseok-MBP.local \
--to=yskoh@mellanox.com \
--cc=adrien.mazarguil@6wind.com \
--cc=dev@dpdk.org \
--cc=nelio.laranjeiro@6wind.com \
--cc=shahafs@mellanox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).