DPDK patches and discussions
* Re: [PATCH v3] netdev-dpdk: add control plane protection support
       [not found] <20221021145308.141933-1-rjarry@redhat.com>
@ 2022-11-11 18:15 ` Kevin Traynor
  2022-11-14  9:41   ` Kevin Traynor
  2022-11-14 12:46   ` Robin Jarry
  0 siblings, 2 replies; 3+ messages in thread
From: Kevin Traynor @ 2022-11-11 18:15 UTC (permalink / raw)
  To: Robin Jarry, dev; +Cc: Christophe Fontaine

Hi Robin,

On 21/10/2022 15:53, Robin Jarry wrote:
> Some control protocols are used to maintain link status between
> forwarding engines (e.g. LACP). When the system is not sized properly,
> the PMD threads may not be able to process all incoming traffic from the
> configured Rx queues. When a signaling packet of such protocols is
> dropped, it can cause link flapping, worsening the situation.
> 
> Use the RTE flow API to redirect these protocols into a dedicated Rx
> queue. The assumption is made that the ratio between control protocol
> traffic and user data traffic is very low and thus this dedicated Rx
> queue will never get full. The RSS redirection table is re-programmed to
> only use the other Rx queues. The RSS table size is stored in the
> netdev_dpdk structure at port initialization to avoid requesting the
> information again when changing the port configuration.
> 
> The additional Rx queue will be assigned a PMD core like any other Rx
> queue. Polling that extra queue may introduce increased latency and
> a slight performance penalty for the benefit of preventing link flapping.
> 
> This feature must be enabled per port on specific protocols via the
> cp-protection option. This option takes a comma-separated list of
> protocol names. It is only supported on ethernet ports.
> 
> If the user has already configured multiple Rx queues on the port, an
> additional one will be allocated for control plane packets. If the
> hardware cannot satisfy the requested number of Rx queues, the
> last Rx queue will be assigned for control plane. If only one Rx queue
> is available, the cp-protection feature will be disabled. If the
> hardware does not support the RTE flow matchers/actions, the feature
> will be disabled.
> 
> It cannot be enabled when other_config:hw-offload=true as it may
> conflict with the offloaded RTE flows. Similarly, if hw-offload is
> enabled while some ports already have cp-protection enabled, the RTE
> flow offloading will be disabled on these ports.
> 
> Example use:
> 
>   ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
>     set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
>     set interface phy0 options:cp-protection=lacp -- \
>     set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
>     set interface phy1 options:cp-protection=lacp
> 
> As a starting point, only one protocol is supported: LACP. Other
> protocols can be added in the future. NIC compatibility should be
> checked.
> 
> To validate that this works as intended, I used a traffic generator to
> generate random traffic slightly above the machine capacity at line rate
> on a two ports bond interface. OVS is configured to receive traffic on
> two VLANs and pop/push them in a br-int bridge based on tags set on
> patch ports.
> 
>     +----------------------+
>     |         DUT          |
>     |+--------------------+|
>     ||       br-int       || default flow, action=NORMAL
>     ||                    ||
>     || patch10    patch11 ||
>     |+---|-----------|----+|
>     |    |           |     |
>     |+---|-----------|----+|
>     || patch00    patch01 ||
>     ||  tag:10    tag:20  ||
>     ||                    ||
>     ||       br-phy       || default flow, action=NORMAL
>     ||                    ||
>     ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
>     ||    phy0   phy1     ||
>     |+------|-----|-------+|
>     +-------|-----|--------+
>             |     |
>     +-------|-----|--------+
>     |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
>     |         lag          | mode trunk VLANs 10, 20
>     |                      |
>     |        switch        |
>     |                      |
>     |  vlan 10    vlan 20  |  mode access
>     |   port2      port3   |
>     +-----|----------|-----+
>           |          |
>     +-----|----------|-----+
>     |   port0      port1   |  Random traffic that is properly balanced
>     |                      |  across the bond ports in both directions.
>     |  traffic generator   |
>     +----------------------+
> 
> Without cp-protection, the bond0 links are randomly switching to
> "defaulted" when one of the LACP packets sent by the switch is dropped
> because the RX queues are full and the PMD threads did not process them
> fast enough. When that happens, all traffic must go through a single
> link which causes above line rate traffic to be dropped.
> 
> When cp-protection is enabled, no LACP packet is dropped and the bond
> links remain enabled at all times, maximizing the throughput.
> 
> This feature may be considered as "QoS". However, it does not work by
> limiting the rate of traffic explicitly. It only guarantees that some
> protocols have a lower chance of being dropped because the PMD cores
> cannot keep up with regular traffic.
> 
> The choice of protocols is limited on purpose. This is not meant to be
> configurable by users. Some limited configurability could be considered
> in the future but it would expose users to more potential issues if they are
> accidentally redirecting all traffic into the control plane queue.
> 

Not a full review, but rather sending now with an issue I saw on CX-5 that we
discussed, as it'll impact testing.

> Cc: Christophe Fontaine <cfontain@redhat.com>
> Cc: Kevin Traynor <ktraynor@redhat.com>
> Signed-off-by: Robin Jarry <rjarry@redhat.com>
> ---
> v2 -> v3:
> 
> * Added dry_run validation that rte_flows are all supported by the NIC
>    before configuring anything.
> * Added check to make cp-protection and hw-offload mutually exclusive.
> * Removed the "match-all" RSS flow that dealt with redirecting all
>    non-control-plane traffic to all but the control-plane Rx queue. Very
>    few NICs actually support "match-all" flows without any mask. This was
>    replaced by reconfiguring the RSS redirection table.
> * Made sure to unconfigure everything and remove the extra Rx queue in
>    the case the hardware does not support one of the RTE flows.
> * Updated vswitchd/vswitch.xml
> * Added diagnostics info in netdev_dpdk_get_status
> * Tested under load on the following NICs:
>    - Intel E810 (2x 25G)
>    - Mellanox ConnectX-5 (2x 25G)
> * Basic functionality tested on the following NICs:
>    - Intel 82599ES (2x 10G)
>    - Intel X710 (4x 10G)
>    - Mellanox ConnectX-4 (2x 25G)
> 
>   Documentation/topics/dpdk/phy.rst |  55 ++++++
>   lib/netdev-dpdk.c                 | 293 +++++++++++++++++++++++++++++-
>   vswitchd/vswitch.xml              |  26 +++
>   3 files changed, 373 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
> index 937f4c40e5a8..86e69d79b104 100644
> --- a/Documentation/topics/dpdk/phy.rst
> +++ b/Documentation/topics/dpdk/phy.rst
> @@ -131,6 +131,61 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
>   for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
>   information on configuring PMD threads, refer to :doc:`pmd`.
>   
> +Control Plane Protection
> +------------------------
> +
> +Some control protocols are used to maintain link status between forwarding
> +engines. In SDN environments, these packets share the same physical network
> +as the user data traffic.
> +
> +When the system is not sized properly, the PMD threads may not be able to
> +process all incoming traffic from the configured Rx queues. When a signaling
> +packet of such protocols is dropped, it can cause link flapping, worsening the
> +situation.
> +
> +Some physical NICs can be programmed to put these protocols in a dedicated
> +hardware Rx queue using the rte_flow__ API.
> +
> +__ https://doc.dpdk.org/guides-21.11/prog_guide/rte_flow.html#device-compatibility
> +
> +The currently supported control plane protocols are:
> +
> +``lacp``
> +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
> +
> +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
> +
> +.. warning::
> +
> +   This feature is not compatible with all NICs. Refer to vendor documentation
> +   for more information.
> +
> +Control plane protection must be enabled on specific protocols per port. The
> +``cp-protection`` option requires a comma-separated list of protocol names::
> +
> +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +        options:dpdk-devargs=0000:01:00.0 options:cp-protection=lacp
> +
> +.. note::
> +
> +   If multiple Rx queues are already configured, regular RSS (Receive Side
> +   Scaling) queue balancing is done on all but the extra control plane
> +   protection queue.
> +
> +.. tip::
> +
> +   You can check if control plane protection is supported on a port with the
> +   following command::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
> +
> +   If the hardware does not support redirecting control plane traffic to
> +   a dedicated queue, it will be explicit::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection_queue=unsupported, driver_name=..., rss_queues="0-1"}
> +
>   .. _dpdk-phy-flow-control:
>   
>   Flow Control
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 0dd655507b50..94f04437a641 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -410,6 +410,11 @@ enum dpdk_hw_ol_features {
>       NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
>   };
>   
> +enum dpdk_cp_prot_flags {
> +    DPDK_CP_PROT_UNSUPPORTED = 1 << 0,
> +    DPDK_CP_PROT_LACP = 1 << 1,
> +};
> +
>   /*
>    * In order to avoid confusion in variables names, following naming convention
>    * should be used, if possible:
> @@ -453,6 +458,7 @@ struct netdev_dpdk {
>           };
>           struct dpdk_tx_queue *tx_q;
>           struct rte_eth_link link;
> +        uint16_t reta_size;
>       );
>   
>       PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
> @@ -529,6 +535,13 @@ struct netdev_dpdk {
>   
>           /* VF configuration. */
>           struct eth_addr requested_hwaddr;
> +
> +        /* Requested control plane protection flags,
> +         * from the enum set 'dpdk_cp_prot_flags' */
> +        uint64_t requested_cp_prot_flags;
> +        uint64_t cp_prot_flags;
> +        size_t cp_prot_flows_num;
> +        struct rte_flow **cp_prot_flows;
>       );
>   
>       PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -1192,6 +1205,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>                         netdev_get_name(&dev->up));
>           }
>       }
> +    dev->reta_size = info.reta_size;
>   
>       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
>       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
> @@ -1309,6 +1323,10 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>       dev->requested_n_txq = NR_QUEUE;
>       dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
>       dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
> +    dev->requested_cp_prot_flags = 0;
> +    dev->cp_prot_flags = 0;
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
>   
>       /* Initialize the flow control to NULL */
>       memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
> @@ -1904,6 +1922,9 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
>       int new_n_rxq;
>   
>       new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (dev->requested_cp_prot_flags) {
> +        new_n_rxq += 1;
> +    }
>       if (new_n_rxq != dev->requested_n_rxq) {
>           dev->requested_n_rxq = new_n_rxq;
>           netdev_request_reconfigure(&dev->up);
> @@ -1927,6 +1948,53 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
>       }
>   }
>   
> +static int
> +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
> +                        const struct smap *args, char **errp)
> +{
> +    const char *arg = smap_get_def(args, "cp-protection", "");
> +    uint64_t flags = 0;
> +    char buf[256];
> +    char *token, *saveptr;
> +
> +    ovs_strzcpy(buf, arg, sizeof(buf));
> +    buf[sizeof(buf) - 1] = '\0';
> +
> +    token = strtok_r(buf, ",", &saveptr);
> +    while (token) {
> +        if (strcmp(token, "lacp") == 0) {
> +            flags |= DPDK_CP_PROT_LACP;
> +        } else {
> +            VLOG_WARN_BUF(
> +                errp, "%s options:cp-protection unknown protocol '%s'",
> +                netdev_get_name(netdev), token);
> +            return -1;
> +        }
> +        token = strtok_r(NULL, ",", &saveptr);
> +    }
> +
> +    if (flags && dev->type != DPDK_DEV_ETH) {
> +        VLOG_WARN_BUF( errp,
> +            "%s options:cp-protection is only supported on ethernet ports",
> +            netdev_get_name(netdev));
> +        return -1;
> +    }
> +
> +    if (flags && netdev_is_flow_api_enabled()) {
> +        VLOG_WARN_BUF(errp,
> +            "%s options:cp-protection is incompatible with hw-offload",
> +            netdev_get_name(netdev));
> +        return -1;
> +    }
> +
> +    if (flags != dev->requested_cp_prot_flags) {
> +        dev->requested_cp_prot_flags = flags;
> +        netdev_request_reconfigure(netdev);
> +    }
> +
> +    return 0;
> +}
> +
>   static int
>   netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>                          char **errp)
> @@ -1946,6 +2014,11 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>       ovs_mutex_lock(&dpdk_mutex);
>       ovs_mutex_lock(&dev->mutex);
>   
> +    if (dpdk_cp_prot_set_config(netdev, dev, args, errp) < 0) {
> +        err = EINVAL;
> +        goto out;
> +    }
> +
>       dpdk_set_rxq_config(dev, args);
>   
>       dpdk_process_queue_size(netdev, args, "n_rxq_desc",
> @@ -3639,8 +3712,10 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>   {
>       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>       struct rte_eth_dev_info dev_info;
> +    uint64_t cp_prot_flags;
>       uint32_t link_speed;
>       uint32_t dev_flags;
> +    int n_rxq;
>   
>       if (!rte_eth_dev_is_valid_port(dev->port_id)) {
>           return ENODEV;
> @@ -3651,6 +3726,8 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>       rte_eth_dev_info_get(dev->port_id, &dev_info);
>       link_speed = dev->link.link_speed;
>       dev_flags = *dev_info.dev_flags;
> +    cp_prot_flags = dev->cp_prot_flags;
> +    n_rxq = netdev->n_rxq;
>       ovs_mutex_unlock(&dev->mutex);
>       const struct rte_bus *bus;
>       const struct rte_pci_device *pci_dev;
> @@ -3703,6 +3780,24 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>                           ETH_ADDR_ARGS(dev->hwaddr));
>       }
>   
> +    if (cp_prot_flags) {
> +        if (cp_prot_flags & DPDK_CP_PROT_UNSUPPORTED) {
> +            smap_add(args, "cp_protection_queue", "unsupported");
> +            if (n_rxq > 1) {
> +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 1);
> +            } else {
> +                smap_add(args, "rss_queues", "0");
> +            }
> +        } else {
> +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
> +            if (n_rxq > 2) {
> +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
> +            } else {
> +                smap_add(args, "rss_queues", "0");
> +            }
> +        }
> +    }
> +
>       return 0;
>   }
>   
> @@ -4933,6 +5028,179 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
>       .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
>   };
>   
> +static int
> +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
> +                      const struct rte_flow_attr *attr,
> +                      const struct rte_flow_item items[],
> +                      const struct rte_flow_action actions[],
> +                      const char *desc, bool dry_run)
> +{
> +    struct rte_flow_error error;
> +    struct rte_flow *flow;
> +    size_t num;
> +
> +    if (dry_run) {
> +        int ret;
> +        ret = rte_flow_validate(dev->port_id, attr, items, actions, &error);
> +        if (rte_flow_validate(dev->port_id, attr, items, actions, &error)) {

'if (ret)'
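
In other words, a minimal sketch of the suggested fix - call rte_flow_validate
once and test the stored return value:

    ret = rte_flow_validate(dev->port_id, attr, items, actions, &error);
    if (ret) {
        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
            netdev_get_name(&dev->up), desc, error.message);
    }
    return ret;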

> +            VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
> +                netdev_get_name(&dev->up), desc, error.message);
> +        }
> +        return ret;
> +    }
> +
> +    flow = rte_flow_create(dev->port_id, attr, items, actions, &error);
> +    if (flow == NULL) {
> +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
> +            netdev_get_name(&dev->up), desc, error.message);
> +        return rte_errno;
> +    }
> +
> +    num = dev->cp_prot_flows_num + 1;
> +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
> +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
> +    dev->cp_prot_flows_num = num;
> +
> +    return 0;
> +}
> +
> +static int
> +dpdk_cp_prot_add_traffic_flow(struct netdev_dpdk *dev,
> +                              const struct rte_flow_item items[],
> +                              const char *desc, bool dry_run)
> +{
> +    const struct rte_flow_attr attr = { .ingress = 1 };
> +    const struct rte_flow_action actions[] = {
> +        {
> +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
> +            .conf = &(const struct rte_flow_action_queue) {
> +                .index = dev->up.n_rxq - 1,
> +            },
> +        },
> +        { .type = RTE_FLOW_ACTION_TYPE_END },
> +    };
> +
> +    if (!dry_run) {
> +        VLOG_INFO("%s: cp-protection: redirecting %s traffic to queue %d",
> +            netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
> +    }
> +    return dpdk_cp_prot_add_flow(dev, &attr, items, actions, desc, dry_run);
> +}
> +
> +static int
> +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
> +{
> +    struct rte_eth_rss_reta_entry64 *reta_conf;
> +    size_t reta_conf_size;
> +    int err;
> +
> +    if (rss_n_rxq == 1) {
> +        VLOG_INFO("%s: cp-protection: redirecting other traffic to queue 0",
> +            netdev_get_name(&dev->up));
> +    } else {
> +        VLOG_INFO("%s: cp-protection: applying rss on queues 0-%d",
> +            netdev_get_name(&dev->up), rss_n_rxq - 1);
> +    }
> +
> +    reta_conf_size = (dev->reta_size / RTE_ETH_RETA_GROUP_SIZE)
> +        * sizeof(struct rte_eth_rss_reta_entry64);

In dpdk_eth_dev_init, we get reta_size from driver,

mlx5_ethdev.c:333

        info->reta_size = priv->reta_idx_n ?
                priv->reta_idx_n : config->ind_table_max_size;

(gdb) p priv->reta_idx_n
$5 = 1
(gdb) p config->ind_table_max_size
$6 = 512

and store:
dev->reta_size = info.reta_size;

Now when we use it here, dev->reta_size (1) / RTE_ETH_RETA_GROUP_SIZE (64)
truncates to 0, so reta_conf_size = 0.

> +    reta_conf = xmalloc(reta_conf_size);

xmalloc only allocates 1 byte (void *p = malloc(size ? size : 1);)
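
One defensive way to size and guard this - just a sketch, not something from
the patch - would be to bail out when the driver does not report enough
redirection entries and to round the allocation up:

    if (dev->reta_size < rss_n_rxq) {
        VLOG_DBG("%s: cp-protection: RSS redirection table too small "
            "(%"PRIu16" entries)", netdev_get_name(&dev->up), dev->reta_size);
        return ENOTSUP;
    }
    /* Round up so that a reta_size smaller than RTE_ETH_RETA_GROUP_SIZE (64)
     * still allocates one rte_eth_rss_reta_entry64 group. */
    reta_conf_size = ((dev->reta_size + RTE_ETH_RETA_GROUP_SIZE - 1)
                      / RTE_ETH_RETA_GROUP_SIZE)
                     * sizeof(struct rte_eth_rss_reta_entry64);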

> +    memset(reta_conf, 0, reta_conf_size);
> +
> +    for (uint16_t i = 0; i < dev->reta_size; i++) {
> +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
> +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
> +        reta_conf[idx].mask |= 1ULL << shift;
> +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
> +    }
> +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, dev->reta_size);
> +    if (err < 0) {
> +        VLOG_DBG("%s: failed to configure RSS redirection table: err=%d",
> +            netdev_get_name(&dev->up), err);
> +    }
> +
> +    free(reta_conf);
> +
> +    return err;
> +}
> +
> +static int
> +dpdk_cp_prot_configure(struct netdev_dpdk *dev, bool dry_run)
> +{
> +    int err = 0;
> +
> +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_UNSUPPORTED) {
> +        goto out;
> +    }
> +    if (dev->up.n_rxq < 2) {
> +        err = ENOTSUP;
> +        VLOG_DBG("%s: cp-protection: not enough available rx queues",
> +            netdev_get_name(&dev->up));
> +        goto out;
> +    }
> +
> +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
> +        err = dpdk_cp_prot_add_traffic_flow(
> +            dev,
> +            (const struct rte_flow_item []) {
> +                {
> +                    .type = RTE_FLOW_ITEM_TYPE_ETH,
> +                    .spec = &(const struct rte_flow_item_eth){
> +                        .type = htons(ETH_TYPE_LACP),
> +                    },
> +                    .mask = &(const struct rte_flow_item_eth){
> +                        .type = htons(0xffff),
> +                    },
> +                },
> +                { .type = RTE_FLOW_ITEM_TYPE_END },
> +            },
> +            "lacp",
> +            dry_run
> +        );
> +        if (err) {
> +            goto out;
> +        }
> +    }
> +
> +    if (!dry_run && dev->cp_prot_flows_num) {
> +        /* reconfigure RSS reta in all but the cp protection queue */
> +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
> +    }
> +
> +out:
> +    if (!dry_run) {
> +        dev->cp_prot_flags = dev->requested_cp_prot_flags;
> +    }
> +    if (err) {
> +        dev->requested_cp_prot_flags |= DPDK_CP_PROT_UNSUPPORTED;
> +    }
> +    return err;
> +}
> +
> +static void
> +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
> +{
> +    struct rte_flow_error error;
> +
> +    if (dev->cp_prot_flows_num == 0) {
> +        return;
> +    }
> +
> +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
> +
> +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
> +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
> +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
> +                netdev_get_name(&dev->up), error.message);
> +        }
> +    }
> +    free(dev->cp_prot_flows);
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
> +
> +    (void) dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq);
> +}
> +
>   static int
>   netdev_dpdk_reconfigure(struct netdev *netdev)
>   {
> @@ -4943,6 +5211,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>   
>       if (netdev->n_txq == dev->requested_n_txq
>           && netdev->n_rxq == dev->requested_n_rxq
> +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
>           && dev->mtu == dev->requested_mtu
>           && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
>           && dev->rxq_size == dev->requested_rxq_size
> @@ -4987,6 +5256,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>           }
>       }
>   
> +    dpdk_cp_prot_unconfigure(dev);
> +
>       err = dpdk_eth_dev_init(dev);
>       if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
>           netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> @@ -5014,6 +5285,20 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>       if (!dev->tx_q) {
>           err = ENOMEM;
>       }
> +    if (!err && dev->requested_cp_prot_flags) {
> +        /* dry run first */
> +        err = dpdk_cp_prot_configure(dev, true);
> +        if (!err) {
> +            /* if no error, apply configuration */
> +            err = dpdk_cp_prot_configure(dev, false);
> +        }
> +        if (err) {
> +            /* no hw support, remove the extra queue & recover gracefully */
> +            err = 0;
> +            dev->requested_n_rxq -= 1;
> +            netdev_request_reconfigure(netdev);
> +        }
> +    }
>   
>       netdev_change_seq_changed(netdev);
>   
> @@ -5215,7 +5500,13 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
>       ovs_mutex_lock(&dev->mutex);
>       if (dev->type == DPDK_DEV_ETH) {
>           /* TODO: Check if we able to offload some minimal flow. */
> -        ret = true;
> +        if (dev->requested_cp_prot_flags || dev->cp_prot_flags) {
> +            VLOG_WARN(
> +                "%s: hw-offload is mutually exclusive with cp-protection",
> +                netdev_get_name(netdev));
> +        } else {
> +            ret = true;
> +        }
>       }
>       ovs_mutex_unlock(&dev->mutex);
>   out:
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 36388e3c42d7..7e6ae3df7583 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3430,6 +3430,32 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>           <p>This option may only be used with dpdk VF representors.</p>
>         </column>
>   
> +      <column name="options" key="cp-protection"
> +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
> +        <p>
> +          Allocate an extra Rx queue for control plane packets of the specified
> +          protocol(s).
> +        </p>
> +        <p>
> +          If the user has already configured multiple Rx queues with
> +          <code>options:n_rxq</code> on the port, an additional one will be
> +          allocated for control plane packets. If the hardware cannot satisfy
> +          the requested number of Rx queues, the last Rx queue will
> +          be assigned for control plane. If only one Rx queue is available or
> +          if the hardware does not support the RTE flow matchers/actions
> +          required to redirect the selected protocols,
> +          <code>cp-protection</code> will be disabled.
> +        </p>
> +        <p>
> +          This feature is mutually exclusive with
> +          <code>other_config:hw-offload</code> as it may conflict with the
> +          offloaded RTE flows.
> +        </p>
> +        <p>
> +          Disabled by default.
> +        </p>
> +      </column>
> +
>         <column name="other_config" key="tx-steering"
>                 type='{"type": "string",
>                        "enum": ["set", ["thread", "hash"]]}'>




* Re: [PATCH v3] netdev-dpdk: add control plane protection support
  2022-11-11 18:15 ` [PATCH v3] netdev-dpdk: add control plane protection support Kevin Traynor
@ 2022-11-14  9:41   ` Kevin Traynor
  2022-11-14 12:46   ` Robin Jarry
  1 sibling, 0 replies; 3+ messages in thread
From: Kevin Traynor @ 2022-11-14  9:41 UTC (permalink / raw)
  To: Robin Jarry, dev, dev; +Cc: Christophe Fontaine, Ori Kam

Fixing To: to add the OVS ML back in.

+cc Ori, as the mlx5 behaviour is discussed below.

On 11/11/2022 18:15, Kevin Traynor wrote:
> [full review and patch quoted verbatim; snipped - see the previous message in
> this thread]



* Re: [PATCH v3] netdev-dpdk: add control plane protection support
  2022-11-11 18:15 ` [PATCH v3] netdev-dpdk: add control plane protection support Kevin Traynor
  2022-11-14  9:41   ` Kevin Traynor
@ 2022-11-14 12:46   ` Robin Jarry
  1 sibling, 0 replies; 3+ messages in thread
From: Robin Jarry @ 2022-11-14 12:46 UTC (permalink / raw)
  To: Kevin Traynor; +Cc: Christophe Fontaine, dev, Ori Kam, dev

Hi Kevin,

Kevin Traynor, Nov 11, 2022 at 19:15:
> > +    reta_conf_size = (dev->reta_size / RTE_ETH_RETA_GROUP_SIZE)
> > +        * sizeof(struct rte_eth_rss_reta_entry64);
>
> In dpdk_eth_dev_init, we get reta_size from driver,
>
> mlx5_ethdev.c:333
>
>         info->reta_size = priv->reta_idx_n ?
>                 priv->reta_idx_n : config->ind_table_max_size;
>
> (gdb) p priv->reta_idx_n
> $5 = 1
> (gdb) p config->ind_table_max_size
> $6 = 512
>
> and store:
> dev->reta_size = info.reta_size;
>
> Now when we use it here, dev->reta_size (1) / RTE_ETH_RETA_GROUP_SIZE (64)
> truncates to 0, so reta_conf_size = 0.
>
> > +    reta_conf = xmalloc(reta_conf_size);
>
> xmalloc only allocates 1 byte (void *p = malloc(size ? size : 1);)

Hmm indeed this is bad :)

Since reworking the RSS fallback I may not have re-tested with CX-5,
only with Intel NICs.

I am quite surprised by this return value. Could it be an issue with the
mlx5 dpdk driver?
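
Whatever the driver is doing, it could be worth logging the value where it is
stored in dpdk_eth_dev_init so that this kind of surprise is visible - just
a sketch on top of the patch:

    dev->reta_size = info.reta_size;
    VLOG_DBG("%s: RSS redirection table size reported by the driver: %"PRIu16,
        netdev_get_name(&dev->up), dev->reta_size);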

Thanks for testing!


