From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pascal Mazon
To: Ophir Munk, dev@dpdk.org
Cc: Thomas Monjalon, Olga Shern
Date: Tue, 5 Dec 2017 08:53:35 +0100
Message-ID: <49b99e45-f898-3e33-b435-8803ce80e1ab@6wind.com>
In-Reply-To: <1512028870-13597-1-git-send-email-ophirmu@mellanox.com>
References: <1512028870-13597-1-git-send-email-ophirmu@mellanox.com>
Subject: Re: [dpdk-dev] [RFC 1/2] net/tap: add eBPF to TAP device

Hi Ophir,

I wrote that doc (rte_flow_tap) a while ago (10+ months); it is no longer
accurate and would very much need updating. I might have written some of the
code you propose, but I definitely didn't see this patch before you sent it,
so you shouldn't use my sign-off. That goes for the second patch in the
series, too.

On 30/11/2017 09:01, Ophir Munk wrote:
> The DPDK traffic classifier is the rte_flow API, and the tap PMD
> must support it, including RSS queue mapping actions.
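
For context, the RSS action the PMD has to honor looks roughly like this
(17.11-era ``rte_flow.h`` layout, quoted from memory -- double-check against
the tree you target; your own code below reads ``rss->num`` and
``rss->queue[0]``, which matches):

    struct rte_flow_action_rss {
            const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
            uint16_t num; /**< Number of entries in queue[]. */
            uint16_t queue[]; /**< Queue indices to use. */
    };
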
> An example usage for this requirement is failsafe transparently
> switching from a PCI device to a TAP device while the RSS queues are
> the same on both devices.
>
> TC was chosen as the TAP classifier, but TC alone does not support RSS
> queue mapping. This commit uses a combination of TC rules and eBPF
> actions in order to support TAP RSS.
>
> eBPF requires Linux kernel 3.19 or newer and is effective only when
> running with such a kernel. It must be compiled with the appropriate
> Linux kernel headers. If the kernel headers do not include the eBPF
> definitions, a warning is issued at compile time and TAP RSS is not
> supported.
>
> Signed-off-by: Pascal Mazon
> Signed-off-by: Ophir Munk
> ---
>
> The DPDK traffic classifier is the rte_flow API, and the tap PMD
> must support it, including RSS queue mapping actions.
> An example usage for this requirement is failsafe transparently
> switching from a PCI device to a TAP device while the RSS queues are
> the same on both devices.
> TC was chosen as the TAP classifier, but TC alone does not support RSS
> queue mapping. This RFC suggests using a combination of TC rules and
> eBPF actions in order to support TAP RSS.
> eBPF requires Linux kernel 3.19 or newer and is effective only when
> running with such a kernel. It must be compiled with the appropriate
> Linux kernel headers. If the kernel headers do not include the eBPF
> definitions, a warning is issued at compile time and TAP RSS is not
> supported.
> The C source file (tap_bpf_insns.c) includes the eBPF "assembly
> instructions" in the form of an array of struct bpf_insns.
> This array is passed to the kernel for execution via the BPF system
> call.
> The C source file (tap_bpf_program.c), from which the "assembly
> instructions" were generated, is included in the TAP source tree;
> however, it does not take part in the DPDK compilation.
> The TAP documentation will detail the process of generating the eBPF
> "assembly instructions".
>
> eBPF programs controlled from the tap PMD will be used to match
> packets, compute a hash given the configured key, and send packets to
> the desired queue.
> In an eBPF program, it is typically not possible to edit the
> queue_mapping field in the skb to direct the packet to the correct
> queue. That part would be addressed by chaining a
> ``skbedit queue_mapping`` action.
>
> A packet would go through these TC rules (on the local side of the tap
> netdevice):
>
> +-----+---------------------------+----------------------------------+----------+
> |PRIO | Match                     | Action 1                         | Action 2 |
> +=====+===========================+==================================+==========+
> |  1  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
> +-----+---------------------------+----------------------------------+----------+
> |  2  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
> +-----+---------------------------+----------------------------------+----------+
> | ... |                           |                                  |          |
> +-----+---------------------------+----------------------------------+----------+
> |  x  | ANY                       | BPF: append NULL 32bits for hash |          |
> +-----+---------------------------+----------------------------------+----------+
> |x + 1| ACTUAL FLOW RULE 1 MATCH  | ...                              |          |
> +-----+---------------------------+----------------------------------+----------+
> |x + 2| ACTUAL FLOW RULE 2 MATCH  | ...                              |          |
> +-----+---------------------------+----------------------------------+----------+
> | ... |                           |                                  |          |
> +-----+---------------------------+----------------------------------+----------+
> |  y  | FLOW RULE RSS 1 MATCH     | BPF compute hash into packet     |reclassify|
> |     |                           | tailroom && set queue in skb->cb |          |
> +-----+---------------------------+----------------------------------+----------+
> |y + 1| FLOW RULE RSS 2 MATCH     | BPF compute hash into packet     |reclassify|
> |     |                           | tailroom && set queue in skb->cb |          |
> +-----+---------------------------+----------------------------------+----------+
> | ... |                           |                                  |          |
> +-----+---------------------------+----------------------------------+----------+
> |  z  | ANY (default RSS)         | BPF compute hash into packet     |reclassify|
> |     |                           | tailroom && set queue in skb->cb |          |
> +-----+---------------------------+----------------------------------+----------+
> |  z  | ANY (isolate mode)        | DROP                             |          |
> +-----+---------------------------+----------------------------------+----------+
>
> Rules 1..x match marked packets and redirect them to their queues.
> However, on first classification packets are not yet marked, so they
> are not redirected. Only when they later go through the RSS rules y..z
> does BPF compute the RSS hash, set the queue in skb->cb, and
> reclassify the packets. The packets then go through rules 1..x again,
> this time marked, and are redirected.
> Rules (x+1)..y are the non-RSS TC rules already used in DPDK versions
> prior to 18.02.
>
> doc/guides/prog_guide/rte_flow_tap.rst | 962 +++++++++++++++++++++++++++++++++
> drivers/net/tap/Makefile | 6 +-
> drivers/net/tap/rte_eth_tap.h | 7 +-
> drivers/net/tap/tap_bpf_elf.h | 56 ++
> drivers/net/tap/tap_flow.c | 336 ++++++++----
> 5 files changed, 1263 insertions(+), 104 deletions(-)
> create mode 100644 doc/guides/prog_guide/rte_flow_tap.rst
> create mode 100644 drivers/net/tap/tap_bpf_elf.h
>
> diff --git a/doc/guides/prog_guide/rte_flow_tap.rst b/doc/guides/prog_guide/rte_flow_tap.rst
> new file mode 100644
> index 0000000..04ddda6
> --- /dev/null
> +++ b/doc/guides/prog_guide/rte_flow_tap.rst
> @@ -0,0 +1,962 @@
> +=====================================
> +Flow API support in TAP PMD, using TC
> +=====================================
> +
> +.. contents::
> +.. sectnum::
> +
> +.. footer::
> +
> +   v0.8 - page ###Page###
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +Rationale
> +=========
> +
> +For this project, the tap PMD has to receive selected traffic from a different
> +netdevice (refer to the *VM migration with Microsoft Hyper-V and Mellanox
> +ConnectX-3* document) and only cover the same set of rules as supported by the
> +mlx4 PMD.
> +
> +The DPDK traffic classifier is the rte_flow API, and the tap PMD must therefore
> +implement it. For that, TC was chosen for several reasons:
> +
> +- it happens very early in the kernel stack for ingress (faster than netfilter).
> +- it supports dropping packets given a specific flow.
> +- it supports redirecting packets to a different netdevice.
> +- it has a "flower" classifier type that covers most of the pattern items in
> +  rte_flow.
> +- it can be configured through a netlink socket, without an external tool.
> +
> +Modes of operation
> +==================
> +
> +There should be two modes of operation for the tap PMD regarding rte_flow:
> +*local* and *remote*. Only one mode can be in use at a time for a specific tap
> +interface.
> +
> +The *local* mode would be the default one, if no specific parameter is specified
> +in the command line. 
To start the application with tap in *remote* mode, set the > +``remote`` tap parameter to the interface you want to redirect packets from, > +e.g.:: > + > + testpmd -n 4 -c 0xf -m 1024 --vdev=net_tap,iface=tap0,remote=eth3 -- \ > + -i --burst=64 --coremask=0x2 > + > +*Local* mode > +------------ > + > +In *local* mode, flow rules would be applied as-is, on the tap netdevice itself > +(e.g.: ``tap0``). > + > +The typical use-case is having a linux program (e.g. a webserver) communicating > +with the DPDK app through the tap netdevice:: > + > + +-------------------------+ > + | DPDK application | > + +-------------------------+ > + | ^ > + | rte_flow rte_flow | > + v egress ingress | > + +-------------------------+ > + | Tap PMD | > + +-------------------------+ > + | ^ > + | TC TC | > + v ingress egress | > + +-------------------------+ +-------------------------+ > + | |<-------------| | > + | Tap netdevice (tap0) | | Linux app (webserver) | > + | |------------->| | > + +-------------------------+ +-------------------------+ > + > +.. raw:: pdf > + > + PageBreak > + > +*Remote* mode > +------------- > + > +In *remote* mode, flow rules would be applied on the tap netdevice (e.g.: > +``tap0``), and use a similar match to redirect specific packets from another > +netdevice (e.g.: ``eth3``, a NetVSC netdevice in our project scenario):: > + > + +-------------------------+ > + | DPDK application | > + +-------------------------+ > + | ^ > + | rte_flow rte_flow | > + v egress ingress | > + +-------------------------+ > + | Tap PMD | > + +-------------------------+ > + | ^ > + | TC TC | > + v ingress egress | > + +-------------------------+ +-------------------------+ > + | |<------------------redirection-------\ | > + | Tap netdevice (tap0) | | | | > + | |------------->|-\ eth3 | | > + +-------------------------+ +--|--------------------|-+ > + | TC TC ^ > + | egress ingress | > + v | > + > +.. raw:: pdf > + > + PageBreak > + > +rte_flow rules conversion > +========================= > + > +Netlink > +------- > + > +The only way to create TC rules in the kernel is through netlink messages. > +Two possibilities arise for managing TC rules: > + > +- Using native netlink API calls in the tap PMD > +- Calling the ``tc`` command from iproute2 inside our PMD, via ``system()``. > + > +The former will be done, as library calls are faster than changing context and > +executing an external program from within the tap PMD. Moreover, the kernel TC > +API might propose features not yet implemented in iproute2. Furthermore, a > +custom implementation enables finer tuning and better control. > + > +.. > + Some implementations for TC configuration through Netlink exist already. It's a > + good source of inspiration on how to do it: > + > + - iproute2's tc `source code`__ > + - ovs's tc implementation__ (not yet upstream) > + > + __ https://github.com/shemminger/iproute2/tree/master/tc > + __ https://mail.openvswitch.org/pipermail/ovs-dev/2016-November/324693.html > + > +Conversion examples > +------------------- > + > +Here are a few examples of rules and how they can be translated from rte_flow > +rules to TC rules. rte_flow rules will be expressed using testpmd's ``flow`` > +command syntax, while TC rules will use iproute2 ``tc`` command syntax. > + > +**Notes**: > + - rte_flow ``ingress`` direction can be translated into a TC ``egress`` rule, > + and vice versa, when it applies to a tap interface, as TC considers the > + kernel netdevice standpoint. 
> + - in TC, redirecting a packet works by taking a packet from ``ingress`` and > + sending to another device's ``egress``. > + > +*Local* mode > +~~~~~~~~~~~~ > + > +#. Flow rule to give packets coming on the ``tap0`` interface to RX queue 0: > + > + Using rte_flow:: > + > + flow validate 0 ingress pattern port index is 0 / end \ > + actions queue index 0 / end > + > + Using ``tc``:: > + > + tc filter add dev tap0 parent 1: flower indev tap0 \ > + action skbedit queue_mapping 0 > + > +#. Flow rule to get packets with source mac ``de:ad:ca:fe:00:02`` on RX queue 2: > + > + Using rte_flow:: > + > + flow create 0 ingress pattern eth src is de:ad:ca:fe:00:02 / end \ > + actions queue 2 / end > + > + Using ``tc``:: > + > + tc filter add dev tap0 parent 1: flower src_mac de:ad:ca:fe:00:02 \ > + action skbedit queue_mapping 2 > + > +#. Flow rule to drop packets matching specific 5-tuple info: > + > + Using rte_flow:: > + > + flow create 0 ingress pattern eth dst is 3a:80:ce:61:36:54 \ > + src is 52:43:7b:fd:ac:f3 / ipv4 src is 1.1.1.1 dst is 2.2.2.2 \ > + / udp src is 4444 dst is 5555 / end actions drop / end > + > + Using ``tc``:: > + > + tc filter add dev tap0 parent 1: flower dst_mac 3a:80:ce:61:36:54 \ > + src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \ > + ip_proto udp src_port 4444 dst_port 5555 action drop > + > +*Remote* mode > +~~~~~~~~~~~~~ > + > +In *remote* mode, an additional rule for redirecting packet is systematically > +required. The examples are similar to the previous section (the rte_flow rule > +will thus be omitted). > + > +#. TC rules to give packets coming on the ``eth3`` interface to ``tap0`` RX > + queue 0:: > + > + # redirection rule > + tc filter add dev eth3 parent ffff: flower indev eth3 \ > + action mirred egress redirect dev tap0 > + # actual tap rule > + tc filter add dev tap0 parent 1: flower indev tap0 \ > + action skbedit queue_mapping 0 > + > +#. TC rules to get packets with source mac ``de:ad:ca:fe:00:02`` on RX queue 2:: > + > + # redirection rule > + tc filter add dev eth3 parent ffff: flower src_mac de:ad:ca:fe:00:02 \ > + action mirred egress redirect dev tap0 > + # actual tap rule > + tc filter add dev tap0 parent 1: flower src_mac de:ad:ca:fe:00:02 \ > + action skbedit queue_mapping 2 > + > +#. TC rules to drop packets matching specific 5-tuple info:: > + > + # redirection rule > + tc filter add dev eth3 parent ffff: flower dst_mac 3a:80:ce:61:36:54 \ > + src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \ > + ip_proto udp src_port 4444 dst_port 5555 \ > + action mirred egress redirect dev tap0 > + # actual tap rule > + tc filter add dev tap0 parent 1: flower dst_mac 3a:80:ce:61:36:54 \ > + src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \ > + ip_proto udp src_port 4444 dst_port 5555 action drop > + > +One last thing, to redirect packets the other way around (from ``tap0`` to > +``eth3``), we would use a similar rule, exchanging interfaces and using an > +appropriate match, e.g.:: > + > + tc filter add dev tap0 parent ffff: flower indev tap0 \ > + action mirred egress redirect dev eth3 > + > +.. > + **Note:** ``parent ffff:`` is for TC ``ingress`` while ``parent 1:`` is for TC > + ``egress``. > + > +Broadcast and promiscuous support > ++++++++++++++++++++++++++++++++++ > + > +*Remote* mode requirements: > + > +#. When turning the tap netdevice promiscuous, the remote netdevice should > + implicitly be turned promiscuous too, to get as many packets as possible. > + > +#. 
Packets matching the destination MAC configured in the tap PMD should be > + redirected from the remote without being processed by the stack there in the > + kernel. > + > +#. In promiscuous mode, an incoming packet should be duplicated to be processed > + both by the tap PMD and the remote netdevice itself. > + > +#. Incoming packets with broadcast destination MAC (i.e.: ``ff:ff:ff:ff:ff:ff``) > + should be duplicated to be processed both by the tap PMD and the remote > + netdevice itself. > + > +#. Incoming packets with IPv6 multicast destination MAC (i.e.: > + ``33:33:00:00:00:00/33:33:00:00:00:00``) should be duplicated to be processed > + both by the tap PMD and the remote netdevice itself. > + > +#. Incoming packets with broadcast/multicast bit set in the destination MAC > + (i.e.: ``01:00:00:00:00:00/01:00:00:00:00:00``) should be duplicated to be > + processed both by the tap PMD and the remote netdevice itself. > + > +Each of these requirements (except the first one) can be directly translated > +into a TC rule, e.g.:: > + > + # local mac (notice the REDIRECT for mirred action): > + tc filter add dev eth3 parent ffff: prio 1 flower dst_mac de:ad:be:ef:01:02 \ > + action mirred egress redirect dev tap0 > + > + # tap promisc: > + tc filter add dev eth3 parent ffff: prio 2 basic \ > + action mirred egress mirror dev tap0 > + > + # broadcast: > + tc filter add dev eth3 parent ffff: prio 3 flower dst_mac ff:ff:ff:ff:ff:ff \ > + action mirred egress mirror dev tap0 > + > + # broadcast v6 (can't express mac_mask with tc, but it works via netlink): > + tc filter add dev eth3 parent ffff: prio 4 flower dst_mac 33:33:00:00:00:00 \ > + action mirred egress mirror dev tap0 > + > + # all_multi (can't express mac_mask with tc, but it works via netlink): > + tc filter add dev eth3 parent ffff: prio 5 flower dst_mac 01:00:00:00:00:00 \ > + action mirred egress mirror dev tap0 > + > +When promiscuous mode is switched off or on, the first TC rule will be modified > +to have respectively an empty action (``continue``) or the ``mirror`` action. > + > +The first 5 priorities are always reserved, and can only be used for these > +filters. > + > +On top of that, the tap PMD can configure explicit rte_flow rules, translated as > +TC rules on both the remote netdevice and the tap netdevice. On the remote, > +those would need to be processed after the default rules handling promiscuous > +mode, broadcast and all_multi packets. > + > +When using the ``mirror`` action, the packet is duplicated and sent to the tap > +netdevice, while the original packet gets directly processed by the kernel > +without going through later TC rules for the remote. On the tap netdevice, the > +duplicated packet will go through tap TC rules and be classified depending on > +those rules. > + > +**Note:** It is possible to combine a ``mirror`` action and a ``continue`` > +action for a single TC rule. Then the original packet would undergo remaining TC > +rules on the remote netdevice side. > + > +When using the ``redirect`` action, the behavior is similar on the tap side, but > +the packet is not duplicated, no further kernel processing is done for the > +remote side. > + > +The following diagram sums it up. 
A packet that matches a TC rule follows the
> +associated action (the number in the diamond represents the rule prio as set in
> +the above TC rules)::
> +
> +
> +     Incoming packet  |
> +    on remote (eth3)  |
> +                      |  Going through
> +                      |  TC ingress rules
> +                      v
> +                     / \
> +                    / 1 \
> +                   /     \                 yes
> +                  /  mac  \____________________> tap0
> +                  \ match?/             duplicated pkt
> +                   \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 2 \
> +    eth3    yes    /     \                 yes
> +   kernel <_______/promisc\____________________> tap0
> +   stack          \ match?/             duplicated pkt
> +  original pkt     \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 3 \
> +    eth3    yes    /     \                 yes
> +   kernel <_______/ bcast \____________________> tap0
> +   stack          \ match?/             duplicated pkt
> +  original pkt     \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 4 \
> +    eth3    yes    /     \                 yes
> +   kernel <_______/ bcast6\____________________> tap0
> +   stack          \ match?/             duplicated pkt
> +  original pkt     \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 5 \
> +                   / all \
> +    eth3    yes   /       \                yes
> +   kernel <______/  multi  \___________________> tap0
> +   stack         \  match? /            duplicated pkt
> +  original pkt    \       /
> +                   \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                      |
> +                      .  remaining TC rules
> +                      .
> +    eth3              |
> +   kernel <___________/
> +   stack  original pkt
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +Associating an rte_flow rule with a TC one
> +==========================================
> +
> +A TC rule is identified by a ``priority`` (16-bit value) and a ``handle``
> +(32-bit value). To delete a rule, the priority must be specified, and if several
> +rules have the same priority, the handle is needed to select the correct one.
> +
> +..
> +   Specifying an empty priority and handle when requesting a TC rule creation
> +   will let the kernel automatically decide what values to set. In fact, the
> +   kernel will start with a high priority (i.e. 49152), and subsequent rules
> +   will get decreasing priorities (lower priorities get evaluated first).
> +
> +To avoid further requests to the kernel to identify what priority/handle has
> +been automatically allocated, the tap PMD can set priorities and handles
> +systematically when creating a rule.
> +
> +In *local* mode, an rte_flow rule should be translated into a single TC flow
> +identified by priority+handle.
> +
> +In *remote* mode, an rte_flow rule requires two TC rules, one on the tap
> +netdevice itself (for the correct action) and another one on the other netdevice
> +where packets are redirected from. Both TC rules' priorities+handles must be
> +stored for a specific rte_flow rule, and associated with the device they are
> +applied on.
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +Considerations regarding Flow API support
> +=========================================
> +
> +Flow rule attributes
> +--------------------
> +
> +Groups and priorities:
> +  There is no native support of groups in TC. Instead, the priority field
> +  (which is part of the netlink TC message header) can be adapted. The four
> +  MSBs would be used to define the group (allowing for 16 groups), while the
> +  12 LSBs would be left to define the actual priority (up to 4096).
> +
> +  Rules with lower priorities are evaluated first. For rules with identical
> +  priorities, the one with the highest handle value gets evaluated first.
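
As an illustration of that group/priority packing -- a minimal sketch in C,
macro and function names are mine, not from the patch:

    #include <stdint.h>

    #define TAP_GROUP_SHIFT 12      /* group in the 4 MSBs */
    #define TAP_PRIO_MASK   0x0fff  /* priority in the 12 LSBs */

    /* Pack an rte_flow group and priority into the 16-bit TC priority. */
    static inline uint16_t
    tap_tc_priority(uint16_t group, uint16_t priority)
    {
            return (uint16_t)((group & 0xf) << TAP_GROUP_SHIFT) |
                   (priority & TAP_PRIO_MASK);
    }
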
> +
> +Direction:
> +  Both ingress and egress filtering can be supported.
> +
> +Meta item types
> +---------------
> +
> +Most applications will use: ``(END | VOID)``
> +
> +END, VOID:
> +  Supported without problem.
> +
> +INVERT:
> +  There is no easy way to support that in TC. It won't be supported.
> +
> +  **mlx4 will not support it either.**
> +
> +PF, VF, PORT:
> +  Not applicable to a tap netdevice.
> +
> +Data matching item types
> +------------------------
> +
> +Most applications will use:
> +``ETH / (IPV4 | IPV6 | END) / (TCP | UDP | END) / END``
> +
> +ANY:
> +  Should be supported.
> +
> +  **mlx4 will partially support it.**
> +
> +RAW:
> +  It is not planned to support it for now. Matching raw packets would require
> +  using a different classifier than "flower", which is the simplest classifier
> +  and covers most other cases. With TC, it is not possible to support both
> +  "flower" matching and raw packets in the same rule.
> +
> +  **mlx4 will not support it either.**
> +
> +VLAN:
> +  Matching VLAN ID and priority is supported.
> +  **Note: Linux v4.9 is required for VLAN support.**
> +
> +ETH, IPV4, IPV6, UDP, TCP:
> +  Matching source/destination MAC/IP/port is supported, with masks.
> +
> +  **mlx4 does not support partial bit-masks (full or zeroed only).**
> +
> +ICMP:
> +  By specifying the appropriate ether type, ICMP packets can be matched.
> +  However, there is no support for ICMP type or code.
> +
> +  **mlx4 will not support it, however.**
> +
> +SCTP:
> +  By specifying the appropriate IP protocol, SCTP packets can be matched.
> +  However, no specific SCTP fields can be matched.
> +
> +  **mlx4 will not support it, however.**
> +
> +VXLAN:
> +  VXLAN is not recognized by the "flower" classifier. Kernel-managed VXLAN
> +  traffic would come through an additional netdevice, which falls outside
> +  the scope of this project. VXLAN traffic should occur outside VMs anyway.
> +
> +Action types
> +------------
> +
> +Most applications will use: ``(VOID | END | QUEUE | DROP) / END``
> +
> +By default, multiple actions are possible for TC flow rules. However, they are
> +ordered in the kernel. The implementation will need to handle actions in a way
> +that orders them intelligently when creating them.
> +
> +VOID, END:
> +  Supported.
> +
> +PASSTHRU:
> +  The generic "continue" action can be used.
> +
> +  **mlx4 will not support it, however.**
> +
> +MARK / FLAG:
> +  The mark is a field inside an skbuff. However, the tap reads messages (mostly
> +  packet data) without that info. As an alternative, it may be possible to
> +  create a specific queue to pass packets with a specific mark. Further testing
> +  is needed to ensure it is feasible.
> +
> +QUEUE:
> +  The ``skbedit`` action with the ``queue_mapping`` option enables directing
> +  packets to specific queues.
> +
> +  Like rte_flow, specifying several ``skbedit queue_mapping`` actions in TC
> +  only considers the last one.
> +
> +DROP:
> +  The generic "drop" action can be used. Packets will effectively be dropped,
> +  and not left for the kernel to process.
> +
> +COUNT:
> +  Stats are automatically stored in the kernel. The COUNT action will thus
> +  be ignored when creating the rule. ``rte_flow_query()`` can be implemented
> +  to request a rule's stats from the kernel.
> +
> +DUP:
> +  Duplicating packets is not supported.
> +
> +RSS:
> +  There's no built-in mechanism for RSS in TC.
> +
> +  By default, incoming packets go to the tap PMD queue 0. To support RSS in
> +  software, several additional queues must be set up. Packets coming in on
> +  queue 0 can be considered as requiring RSS, and the PMD will apply software
> +  RSS (using something like ``rte_softrss()``) to select a queue for the
> +  packet.
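
As an illustration of that last point -- my sketch, not from the patch, using
``rte_softrss()`` from ``rte_thash.h``:

    #include <rte_common.h>
    #include <rte_thash.h>

    /* Select an RX queue for an IPv4/UDP packet with the software
     * Toeplitz hash; rss_key is the configured RSS key. */
    static uint16_t
    softrss_queue(uint32_t src_ip, uint32_t dst_ip, uint16_t sport,
                  uint16_t dport, const uint8_t *rss_key, uint16_t nb_queues)
    {
            uint32_t tuple[3];

            tuple[0] = src_ip;
            tuple[1] = dst_ip;
            tuple[2] = ((uint32_t)sport << 16) | dport;
            return rte_softrss(tuple, RTE_DIM(tuple), rss_key) % nb_queues;
    }
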
> +
> +PF, VF:
> +  Not applicable to a tap netdevice.
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +TC limitations for flow collision
> +=================================
> +
> +From the TC standpoint, filter rules with identical priorities do not collide
> +as long as they specify the same fields (with identical masks) in the TC
> +message and differ in at least one value.
> +
> +Unfortunately, some flows that obviously are not colliding can be considered
> +otherwise by the kernel when parsing the TC messages, and thus their creation
> +would be rejected.
> +
> +Here is a table matching TC fields with their flow API equivalents:
> +
> ++------------------------------+-----------------------------------+-----------+
> +| TC message field             | rte_flow API                      | maskable? |
> ++==============================+===================================+===========+
> +| TCA_FLOWER_KEY_ETH_DST       | eth dst                           | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_ETH_SRC       | eth src                           | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_ETH_TYPE      | eth type is 0xZZZZ ||             | no        |
> +|                              | eth / {ipv4|ipv6}                 |           |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IP_PROTO      | eth / {ipv4|ipv6} / {tcp|udp}     | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV4_SRC      | eth / ipv4 src                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV4_DST      | eth / ipv4 dst                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV6_SRC      | eth / ipv6 src                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV6_DST      | eth / ipv6 dst                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_L4_SRC        | eth / {ipv4|ipv6} / {tcp|udp} src | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_L4_DST        | eth / {ipv4|ipv6} / {tcp|udp} dst | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_VLAN_ID       | eth / vlan vid                    | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_VLAN_PRIO     | eth / vlan pcp                    | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_VLAN_ETH_TYPE | eth / vlan tpid                   | no        |
> ++------------------------------+-----------------------------------+-----------+
> +
> +When creating rules with identical priorities, one must make sure that they
> +would be translated in TC using the same fields as shown in the above table.
> +
> +The following flow rules can share the same priority, as they use the same
> +fields with identical masks under the hood::
> +
> +   > flow create 0 ingress priority 0 pattern eth / ipv4 / end
> +      actions drop / end
> +   Flow rule #0 created
> +   > flow create 0 ingress priority 0 pattern eth type is 0x86dd / end
> +      actions drop / end
> +   Flow rule #1 created
> +
> +**Note:** Both rules use ETH_TYPE (mask 0xffff) in their TC form.
> +
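
In ``tc`` terms, following the syntax of the earlier examples (priorities
illustrative, not from the patch), those two rules would look roughly like:

   tc filter add dev tap0 parent 1: prio 1 flower eth_type ip action drop
   tc filter add dev tap0 parent 1: prio 1 flower eth_type ipv6 action drop

> +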
> + > +Sadly, the following flow rules cannot share the same priority, since fields for > +matching IPv4 and IPv6 src/dst addresses are different:: > + > + > flow create 0 ingress priority 1 pattern eth / ipv4 src is 1.1.1.1 / end > + actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 1 pattern eth / ipv6 src is ::1 / end > + actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > +**Note:** First rule uses ETH_TYPE and IPV4_SRC, while the second uses ETH_TYPE > +and IPV6_SRC. > + > +It is however possible to match different IPvX addresses with the same > +priority:: > + > + > flow create 0 ingress priority 2 pattern eth / ipv4 src is 1.1.1.1 / end > + actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 2 pattern eth / ipv4 src is 2.2.2.2 / end > + actions drop / end > + Flow rule #1 created > + > +If the first rule specifies both destination and source addresses, then the > +other rule with the same priority must too (with identical masks):: > + > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 1.1.1.1 > + dst is 1.1.1.2 / end actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 / end > + actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 > + dst spec 2.2.2.3 dst mask 255.255.255.0 / end actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 > + dst is 2.2.2.3 / end actions drop / end > + Flow rule #1 created > + > +**Note:** First rule uses ETH_TYPE, IPV4_SRC and IPV4_DST (with full masks). The > +two others must also use those to share the same priority. > + > +It is possible to match TCP/UDP packets with different ports whatever the > +underlying L3, if the same fields are used (thus no l3 addresses specification). > +For instance:: > + > + > flow create 0 ingress priority 4 pattern eth / ipv4 / tcp dst is 3333 / end > + actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 4 pattern eth / ipv6 / udp dst is 4444 / end > + actions drop / end > + Flow rule #1 created > + > flow create 0 ingress priority 4 pattern eth / ipv6 / udp src is 5555 / end > + actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > +**Note:** First 2 rules use ETH_TYPE, IP_PROTO and L4_DST with different values > +but identical masks, so they're OK. Last rule used L4_SRC instead of L4_DST. > + > +.. raw:: pdf > + > + PageBreak > + > +RSS implementation for tap > +========================== > + > +There are several areas of research for a tap RSS implementation: > + > +#. userland implementation in tap PMD > +#. userland implementation in DPDK (generic) > +#. userland implementation using combination of TC rules and BPF filters/actions > +#. kernel-side implementation in tap driver > +#. kernel-side implementation as a BPF classifier/action > +#. 
kernel-side implementation as a separate TC action > + > ++--------------+------------------------------+------------------------------+ > +| | Pros | Cons | > ++==============+==============================+==============================+ > +| tap PMD | - no kernel upstreaming | - tap PMD is supposed to be | > +| | | simple, and would no longer| > +| | | be. | > +| | | | > +| | | - complex rework, with many | > +| | | rings for enqueuing packets| > +| | | to the right queue | > +| | | | > +| | | - slower | > +| | | | > +| | | - won't be accepted as it | > +| | | doesn't make sense to redo | > +| | | what the kernel did | > +| | | previously | > ++--------------+------------------------------+------------------------------+ > +| generic DPDK | - would be useful to others | - design must be compatible | > +| | | with most PMDs | > +| | | | > +| | | - probably the longest to | > +| | | develop | > +| | | | > +| | | - requires DPDK community | > +| | | approval | > +| | | | > +| | | - requires heavy changes in | > +| | | tap PMD itself anyway | > ++--------------+------------------------------+------------------------------+ > +| TC rules | - no kernel upstreaming | - BPF is complicated to learn| > +| combination | | | > +| | - fast | - Runtime BPF compilation / | > +| | | or bytecode change, would | > +| | - per-flow RSS | be tricky | > +| | | | > +| | - no change in tap PMD | - much rework in the tap PMD | > +| | datapath | to handle lots of new | > +| | | netlink messages / actions | > ++--------------+------------------------------+------------------------------+ > +| tap driver | - pretty fast as it | - might not be accepted by | > +| | intervenes early in packet | the kernel community as | > +| | RX | they may cling to their | > +| | | jhash2 hashing function for| > +| | | RX. | > +| | | | > +| | | - only a single RSS context | > ++--------------+------------------------------+------------------------------+ > +| BPF | - fast | - BPF is complicated to learn| > +| classifier - | | | > +| action | - per-flow RSS | - would require changing the | > +| | | kernel API to support | > +| | | editing queue_mapping in an| > +| | | skb | > +| | | | > +| | | - hashing would be performed | > +| | | for each queue of a | > +| | | specific RSS context | > +| | | | > +| | | - probably difficult to gain | > +| | | community acceptance | > ++--------------+------------------------------+------------------------------+ > +| TC action | - much more flexibility, with| - needs to be in sync with | > +| | per-flow RSS, multiple | iproute2's tc program | > +| | keys, multiple packet | | > +| | fields for the hash... | - kernel upstreaming is not | > +| | | necessarily easy | > +| | - it's a separate kernel | | > +| | module that can be | - rework in tap PMD to | > +| | maintained out-of-tree and | support new RSS action and | > +| | optionally upstreamed | configuration | > +| | anytime | | > +| | | | > +| | - most logical to be handled | | > +| | in kernel as RSS is | | > +| | supposed to be computed in | | > +| | the "NIC" exactly once. | | > +| | | | > +| | - fastest | | > +| | | | > +| | - no change in tap PMD | | > +| | datapath | | > ++--------------+------------------------------+------------------------------+ > + > +TC rules using BPF from tap PMD > +------------------------------- > + > +The third solution is the best for userland-based solutions. 
> +It does the job well, fast (datapath running in kernel), is logically happening > +in the kernel in runtime, supports flow-based RSS, has the best potential to > +be accepted by the community. > + > +Advantages with this solution: > +- hash can be recorded in the packet data and read in tap PMD > +- no kernel customization, everything in DPDK > +- packet gets in tap PMD on the correct queue directly > + > +Drawbacks: > +- complicates tap PMD a lot: > + - 3 BPF programs > + - new implicit rules > + - new action and filter support > + - packet stripping > +- numerous TC rules required (in proportion with the number of queues) > +- fast (kernel + JIT BPF), but several TC rules must be crossed > + > +BPF programs controlled from tap PMD will be used to match packets, compute a > +hash given the configured key, and send packets to tap using the desired queue. > + > +Design > +~~~~~~ > + > +BPF has a limited set of functions for editing the skb in TC. They are listed > +in ``linux/net/core/filter.c:tc_cls_act_func_proto()``: > + > +- skb_store_bytes > +- skb_load_bytes > +- skb_pull_data > +- csum_diff > +- csum_update > +- l3_csum_replace > +- l4_csum_replace > +- clone_redirect > +- get_cgroup_classid > +- skb_vlan_push > +- skb_vlan_pop > +- skb_change_proto > +- skb_change_type > +- skb_change_tail > +- skb_get_tunnel_key > +- skb_set_tunnel_key > +- skb_get_tunnel_opt > +- skb_set_tunnel_opt > +- redirect > +- get_route_realm > +- get_hash_recalc > +- set_hash_invalid > +- perf_event_output > +- get_smp_processor_id > +- skb_under_cgroup > + > +In a BPF program, it is typically not possible to edit the queue_mapping field > +to direct the packet in the correct queue. That part would be done by chaining a > +``skbedit queue_mapping`` action. > + > +It is not possible either to directly prepend data to a packet (appending works, > +though). > + > +A packet would go through these rules (on the local side of the tap netdevice): > + > ++-----+---------------------------+----------------------------------+----------+ > +|PRIO | Match | Action 1 | Action 2 | > ++=====+===========================+==================================+==========+ > +| 1 | marked? | skbedit queue 'mark' --> DPDK | | > ++-----+---------------------------+----------------------------------+----------+ > +| 2 | marked? | skbedit queue 'mark' --> DPDK | | > ++-----+---------------------------+----------------------------------+----------+ > +| ... | | | | > ++-----+---------------------------+----------------------------------+----------+ > +| x | ANY | BPF: append NULL 32bits for hash | | > +| | | | | > ++-----+---------------------------+----------------------------------+----------+ > +|x + 1| ACTUAL FLOW RULE 1 MATCH | ... | | > +| | | | | > ++-----+---------------------------+----------------------------------+----------+ > +|x + 2| ACTUAL FLOW RULE 2 MATCH | ... | | > +| | | | | > ++-----+---------------------------+----------------------------------+----------+ > +| ... | | | | > ++-----+---------------------------+----------------------------------+----------+ > +| y | FLOW RULE RSS 1 MATCH | BPF compute hash into packet |reclassify| > +| | | tailroom && set queue in skb->cb | | > ++-----+---------------------------+----------------------------------+----------+ > +|y + 1| FLOW RULE RSS 2 MATCH | BPF compute hash into packet |reclassify| > +| | | tailroom && set queue in skb->cb | | > ++-----+---------------------------+----------------------------------+----------+ > +| ... 
| | | | > ++-----+---------------------------+----------------------------------+----------+ > +| z | ANY (default RSS) | BPF compute hash into packet |reclassify| > +| | | tailroom && set queue in skb->cb | | > ++-----+---------------------------+----------------------------------+----------+ > +| z | ANY (isolate mode) | DROP | | > ++-----+---------------------------+----------------------------------+----------+ > + > + > + > +TC kernel action > +---------------- > + > +The latest solution (implementing a TC action) would probably be the most simple > +to implement. It is also very flexible, opening more possibilities for filtering > +and RSS combined. > + > +For this solution, the following parameters could be used to configure RSS in a > +TC netlink message: > + > +``queues`` (u16 \*): > + list of queues to spread incoming traffic on. That's actually the reta. > + **Note:** the queue in an ``skb`` is on 16-bits, hence the type here. > + > +``key`` (u8 \*): > + key to use for the Toeplitz-hash in this flow. > + > +``hash_fields`` (bitfield): > + similar to what's in DPDK, the bitfield should determine what fields in the > + packet header to use for hashing. It is likely another means of configuring > + which fields to pick would be used actually. > + > +``algo`` (unsigned): > + an enum value from the kernel act_rss header can be used to determine which > + algorithm (implemented in the kernel) to use. Possible algos could be > + toeplitz, xor, symmetric hash... > + > +**Note:** The number of queues to use is automatically deduced from the > +``queues`` netlink attribute length. The ``key`` length can be similarly > +obtained. > + > +.. raw:: pdf > + > + PageBreak > + > +Appendix: TC netlink message > +============================ > + > +**Note:** For deterministic behavior, TC queueing disciplines (QDISC), filters > +and classes must be flushed before starting to apply TC rules. There is a little > +bit of boilerplate (with specific netlink messages) to ensure TC rules can be > +applied. Typically, the TC ``ingress`` QDISC must be created first. > + > +For information, netlink messages regarding TC will look like this:: > + > + 0 8 16 24 32 > + +----------+----------+----------+----------+ --- > + 0 | Length | \ > + +---------------------+---------------------+ \ > + 4 | Type | Flags | | > + +----------- ---------+---------------------+ >-- struct > + 8 | Sequence number | | nlmsghdr > + +-------------------------------------------+ / > + 12 | Process Port ID (PID) | / > + +==========+==========+==========+==========+ --- > + 16 | Family | Rsvd1 | Reserved2 | \ > + +----------+----------+---------------------+ \ > + 20 | Interface index | | > + +-------------------------------------------+ | > + 24 | Handle | | > + +-------------------------------------------+ >-- struct > + 28 | Parent handle | | tcmsg > + | MAJOR + MINOR | | > + +-------------------------------------------+ | > + 32 | TCM info | / > + | priority + protocol | / > + +===========================================+ --- > + | | > + | Payload | > + | | > + ........................................ 
> + | | > + | | > + +-------------------------------------------+ > diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile > index 405b49e..9afae5e 100644 > --- a/drivers/net/tap/Makefile > +++ b/drivers/net/tap/Makefile > @@ -39,6 +39,9 @@ EXPORT_MAP := rte_pmd_tap_version.map > > LIBABIVER := 1 > > +# TAP_MAX_QUEUES must be a power of 2 as it will be used for masking */ > +TAP_MAX_QUEUES = 16 > + > CFLAGS += -O3 > CFLAGS += -I$(SRCDIR) > CFLAGS += -I. > @@ -47,6 +50,8 @@ LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring > LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash > LDLIBS += -lrte_bus_vdev > > +CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES) > + > # > # all source are stored in SRCS-y > # > @@ -89,7 +94,6 @@ tap_autoconf.h: tap_autoconf.h.new > mv '$<' '$@' > > $(SRCS-$(CONFIG_RTE_LIBRTE_PMD_TAP):.c=.o): tap_autoconf.h > - > clean_tap: FORCE > $Q rm -f -- tap_autoconf.h tap_autoconf.h.new > > diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h > index 829f32f..01ac153 100644 > --- a/drivers/net/tap/rte_eth_tap.h > +++ b/drivers/net/tap/rte_eth_tap.h > @@ -45,7 +45,7 @@ > #include > > #ifdef IFF_MULTI_QUEUE > -#define RTE_PMD_TAP_MAX_QUEUES 16 > +#define RTE_PMD_TAP_MAX_QUEUES TAP_MAX_QUEUES > #else > #define RTE_PMD_TAP_MAX_QUEUES 1 > #endif > @@ -90,6 +90,11 @@ struct pmd_internals { > int ioctl_sock; /* socket for ioctl calls */ > int nlsk_fd; /* Netlink socket fd */ > int flow_isolate; /* 1 if flow isolation is enabled */ > + int flower_support; /* 1 if kernel supports, else 0 */ > + int flower_vlan_support; /* 1 if kernel supports, else 0 */ > + int rss_enabled; /* 1 if RSS is enabled, else 0 */ > + /* implicit rules set when RSS is enabled */ > + LIST_HEAD(tap_rss_flows, rte_flow) rss_flows; > LIST_HEAD(tap_flows, rte_flow) flows; /* rte_flow rules */ > /* implicit rte_flow rules set when a remote device is active */ > LIST_HEAD(tap_implicit_flows, rte_flow) implicit_flows; > diff --git a/drivers/net/tap/tap_bpf_elf.h b/drivers/net/tap/tap_bpf_elf.h > new file mode 100644 > index 0000000..f3db1bf > --- /dev/null > +++ b/drivers/net/tap/tap_bpf_elf.h > @@ -0,0 +1,56 @@ > +/******************************************************************************* > + > + Copyright (C) 2015 Daniel Borkmann > + > + Copied from iproute2's include/bpf_elf.h, available at: > + https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git > + > + This file is licensed under GNU General Public License (GPL) v2. > + > + The full GNU General Public License is included in this distribution in > + the file called "LICENSE.GPL". > + > +*******************************************************************************/ > + > + > +#ifndef __BPF_ELF__ > +#define __BPF_ELF__ > + > +#include > + > +/* Note: > + * > + * Below ELF section names and bpf_elf_map structure definition > + * are not (!) kernel ABI. It's rather a "contract" between the > + * application and the BPF loader in tc. For compatibility, the > + * section names should stay as-is. Introduction of aliases, if > + * needed, are a possibility, though. 
> + */ > + > +/* ELF section names, etc */ > +#define ELF_SECTION_LICENSE "license" > +#define ELF_SECTION_MAPS "maps" > +#define ELF_SECTION_PROG "prog" > +#define ELF_SECTION_CLASSIFIER "classifier" > +#define ELF_SECTION_ACTION "action" > + > +#define ELF_MAX_MAPS 64 > +#define ELF_MAX_LICENSE_LEN 128 > + > +/* Object pinning settings */ > +#define PIN_NONE 0 > +#define PIN_OBJECT_NS 1 > +#define PIN_GLOBAL_NS 2 > + > +/* ELF map definition */ > +struct bpf_elf_map { > + __u32 type; > + __u32 size_key; > + __u32 size_value; > + __u32 max_elem; > + __u32 flags; > + __u32 id; > + __u32 pinning; > +}; > + > +#endif /* __BPF_ELF__ */ > diff --git a/drivers/net/tap/tap_flow.c b/drivers/net/tap/tap_flow.c > index ffc0b85..43bab7d 100644 > --- a/drivers/net/tap/tap_flow.c > +++ b/drivers/net/tap/tap_flow.c > @@ -43,6 +43,9 @@ > #include > #include > > +#include > +#include > + > #ifndef HAVE_TC_FLOWER > /* > * For kernels < 4.2, this enum is not defined. Runtime checks will be made to > @@ -104,6 +107,23 @@ struct remote_rule { > int mirred; > }; > > +struct action_data { > + char id[16]; > + > + union { > + struct tc_gact gact; > + struct tc_mirred mirred; > + struct skbedit { > + struct tc_skbedit skbedit; > + uint16_t queue; > + } skbedit; > + struct bpf { > + int bpf_fd; > + char *annotation; > + } bpf; > + }; > +}; > + > static int tap_flow_create_eth(const struct rte_flow_item *item, void *data); > static int tap_flow_create_vlan(const struct rte_flow_item *item, void *data); > static int tap_flow_create_ipv4(const struct rte_flow_item *item, void *data); > @@ -134,6 +154,8 @@ struct remote_rule { > int set, > struct rte_flow_error *error); > > +static int rss_enable(struct pmd_internals *pmd); > + > static const struct rte_flow_ops tap_flow_ops = { > .validate = tap_flow_validate, > .create = tap_flow_create, > @@ -816,111 +838,64 @@ struct tap_flow_items { > } > > /** > - * Transform a DROP/PASSTHRU action item in the provided flow for TC. > - * > - * @param[in, out] flow > - * Flow to be filled. > - * @param[in] action > - * Appropriate action to be set in the TCA_GACT_PARMS structure. > - * > - * @return > - * 0 if checks are alright, -1 otherwise. > + * FIXME > */ > static int > -add_action_gact(struct rte_flow *flow, int action) > +add_action(struct rte_flow *flow, size_t *act_index, struct action_data *adata) > { > struct nlmsg *msg = &flow->msg; > - size_t act_index = 1; > - struct tc_gact p = { > - .action = action > - }; > > - if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0) > - return -1; > - if (nlattr_nested_start(msg, act_index++) < 0) > + if (nlattr_nested_start(msg, ++(*act_index)) < 0) > return -1; > - nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("gact"), "gact"); > - if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0) > - return -1; > - nlattr_add(&msg->nh, TCA_GACT_PARMS, sizeof(p), &p); > - nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > - nlattr_nested_finish(msg); /* nested act_index */ > - nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > - return 0; > -} > - > -/** > - * Transform a MIRRED action item in the provided flow for TC. > - * > - * @param[in, out] flow > - * Flow to be filled. > - * @param[in] ifindex > - * Netdevice ifindex, where to mirror/redirect packet to. > - * @param[in] action_type > - * Either TCA_EGRESS_REDIR for redirection or TCA_EGRESS_MIRROR for mirroring. > - * > - * @return > - * 0 if checks are alright, -1 otherwise. 
> - */ > -static int > -add_action_mirred(struct rte_flow *flow, uint16_t ifindex, uint16_t action_type) > -{ > - struct nlmsg *msg = &flow->msg; > - size_t act_index = 1; > - struct tc_mirred p = { > - .eaction = action_type, > - .ifindex = ifindex, > - }; > > - if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0) > - return -1; > - if (nlattr_nested_start(msg, act_index++) < 0) > - return -1; > - nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("mirred"), "mirred"); > + nlattr_add(&msg->nh, TCA_ACT_KIND, strlen(adata->id), adata->id); > if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0) > return -1; > - if (action_type == TCA_EGRESS_MIRROR) > - p.action = TC_ACT_PIPE; > - else /* REDIRECT */ > - p.action = TC_ACT_STOLEN; > - nlattr_add(&msg->nh, TCA_MIRRED_PARMS, sizeof(p), &p); > + if (strcmp("gact", adata->id) == 0) { > + nlattr_add(&msg->nh, TCA_GACT_PARMS, sizeof(adata->gact), > + &adata->gact); > + } else if (strcmp("mirred", adata->id) == 0) { > + if (adata->mirred.eaction == TCA_EGRESS_MIRROR) > + adata->mirred.action = TC_ACT_PIPE; > + else /* REDIRECT */ > + adata->mirred.action = TC_ACT_STOLEN; > + nlattr_add(&msg->nh, TCA_MIRRED_PARMS, sizeof(adata->mirred), > + &adata->mirred); > + } else if (strcmp("skbedit", adata->id) == 0) { > + nlattr_add(&msg->nh, TCA_SKBEDIT_PARMS, > + sizeof(adata->skbedit.skbedit), > + &adata->skbedit.skbedit); > + nlattr_add16(&msg->nh, TCA_SKBEDIT_QUEUE_MAPPING, > + adata->skbedit.queue); > + } else if (strcmp("bpf", adata->id) == 0) { > + nlattr_add32(&msg->nh, TCA_ACT_BPF_FD, adata->bpf.bpf_fd); > + nlattr_add(&msg->nh, TCA_ACT_BPF_NAME, > + strlen(adata->bpf.annotation), > + adata->bpf.annotation); > + } else { > + return -1; > + } > nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > nlattr_nested_finish(msg); /* nested act_index */ > - nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > return 0; > } > > /** > - * Transform a QUEUE action item in the provided flow for TC. > - * > - * @param[in, out] flow > - * Flow to be filled. > - * @param[in] queue > - * Queue id to use. > - * > - * @return > - * 0 if checks are alright, -1 otherwise. 
> + * FIXME > */ > static int > -add_action_skbedit(struct rte_flow *flow, uint16_t queue) > +add_actions(struct rte_flow *flow, int nb_actions, struct action_data *data, > + int classifier_action) > { > struct nlmsg *msg = &flow->msg; > - size_t act_index = 1; > - struct tc_skbedit p = { > - .action = TC_ACT_PIPE > - }; > + size_t act_index = 0; > + int i; > > - if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0) > - return -1; > - if (nlattr_nested_start(msg, act_index++) < 0) > + if (nlattr_nested_start(msg, classifier_action) < 0) > return -1; > - nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("skbedit"), "skbedit"); > - if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0) > - return -1; > - nlattr_add(&msg->nh, TCA_SKBEDIT_PARMS, sizeof(p), &p); > - nlattr_add16(&msg->nh, TCA_SKBEDIT_QUEUE_MAPPING, queue); > - nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > - nlattr_nested_finish(msg); /* nested act_index */ > + for (i = 0; i < nb_actions; i++) > + if (add_action(flow, &act_index, data + i) < 0) > + return -1; > nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > return 0; > } > @@ -1053,7 +1028,12 @@ struct tap_flow_items { > } > } > if (mirred && flow) { > - uint16_t if_index = pmd->if_index; > + struct action_data adata = { > + .id = "mirred", > + .mirred = { > + .eaction = mirred, > + }, > + }; > > /* > * If attr->egress && mirred, then this is a special > @@ -1061,9 +1041,13 @@ struct tap_flow_items { > * redirect packets coming from the DPDK App, out > * through the remote netdevice. > */ > - if (attr->egress) > - if_index = pmd->remote_if_index; > - if (add_action_mirred(flow, if_index, mirred) < 0) > + adata.mirred.ifindex = attr->ingress ? pmd->if_index : > + pmd->remote_if_index; > + if (mirred == TCA_EGRESS_MIRROR) > + adata.mirred.action = TC_ACT_PIPE; > + else > + adata.mirred.action = TC_ACT_STOLEN; > + if (add_actions(flow, 1, &adata, TCA_FLOWER_ACT) < 0) > goto exit_action_not_supported; > else > goto end; > @@ -1077,14 +1061,33 @@ struct tap_flow_items { > if (action) > goto exit_action_not_supported; > action = 1; > - if (flow) > - err = add_action_gact(flow, TC_ACT_SHOT); > + if (flow) { > + struct action_data adata = { > + .id = "gact", > + .gact = { > + .action = TC_ACT_SHOT, > + }, > + }; > + > + err = add_actions(flow, 1, &adata, > + TCA_FLOWER_ACT); > + } > } else if (actions->type == RTE_FLOW_ACTION_TYPE_PASSTHRU) { > if (action) > goto exit_action_not_supported; > action = 1; > - if (flow) > - err = add_action_gact(flow, TC_ACT_UNSPEC); > + if (flow) { > + struct action_data adata = { > + .id = "gact", > + .gact = { > + /* continue */ > + .action = TC_ACT_UNSPEC, > + }, > + }; > + > + err = add_actions(flow, 1, &adata, > + TCA_FLOWER_ACT); > + } > } else if (actions->type == RTE_FLOW_ACTION_TYPE_QUEUE) { > const struct rte_flow_action_queue *queue = > (const struct rte_flow_action_queue *) > @@ -1096,22 +1099,30 @@ struct tap_flow_items { > if (!queue || > (queue->index > pmd->dev->data->nb_rx_queues - 1)) > goto exit_action_not_supported; > - if (flow) > - err = add_action_skbedit(flow, queue->index); > + if (flow) { > + struct action_data adata = { > + .id = "skbedit", > + .skbedit = { > + .skbedit = { > + .action = TC_ACT_PIPE, > + }, > + .queue = queue->index, > + }, > + }; > + > + err = add_actions(flow, 1, &adata, > + TCA_FLOWER_ACT); > + } > } else if (actions->type == RTE_FLOW_ACTION_TYPE_RSS) { > - /* Fake RSS support. 
*/ > const struct rte_flow_action_rss *rss = > (const struct rte_flow_action_rss *) > actions->conf; > > - if (action) > - goto exit_action_not_supported; > - action = 1; > - if (!rss || rss->num < 1 || > - (rss->queue[0] > pmd->dev->data->nb_rx_queues - 1)) > + if (action++) > goto exit_action_not_supported; > - if (flow) > - err = add_action_skbedit(flow, rss->queue[0]); > + if (!pmd->rss_enabled) > + err = rss_enable(pmd); > + (void)rss; > } else { > goto exit_action_not_supported; > } > @@ -1632,6 +1643,127 @@ int tap_flow_implicit_destroy(struct pmd_internals *pmd, > return 0; > } > > +#define BPF_PROGRAM "tap_bpf_program.o" > + > +/** > + * Enable RSS on tap: create leading TC rules for queuing. > + */ > +static int rss_enable(struct pmd_internals *pmd) > +{ > + struct rte_flow *rss_flow = NULL; > + char section[64]; > + struct nlmsg *msg = NULL; > + /* 4096 is the maximum number of instructions for a BPF program */ > + char annotation[256]; > + int bpf_fd; > + int i; > + > + /* > + * Add a rule per queue to match reclassified packets and direct them to > + * the correct queue. > + */ > + for (i = 0; i < pmd->dev->data->nb_rx_queues; i++) { > + struct action_data adata = { > + .id = "skbedit", > + .skbedit = { > + .skbedit = { > + .action = TC_ACT_PIPE, > + }, > + .queue = i, > + }, > + }; > + > + bpf_fd = 0; > + > + rss_flow = rte_malloc(__func__, sizeof(struct rte_flow), 0); > + if (!rss_flow) { > + RTE_LOG(ERR, PMD, > + "Cannot allocate memory for rte_flow"); > + return -1; > + } > + msg = &rss_flow->msg; > + tc_init_msg(msg, pmd->if_index, RTM_NEWTFILTER, NLM_F_REQUEST | > + NLM_F_ACK | NLM_F_EXCL | NLM_F_CREATE); > + msg->t.tcm_info = TC_H_MAKE((i + PRIORITY_OFFSET) << 16, > + htons(ETH_P_ALL)); > + msg->t.tcm_parent = TC_H_MAKE(MULTIQ_MAJOR_HANDLE, 0); > + tap_flow_set_handle(rss_flow); > + nlattr_add(&msg->nh, TCA_KIND, sizeof("bpf"), "bpf"); > + if (nlattr_nested_start(msg, TCA_OPTIONS) < 0) > + return -1; > + nlattr_add32(&msg->nh, TCA_BPF_FD, bpf_fd); > + snprintf(annotation, sizeof(annotation), "%s:[%s]", > + BPF_PROGRAM, section); > + nlattr_add(&msg->nh, TCA_BPF_NAME, strlen(annotation), > + annotation); > + > + if (add_actions(rss_flow, 1, &adata, TCA_BPF_ACT) < 0) > + return -1; > + nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > + /* Netlink message is now ready to be sent */ > + if (nl_send(pmd->nlsk_fd, &msg->nh) < 0) > + return -1; > + if (nl_recv_ack(pmd->nlsk_fd) < 0) > + return -1; > + LIST_INSERT_HEAD(&pmd->rss_flows, rss_flow, next); > + } > + > + snprintf(annotation, sizeof(annotation), "%s:[%s]", BPF_PROGRAM, > + section); > + rss_flow = rte_malloc(__func__, sizeof(struct rte_flow), 0); > + if (!rss_flow) { > + RTE_LOG(ERR, PMD, > + "Cannot allocate memory for rte_flow"); > + return -1; > + } > + msg = &rss_flow->msg; > + tc_init_msg(msg, pmd->if_index, RTM_NEWTFILTER, > + NLM_F_REQUEST | NLM_F_ACK | NLM_F_EXCL | NLM_F_CREATE); > + msg->t.tcm_info = > + TC_H_MAKE((RTE_PMD_TAP_MAX_QUEUES + PRIORITY_OFFSET) << 16, > + htons(ETH_P_ALL)); > + msg->t.tcm_parent = TC_H_MAKE(MULTIQ_MAJOR_HANDLE, 0); > + tap_flow_set_handle(rss_flow); > + nlattr_add(&msg->nh, TCA_KIND, sizeof("flower"), "flower"); > + if (nlattr_nested_start(msg, TCA_OPTIONS) < 0) > + return -1; > + > + /* no fields for matching: all packets must match */ > + { > + /* Actions */ > + struct action_data data[2] = { > + [0] = { > + .id = "bpf", > + .bpf = { > + .bpf_fd = bpf_fd, > + .annotation = annotation, > + }, > + }, > + [1] = { > + .id = "gact", > + .gact = { > + /* continue */ > + 
.action = TC_ACT_UNSPEC, > + }, > + }, > + }; > + > + if (add_actions(rss_flow, 2, data, TCA_FLOWER_ACT) < 0) > + return -1; > + } > + nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > + nlattr_nested_finish(msg); /* nested TCA_OPTIONS */ > + /* Netlink message is now ready to be sent */ > + if (nl_send(pmd->nlsk_fd, &msg->nh) < 0) > + return -1; > + if (nl_recv_ack(pmd->nlsk_fd) < 0) > + return -1; > + LIST_INSERT_HEAD(&pmd->rss_flows, rss_flow, next); > + > + pmd->rss_enabled = 1; > + return 0; > +} > + > /** > * Manage filter operations. > *
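
Since tap_bpf_program.c itself is not part of this RFC, here is a minimal
sketch of what such a cls_bpf program could look like, so readers can follow
the design (section and identifier names are illustrative, not from your
patch; the "classifier" section name matches ELF_SECTION_CLASSIFIER from
tap_bpf_elf.h):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    #define __section(x) __attribute__((section(x), used))
    #endif

    #ifndef TAP_MAX_QUEUES
    #define TAP_MAX_QUEUES 16  /* kept in sync with the PMD's Makefile */
    #endif

    /* Loaded by the PMD on the RSS rules: compute a hash over the
     * packet, select a queue, and let TC reclassify so the skbedit
     * rules can redirect to that queue. */
    __section("classifier") int
    tap_rss_classify(struct __sk_buff *skb)
    {
            __u32 hash = 0;  /* placeholder: the real program computes
                              * the Toeplitz hash over the 5-tuple here */
            __u32 queue = hash % TAP_MAX_QUEUES;

            skb->cb[0] = queue;     /* picked up after reclassification */
            return TC_ACT_UNSPEC;   /* continue through the TC rules */
    }

    char _license[] __section("license") = "GPL";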