From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pascal Mazon
To: Ophir Munk, dev@dpdk.org
Cc: Thomas Monjalon, Olga Shern
Date: Tue, 5 Dec 2017 08:53:35 +0100
Message-ID: <49b99e45-f898-3e33-b435-8803ce80e1ab@6wind.com>
In-Reply-To: <1512028870-13597-1-git-send-email-ophirmu@mellanox.com>
References: <1512028870-13597-1-git-send-email-ophirmu@mellanox.com>
Subject: Re: [dpdk-dev] [RFC 1/2] net/tap: add eBPF to TAP device

Hi Ophir,

I wrote that doc (rte_flow_tap) a while ago (10+ months); it is no longer
accurate and would very much need updating. I might have written some of the
code you propose, but I definitely didn't see this patch before you sent it,
so you shouldn't use my sign-off. That goes for the second patch in the
series, too.

On 30/11/2017 09:01, Ophir Munk wrote:
> The DPDK traffic classifier is the rte_flow API, and the tap PMD
> must support it, including RSS queue mapping actions.
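
For context, the RSS action the PMD has to honor looks roughly like this
(17.11-era ``rte_flow.h`` layout, quoted from memory -- double-check against
the tree you target; your own code below reads ``rss->num`` and
``rss->queue[0]``, which matches):

    struct rte_flow_action_rss {
            const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
            uint16_t num; /**< Number of entries in queue[]. */
            uint16_t queue[]; /**< Queue indices to use. */
    };
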
> An example usage for this requirement is failsafe transparently
> switching from a PCI device to a TAP device while the RSS queues are
> the same on both devices.
>
> TC was chosen as the TAP classifier, but TC alone does not support RSS
> queue mapping. This commit uses a combination of TC rules and eBPF
> actions in order to support TAP RSS.
>
> eBPF requires Linux kernel 3.19 or newer and is effective only when
> running with such a kernel. It must be compiled with the appropriate
> Linux kernel headers. If the kernel headers do not include the eBPF
> definitions, a warning is issued at compile time and TAP RSS is not
> supported.
>
> Signed-off-by: Pascal Mazon
> Signed-off-by: Ophir Munk
> ---
>
> The DPDK traffic classifier is the rte_flow API, and the tap PMD
> must support it, including RSS queue mapping actions.
> An example usage for this requirement is failsafe transparently
> switching from a PCI device to a TAP device while the RSS queues are
> the same on both devices.
> TC was chosen as the TAP classifier, but TC alone does not support RSS
> queue mapping. This RFC suggests using a combination of TC rules and
> eBPF actions in order to support TAP RSS.
> eBPF requires Linux kernel 3.19 or newer and is effective only when
> running with such a kernel. It must be compiled with the appropriate
> Linux kernel headers. If the kernel headers do not include the eBPF
> definitions, a warning is issued at compile time and TAP RSS is not
> supported.
> The C source file (tap_bpf_insns.c) includes the eBPF "assembly
> instructions" in the form of an array of struct bpf_insns.
> This array is passed to the kernel for execution via the BPF system
> call.
> The C source file (tap_bpf_program.c), from which the "assembly
> instructions" were generated, is included in the TAP source tree;
> however, it does not take part in the DPDK compilation.
> The TAP documentation will detail the process of generating the eBPF
> "assembly instructions".
>
> eBPF programs controlled from the tap PMD will be used to match
> packets, compute a hash given the configured key, and send packets to
> the desired queue.
> In an eBPF program, it is typically not possible to edit the
> queue_mapping field in the skb to direct the packet to the correct
> queue. That part would be addressed by chaining a
> ``skbedit queue_mapping`` action.
>
> A packet would go through these TC rules (on the local side of the tap
> netdevice):
>
> +-----+---------------------------+----------------------------------+----------+
> |PRIO | Match                     | Action 1                         | Action 2 |
> +=====+===========================+==================================+==========+
> |  1  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
> +-----+---------------------------+----------------------------------+----------+
> |  2  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
> +-----+---------------------------+----------------------------------+----------+
> | ... |                           |                                  |          |
> +-----+---------------------------+----------------------------------+----------+
> |  x  | ANY                       | BPF: append NULL 32bits for hash |          |
> +-----+---------------------------+----------------------------------+----------+
> |x + 1| ACTUAL FLOW RULE 1 MATCH  | ...                              |          |
> +-----+---------------------------+----------------------------------+----------+
> |x + 2| ACTUAL FLOW RULE 2 MATCH  | ...                              |          |
> +-----+---------------------------+----------------------------------+----------+
> | ... |                           |                                  |          |
> +-----+---------------------------+----------------------------------+----------+
> |  y  | FLOW RULE RSS 1 MATCH     | BPF compute hash into packet     |reclassify|
> |     |                           | tailroom && set queue in skb->cb |          |
> +-----+---------------------------+----------------------------------+----------+
> |y + 1| FLOW RULE RSS 2 MATCH     | BPF compute hash into packet     |reclassify|
> |     |                           | tailroom && set queue in skb->cb |          |
> +-----+---------------------------+----------------------------------+----------+
> | ... |                           |                                  |          |
> +-----+---------------------------+----------------------------------+----------+
> |  z  | ANY (default RSS)         | BPF compute hash into packet     |reclassify|
> |     |                           | tailroom && set queue in skb->cb |          |
> +-----+---------------------------+----------------------------------+----------+
> |  z  | ANY (isolate mode)        | DROP                             |          |
> +-----+---------------------------+----------------------------------+----------+
>
> Rules 1..x match marked packets and redirect them to their queues.
> However, on first classification packets are not yet marked, so they
> are not redirected. Only when they later go through the RSS rules y..z
> does BPF compute the RSS hash, set the queue in skb->cb, and
> reclassify the packets. The packets then go through rules 1..x again,
> this time marked, and are redirected.
> Rules (x+1)..y are the non-RSS TC rules already used in DPDK versions
> prior to 18.02.
>
> doc/guides/prog_guide/rte_flow_tap.rst | 962 +++++++++++++++++++++++++++++++++
> drivers/net/tap/Makefile | 6 +-
> drivers/net/tap/rte_eth_tap.h | 7 +-
> drivers/net/tap/tap_bpf_elf.h | 56 ++
> drivers/net/tap/tap_flow.c | 336 ++++++++----
> 5 files changed, 1263 insertions(+), 104 deletions(-)
> create mode 100644 doc/guides/prog_guide/rte_flow_tap.rst
> create mode 100644 drivers/net/tap/tap_bpf_elf.h
>
> diff --git a/doc/guides/prog_guide/rte_flow_tap.rst b/doc/guides/prog_guide/rte_flow_tap.rst
> new file mode 100644
> index 0000000..04ddda6
> --- /dev/null
> +++ b/doc/guides/prog_guide/rte_flow_tap.rst
> @@ -0,0 +1,962 @@
> +=====================================
> +Flow API support in TAP PMD, using TC
> +=====================================
> +
> +.. contents::
> +.. sectnum::
> +
> +.. footer::
> +
> +   v0.8 - page ###Page###
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +Rationale
> +=========
> +
> +For this project, the tap PMD has to receive selected traffic from a different
> +netdevice (refer to the *VM migration with Microsoft Hyper-V and Mellanox
> +ConnectX-3* document) and only cover the same set of rules as supported by the
> +mlx4 PMD.
> +
> +The DPDK traffic classifier is the rte_flow API, and the tap PMD must therefore
> +implement it. For that, TC was chosen for several reasons:
> +
> +- it happens very early in the kernel stack for ingress (faster than netfilter).
> +- it supports dropping packets given a specific flow.
> +- it supports redirecting packets to a different netdevice.
> +- it has a "flower" classifier type that covers most of the pattern items in
> +  rte_flow.
> +- it can be configured through a netlink socket, without an external tool.
> +
> +Modes of operation
> +==================
> +
> +There should be two modes of operation for the tap PMD regarding rte_flow:
> +*local* and *remote*. Only one mode can be in use at a time for a specific tap
> +interface.
> +
> +The *local* mode would be the default one, if no specific parameter is specified
> +in the command line. 
To start the application with tap in *remote* mode, set the > +``remote`` tap parameter to the interface you want to redirect packets from, > +e.g.:: > + > + testpmd -n 4 -c 0xf -m 1024 --vdev=net_tap,iface=tap0,remote=eth3 -- \ > + -i --burst=64 --coremask=0x2 > + > +*Local* mode > +------------ > + > +In *local* mode, flow rules would be applied as-is, on the tap netdevice itself > +(e.g.: ``tap0``). > + > +The typical use-case is having a linux program (e.g. a webserver) communicating > +with the DPDK app through the tap netdevice:: > + > + +-------------------------+ > + | DPDK application | > + +-------------------------+ > + | ^ > + | rte_flow rte_flow | > + v egress ingress | > + +-------------------------+ > + | Tap PMD | > + +-------------------------+ > + | ^ > + | TC TC | > + v ingress egress | > + +-------------------------+ +-------------------------+ > + | |<-------------| | > + | Tap netdevice (tap0) | | Linux app (webserver) | > + | |------------->| | > + +-------------------------+ +-------------------------+ > + > +.. raw:: pdf > + > + PageBreak > + > +*Remote* mode > +------------- > + > +In *remote* mode, flow rules would be applied on the tap netdevice (e.g.: > +``tap0``), and use a similar match to redirect specific packets from another > +netdevice (e.g.: ``eth3``, a NetVSC netdevice in our project scenario):: > + > + +-------------------------+ > + | DPDK application | > + +-------------------------+ > + | ^ > + | rte_flow rte_flow | > + v egress ingress | > + +-------------------------+ > + | Tap PMD | > + +-------------------------+ > + | ^ > + | TC TC | > + v ingress egress | > + +-------------------------+ +-------------------------+ > + | |<------------------redirection-------\ | > + | Tap netdevice (tap0) | | | | > + | |------------->|-\ eth3 | | > + +-------------------------+ +--|--------------------|-+ > + | TC TC ^ > + | egress ingress | > + v | > + > +.. raw:: pdf > + > + PageBreak > + > +rte_flow rules conversion > +========================= > + > +Netlink > +------- > + > +The only way to create TC rules in the kernel is through netlink messages. > +Two possibilities arise for managing TC rules: > + > +- Using native netlink API calls in the tap PMD > +- Calling the ``tc`` command from iproute2 inside our PMD, via ``system()``. > + > +The former will be done, as library calls are faster than changing context and > +executing an external program from within the tap PMD. Moreover, the kernel TC > +API might propose features not yet implemented in iproute2. Furthermore, a > +custom implementation enables finer tuning and better control. > + > +.. > + Some implementations for TC configuration through Netlink exist already. It's a > + good source of inspiration on how to do it: > + > + - iproute2's tc `source code`__ > + - ovs's tc implementation__ (not yet upstream) > + > + __ https://github.com/shemminger/iproute2/tree/master/tc > + __ https://mail.openvswitch.org/pipermail/ovs-dev/2016-November/324693.html > + > +Conversion examples > +------------------- > + > +Here are a few examples of rules and how they can be translated from rte_flow > +rules to TC rules. rte_flow rules will be expressed using testpmd's ``flow`` > +command syntax, while TC rules will use iproute2 ``tc`` command syntax. > + > +**Notes**: > + - rte_flow ``ingress`` direction can be translated into a TC ``egress`` rule, > + and vice versa, when it applies to a tap interface, as TC considers the > + kernel netdevice standpoint. 
> + - in TC, redirecting a packet works by taking a packet from ``ingress`` and > + sending to another device's ``egress``. > + > +*Local* mode > +~~~~~~~~~~~~ > + > +#. Flow rule to give packets coming on the ``tap0`` interface to RX queue 0: > + > + Using rte_flow:: > + > + flow validate 0 ingress pattern port index is 0 / end \ > + actions queue index 0 / end > + > + Using ``tc``:: > + > + tc filter add dev tap0 parent 1: flower indev tap0 \ > + action skbedit queue_mapping 0 > + > +#. Flow rule to get packets with source mac ``de:ad:ca:fe:00:02`` on RX queue 2: > + > + Using rte_flow:: > + > + flow create 0 ingress pattern eth src is de:ad:ca:fe:00:02 / end \ > + actions queue 2 / end > + > + Using ``tc``:: > + > + tc filter add dev tap0 parent 1: flower src_mac de:ad:ca:fe:00:02 \ > + action skbedit queue_mapping 2 > + > +#. Flow rule to drop packets matching specific 5-tuple info: > + > + Using rte_flow:: > + > + flow create 0 ingress pattern eth dst is 3a:80:ce:61:36:54 \ > + src is 52:43:7b:fd:ac:f3 / ipv4 src is 1.1.1.1 dst is 2.2.2.2 \ > + / udp src is 4444 dst is 5555 / end actions drop / end > + > + Using ``tc``:: > + > + tc filter add dev tap0 parent 1: flower dst_mac 3a:80:ce:61:36:54 \ > + src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \ > + ip_proto udp src_port 4444 dst_port 5555 action drop > + > +*Remote* mode > +~~~~~~~~~~~~~ > + > +In *remote* mode, an additional rule for redirecting packet is systematically > +required. The examples are similar to the previous section (the rte_flow rule > +will thus be omitted). > + > +#. TC rules to give packets coming on the ``eth3`` interface to ``tap0`` RX > + queue 0:: > + > + # redirection rule > + tc filter add dev eth3 parent ffff: flower indev eth3 \ > + action mirred egress redirect dev tap0 > + # actual tap rule > + tc filter add dev tap0 parent 1: flower indev tap0 \ > + action skbedit queue_mapping 0 > + > +#. TC rules to get packets with source mac ``de:ad:ca:fe:00:02`` on RX queue 2:: > + > + # redirection rule > + tc filter add dev eth3 parent ffff: flower src_mac de:ad:ca:fe:00:02 \ > + action mirred egress redirect dev tap0 > + # actual tap rule > + tc filter add dev tap0 parent 1: flower src_mac de:ad:ca:fe:00:02 \ > + action skbedit queue_mapping 2 > + > +#. TC rules to drop packets matching specific 5-tuple info:: > + > + # redirection rule > + tc filter add dev eth3 parent ffff: flower dst_mac 3a:80:ce:61:36:54 \ > + src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \ > + ip_proto udp src_port 4444 dst_port 5555 \ > + action mirred egress redirect dev tap0 > + # actual tap rule > + tc filter add dev tap0 parent 1: flower dst_mac 3a:80:ce:61:36:54 \ > + src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \ > + ip_proto udp src_port 4444 dst_port 5555 action drop > + > +One last thing, to redirect packets the other way around (from ``tap0`` to > +``eth3``), we would use a similar rule, exchanging interfaces and using an > +appropriate match, e.g.:: > + > + tc filter add dev tap0 parent ffff: flower indev tap0 \ > + action mirred egress redirect dev eth3 > + > +.. > + **Note:** ``parent ffff:`` is for TC ``ingress`` while ``parent 1:`` is for TC > + ``egress``. > + > +Broadcast and promiscuous support > ++++++++++++++++++++++++++++++++++ > + > +*Remote* mode requirements: > + > +#. When turning the tap netdevice promiscuous, the remote netdevice should > + implicitly be turned promiscuous too, to get as many packets as possible. > + > +#. 
Packets matching the destination MAC configured in the tap PMD should be > + redirected from the remote without being processed by the stack there in the > + kernel. > + > +#. In promiscuous mode, an incoming packet should be duplicated to be processed > + both by the tap PMD and the remote netdevice itself. > + > +#. Incoming packets with broadcast destination MAC (i.e.: ``ff:ff:ff:ff:ff:ff``) > + should be duplicated to be processed both by the tap PMD and the remote > + netdevice itself. > + > +#. Incoming packets with IPv6 multicast destination MAC (i.e.: > + ``33:33:00:00:00:00/33:33:00:00:00:00``) should be duplicated to be processed > + both by the tap PMD and the remote netdevice itself. > + > +#. Incoming packets with broadcast/multicast bit set in the destination MAC > + (i.e.: ``01:00:00:00:00:00/01:00:00:00:00:00``) should be duplicated to be > + processed both by the tap PMD and the remote netdevice itself. > + > +Each of these requirements (except the first one) can be directly translated > +into a TC rule, e.g.:: > + > + # local mac (notice the REDIRECT for mirred action): > + tc filter add dev eth3 parent ffff: prio 1 flower dst_mac de:ad:be:ef:01:02 \ > + action mirred egress redirect dev tap0 > + > + # tap promisc: > + tc filter add dev eth3 parent ffff: prio 2 basic \ > + action mirred egress mirror dev tap0 > + > + # broadcast: > + tc filter add dev eth3 parent ffff: prio 3 flower dst_mac ff:ff:ff:ff:ff:ff \ > + action mirred egress mirror dev tap0 > + > + # broadcast v6 (can't express mac_mask with tc, but it works via netlink): > + tc filter add dev eth3 parent ffff: prio 4 flower dst_mac 33:33:00:00:00:00 \ > + action mirred egress mirror dev tap0 > + > + # all_multi (can't express mac_mask with tc, but it works via netlink): > + tc filter add dev eth3 parent ffff: prio 5 flower dst_mac 01:00:00:00:00:00 \ > + action mirred egress mirror dev tap0 > + > +When promiscuous mode is switched off or on, the first TC rule will be modified > +to have respectively an empty action (``continue``) or the ``mirror`` action. > + > +The first 5 priorities are always reserved, and can only be used for these > +filters. > + > +On top of that, the tap PMD can configure explicit rte_flow rules, translated as > +TC rules on both the remote netdevice and the tap netdevice. On the remote, > +those would need to be processed after the default rules handling promiscuous > +mode, broadcast and all_multi packets. > + > +When using the ``mirror`` action, the packet is duplicated and sent to the tap > +netdevice, while the original packet gets directly processed by the kernel > +without going through later TC rules for the remote. On the tap netdevice, the > +duplicated packet will go through tap TC rules and be classified depending on > +those rules. > + > +**Note:** It is possible to combine a ``mirror`` action and a ``continue`` > +action for a single TC rule. Then the original packet would undergo remaining TC > +rules on the remote netdevice side. > + > +When using the ``redirect`` action, the behavior is similar on the tap side, but > +the packet is not duplicated, no further kernel processing is done for the > +remote side. > + > +The following diagram sums it up. 
A packet that matches a TC rule follows the
> +associated action (the number in the diamond represents the rule prio as set in
> +the above TC rules)::
> +
> +
> +     Incoming packet  |
> +    on remote (eth3)  |
> +                      |  Going through
> +                      |  TC ingress rules
> +                      v
> +                     / \
> +                    / 1 \
> +                   /     \                 yes
> +                  /  mac  \____________________> tap0
> +                  \ match?/             duplicated pkt
> +                   \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 2 \
> +    eth3    yes    /     \                 yes
> +   kernel <_______/promisc\____________________> tap0
> +   stack          \ match?/             duplicated pkt
> +  original pkt     \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 3 \
> +    eth3    yes    /     \                 yes
> +   kernel <_______/ bcast \____________________> tap0
> +   stack          \ match?/             duplicated pkt
> +  original pkt     \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 4 \
> +    eth3    yes    /     \                 yes
> +   kernel <_______/ bcast6\____________________> tap0
> +   stack          \ match?/             duplicated pkt
> +  original pkt     \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                     / \
> +                    / 5 \
> +                   / all \
> +    eth3    yes   /       \                yes
> +   kernel <______/  multi  \___________________> tap0
> +   stack         \  match? /            duplicated pkt
> +  original pkt    \       /
> +                   \     /
> +                    \   /
> +                     \ /
> +                      V  no, then continue
> +                      |  with TC rules
> +                      |
> +                      v
> +                      |
> +                      .  remaining TC rules
> +                      .
> +    eth3              |
> +   kernel <___________/
> +   stack  original pkt
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +Associating an rte_flow rule with a TC one
> +==========================================
> +
> +A TC rule is identified by a ``priority`` (16-bit value) and a ``handle``
> +(32-bit value). To delete a rule, the priority must be specified, and if several
> +rules have the same priority, the handle is needed to select the correct one.
> +
> +..
> +   Specifying an empty priority and handle when requesting a TC rule creation
> +   will let the kernel automatically decide what values to set. In fact, the
> +   kernel will start with a high priority (i.e. 49152), and subsequent rules
> +   will get decreasing priorities (lower priorities get evaluated first).
> +
> +To avoid further requests to the kernel to identify what priority/handle has
> +been automatically allocated, the tap PMD can set priorities and handles
> +systematically when creating a rule.
> +
> +In *local* mode, an rte_flow rule should be translated into a single TC flow
> +identified by priority+handle.
> +
> +In *remote* mode, an rte_flow rule requires two TC rules, one on the tap
> +netdevice itself (for the correct action) and another one on the other netdevice
> +where packets are redirected from. Both TC rules' priorities+handles must be
> +stored for a specific rte_flow rule, and associated with the device they are
> +applied on.
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +Considerations regarding Flow API support
> +=========================================
> +
> +Flow rule attributes
> +--------------------
> +
> +Groups and priorities:
> +  There is no native support of groups in TC. Instead, the priority field
> +  (which is part of the netlink TC message header) can be adapted. The four
> +  MSBs would be used to define the group (allowing for 16 groups), while the
> +  12 LSBs would be left to define the actual priority (up to 4096).
> +
> +  Rules with lower priorities are evaluated first. For rules with identical
> +  priorities, the one with the highest handle value gets evaluated first.
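
As an illustration of that group/priority packing -- a minimal sketch in C,
macro and function names are mine, not from the patch:

    #include <stdint.h>

    #define TAP_GROUP_SHIFT 12      /* group in the 4 MSBs */
    #define TAP_PRIO_MASK   0x0fff  /* priority in the 12 LSBs */

    /* Pack an rte_flow group and priority into the 16-bit TC priority. */
    static inline uint16_t
    tap_tc_priority(uint16_t group, uint16_t priority)
    {
            return (uint16_t)((group & 0xf) << TAP_GROUP_SHIFT) |
                   (priority & TAP_PRIO_MASK);
    }
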
> +
> +Direction:
> +  Both ingress and egress filtering can be supported.
> +
> +Meta item types
> +---------------
> +
> +Most applications will use: ``(END | VOID)``
> +
> +END, VOID:
> +  Supported without problem.
> +
> +INVERT:
> +  There is no easy way to support that in TC. It won't be supported.
> +
> +  **mlx4 will not support it either.**
> +
> +PF, VF, PORT:
> +  Not applicable to a tap netdevice.
> +
> +Data matching item types
> +------------------------
> +
> +Most applications will use:
> +``ETH / (IPV4 | IPV6 | END) / (TCP | UDP | END) / END``
> +
> +ANY:
> +  Should be supported.
> +
> +  **mlx4 will partially support it.**
> +
> +RAW:
> +  It is not planned to support it for now. Matching raw packets would require
> +  using a different classifier than "flower", which is the simplest classifier
> +  and covers most other cases. With TC, it is not possible to support both
> +  "flower" matching and raw packets in the same rule.
> +
> +  **mlx4 will not support it either.**
> +
> +VLAN:
> +  Matching VLAN ID and priority is supported.
> +  **Note: Linux v4.9 is required for VLAN support.**
> +
> +ETH, IPV4, IPV6, UDP, TCP:
> +  Matching source/destination MAC/IP/port is supported, with masks.
> +
> +  **mlx4 does not support partial bit-masks (full or zeroed only).**
> +
> +ICMP:
> +  By specifying the appropriate ether type, ICMP packets can be matched.
> +  However, there is no support for ICMP type or code.
> +
> +  **mlx4 will not support it, however.**
> +
> +SCTP:
> +  By specifying the appropriate IP protocol, SCTP packets can be matched.
> +  However, no specific SCTP fields can be matched.
> +
> +  **mlx4 will not support it, however.**
> +
> +VXLAN:
> +  VXLAN is not recognized by the "flower" classifier. Kernel-managed VXLAN
> +  traffic would come through an additional netdevice, which falls outside
> +  the scope of this project. VXLAN traffic should occur outside VMs anyway.
> +
> +Action types
> +------------
> +
> +Most applications will use: ``(VOID | END | QUEUE | DROP) / END``
> +
> +By default, multiple actions are possible for TC flow rules. However, they are
> +ordered in the kernel. The implementation will need to handle actions in a way
> +that orders them intelligently when creating them.
> +
> +VOID, END:
> +  Supported.
> +
> +PASSTHRU:
> +  The generic "continue" action can be used.
> +
> +  **mlx4 will not support it, however.**
> +
> +MARK / FLAG:
> +  The mark is a field inside an skbuff. However, the tap reads messages (mostly
> +  packet data) without that info. As an alternative, it may be possible to
> +  create a specific queue to pass packets with a specific mark. Further testing
> +  is needed to ensure it is feasible.
> +
> +QUEUE:
> +  The ``skbedit`` action with the ``queue_mapping`` option enables directing
> +  packets to specific queues.
> +
> +  Like rte_flow, specifying several ``skbedit queue_mapping`` actions in TC
> +  only considers the last one.
> +
> +DROP:
> +  The generic "drop" action can be used. Packets will effectively be dropped,
> +  and not left for the kernel to process.
> +
> +COUNT:
> +  Stats are automatically stored in the kernel. The COUNT action will thus
> +  be ignored when creating the rule. ``rte_flow_query()`` can be implemented
> +  to request a rule's stats from the kernel.
> +
> +DUP:
> +  Duplicating packets is not supported.
> +
> +RSS:
> +  There's no built-in mechanism for RSS in TC.
> +
> +  By default, incoming packets go to the tap PMD queue 0. To support RSS in
> +  software, several additional queues must be set up. Packets coming in on
> +  queue 0 can be considered as requiring RSS, and the PMD will apply software
> +  RSS (using something like ``rte_softrss()``) to select a queue for the
> +  packet.
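
As an illustration of that last point -- my sketch, not from the patch, using
``rte_softrss()`` from ``rte_thash.h``:

    #include <rte_common.h>
    #include <rte_thash.h>

    /* Select an RX queue for an IPv4/UDP packet with the software
     * Toeplitz hash; rss_key is the configured RSS key. */
    static uint16_t
    softrss_queue(uint32_t src_ip, uint32_t dst_ip, uint16_t sport,
                  uint16_t dport, const uint8_t *rss_key, uint16_t nb_queues)
    {
            uint32_t tuple[3];

            tuple[0] = src_ip;
            tuple[1] = dst_ip;
            tuple[2] = ((uint32_t)sport << 16) | dport;
            return rte_softrss(tuple, RTE_DIM(tuple), rss_key) % nb_queues;
    }
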
> +
> +PF, VF:
> +  Not applicable to a tap netdevice.
> +
> +.. raw:: pdf
> +
> +   PageBreak
> +
> +TC limitations for flow collision
> +=================================
> +
> +From the TC standpoint, filter rules with identical priorities do not collide
> +as long as they specify the same fields (with identical masks) in the TC
> +message and differ in at least one value.
> +
> +Unfortunately, some flows that obviously are not colliding can be considered
> +otherwise by the kernel when parsing the TC messages, and thus their creation
> +would be rejected.
> +
> +Here is a table matching TC fields with their flow API equivalents:
> +
> ++------------------------------+-----------------------------------+-----------+
> +| TC message field             | rte_flow API                      | maskable? |
> ++==============================+===================================+===========+
> +| TCA_FLOWER_KEY_ETH_DST       | eth dst                           | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_ETH_SRC       | eth src                           | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_ETH_TYPE      | eth type is 0xZZZZ ||             | no        |
> +|                              | eth / {ipv4|ipv6}                 |           |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IP_PROTO      | eth / {ipv4|ipv6} / {tcp|udp}     | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV4_SRC      | eth / ipv4 src                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV4_DST      | eth / ipv4 dst                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV6_SRC      | eth / ipv6 src                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_IPV6_DST      | eth / ipv6 dst                    | yes       |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_L4_SRC        | eth / {ipv4|ipv6} / {tcp|udp} src | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_L4_DST        | eth / {ipv4|ipv6} / {tcp|udp} dst | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_VLAN_ID       | eth / vlan vid                    | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_VLAN_PRIO     | eth / vlan pcp                    | no        |
> ++------------------------------+-----------------------------------+-----------+
> +| TCA_FLOWER_KEY_VLAN_ETH_TYPE | eth / vlan tpid                   | no        |
> ++------------------------------+-----------------------------------+-----------+
> +
> +When creating rules with identical priorities, one must make sure that they
> +would be translated in TC using the same fields as shown in the above table.
> +
> +The following flow rules can share the same priority, as they use the same
> +fields with identical masks under the hood::
> +
> +   > flow create 0 ingress priority 0 pattern eth / ipv4 / end
> +      actions drop / end
> +   Flow rule #0 created
> +   > flow create 0 ingress priority 0 pattern eth type is 0x86dd / end
> +      actions drop / end
> +   Flow rule #1 created
> +
> +**Note:** Both rules use ETH_TYPE (mask 0xffff) in their TC form.
> +
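
In ``tc`` terms, following the syntax of the earlier examples (priorities
illustrative, not from the patch), those two rules would look roughly like:

   tc filter add dev tap0 parent 1: prio 1 flower eth_type ip action drop
   tc filter add dev tap0 parent 1: prio 1 flower eth_type ipv6 action drop

> +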
> + > +Sadly, the following flow rules cannot share the same priority, since fields for > +matching IPv4 and IPv6 src/dst addresses are different:: > + > + > flow create 0 ingress priority 1 pattern eth / ipv4 src is 1.1.1.1 / end > + actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 1 pattern eth / ipv6 src is ::1 / end > + actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > +**Note:** First rule uses ETH_TYPE and IPV4_SRC, while the second uses ETH_TYPE > +and IPV6_SRC. > + > +It is however possible to match different IPvX addresses with the same > +priority:: > + > + > flow create 0 ingress priority 2 pattern eth / ipv4 src is 1.1.1.1 / end > + actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 2 pattern eth / ipv4 src is 2.2.2.2 / end > + actions drop / end > + Flow rule #1 created > + > +If the first rule specifies both destination and source addresses, then the > +other rule with the same priority must too (with identical masks):: > + > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 1.1.1.1 > + dst is 1.1.1.2 / end actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 / end > + actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 > + dst spec 2.2.2.3 dst mask 255.255.255.0 / end actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 > + dst is 2.2.2.3 / end actions drop / end > + Flow rule #1 created > + > +**Note:** First rule uses ETH_TYPE, IPV4_SRC and IPV4_DST (with full masks). The > +two others must also use those to share the same priority. > + > +It is possible to match TCP/UDP packets with different ports whatever the > +underlying L3, if the same fields are used (thus no l3 addresses specification). > +For instance:: > + > + > flow create 0 ingress priority 4 pattern eth / ipv4 / tcp dst is 3333 / end > + actions drop / end > + Flow rule #0 created > + > flow create 0 ingress priority 4 pattern eth / ipv6 / udp dst is 4444 / end > + actions drop / end > + Flow rule #1 created > + > flow create 0 ingress priority 4 pattern eth / ipv6 / udp src is 5555 / end > + actions drop / end > + PMD: Kernel refused TC filter rule creation (22): Invalid argument > + Caught error type 2 (flow rule (handle)): overlapping rules > + > +**Note:** First 2 rules use ETH_TYPE, IP_PROTO and L4_DST with different values > +but identical masks, so they're OK. Last rule used L4_SRC instead of L4_DST. > + > +.. raw:: pdf > + > + PageBreak > + > +RSS implementation for tap > +========================== > + > +There are several areas of research for a tap RSS implementation: > + > +#. userland implementation in tap PMD > +#. userland implementation in DPDK (generic) > +#. userland implementation using combination of TC rules and BPF filters/actions > +#. kernel-side implementation in tap driver > +#. kernel-side implementation as a BPF classifier/action > +#. 
kernel-side implementation as a separate TC action > + > ++--------------+------------------------------+------------------------------+ > +| | Pros | Cons | > ++==============+==============================+==============================+ > +| tap PMD | - no kernel upstreaming | - tap PMD is supposed to be | > +| | | simple, and would no longer| > +| | | be. | > +| | | | > +| | | - complex rework, with many | > +| | | rings for enqueuing packets| > +| | | to the right queue | > +| | | | > +| | | - slower | > +| | | | > +| | | - won't be accepted as it | > +| | | doesn't make sense to redo | > +| | | what the kernel did | > +| | | previously | > ++--------------+------------------------------+------------------------------+ > +| generic DPDK | - would be useful to others | - design must be compatible | > +| | | with most PMDs | > +| | | | > +| | | - probably the longest to | > +| | | develop | > +| | | | > +| | | - requires DPDK community | > +| | | approval | > +| | | | > +| | | - requires heavy changes in | > +| | | tap PMD itself anyway | > ++--------------+------------------------------+------------------------------+ > +| TC rules | - no kernel upstreaming | - BPF is complicated to learn| > +| combination | | | > +| | - fast | - Runtime BPF compilation / | > +| | | or bytecode change, would | > +| | - per-flow RSS | be tricky | > +| | | | > +| | - no change in tap PMD | - much rework in the tap PMD | > +| | datapath | to handle lots of new | > +| | | netlink messages / actions | > ++--------------+------------------------------+------------------------------+ > +| tap driver | - pretty fast as it | - might not be accepted by | > +| | intervenes early in packet | the kernel community as | > +| | RX | they may cling to their | > +| | | jhash2 hashing function for| > +| | | RX. | > +| | | | > +| | | - only a single RSS context | > ++--------------+------------------------------+------------------------------+ > +| BPF | - fast | - BPF is complicated to learn| > +| classifier - | | | > +| action | - per-flow RSS | - would require changing the | > +| | | kernel API to support | > +| | | editing queue_mapping in an| > +| | | skb | > +| | | | > +| | | - hashing would be performed | > +| | | for each queue of a | > +| | | specific RSS context | > +| | | | > +| | | - probably difficult to gain | > +| | | community acceptance | > ++--------------+------------------------------+------------------------------+ > +| TC action | - much more flexibility, with| - needs to be in sync with | > +| | per-flow RSS, multiple | iproute2's tc program | > +| | keys, multiple packet | | > +| | fields for the hash... | - kernel upstreaming is not | > +| | | necessarily easy | > +| | - it's a separate kernel | | > +| | module that can be | - rework in tap PMD to | > +| | maintained out-of-tree and | support new RSS action and | > +| | optionally upstreamed | configuration | > +| | anytime | | > +| | | | > +| | - most logical to be handled | | > +| | in kernel as RSS is | | > +| | supposed to be computed in | | > +| | the "NIC" exactly once. | | > +| | | | > +| | - fastest | | > +| | | | > +| | - no change in tap PMD | | > +| | datapath | | > ++--------------+------------------------------+------------------------------+ > + > +TC rules using BPF from tap PMD > +------------------------------- > + > +The third solution is the best for userland-based solutions. 
> +It does the job well, fast (datapath running in kernel), is logically happening > +in the kernel in runtime, supports flow-based RSS, has the best potential to > +be accepted by the community. > + > +Advantages with this solution: > +- hash can be recorded in the packet data and read in tap PMD > +- no kernel customization, everything in DPDK > +- packet gets in tap PMD on the correct queue directly > + > +Drawbacks: > +- complicates tap PMD a lot: > + - 3 BPF programs > + - new implicit rules > + - new action and filter support > + - packet stripping > +- numerous TC rules required (in proportion with the number of queues) > +- fast (kernel + JIT BPF), but several TC rules must be crossed > + > +BPF programs controlled from tap PMD will be used to match packets, compute a > +hash given the configured key, and send packets to tap using the desired queue. > + > +Design > +~~~~~~ > + > +BPF has a limited set of functions for editing the skb in TC. They are listed > +in ``linux/net/core/filter.c:tc_cls_act_func_proto()``: > + > +- skb_store_bytes > +- skb_load_bytes > +- skb_pull_data > +- csum_diff > +- csum_update > +- l3_csum_replace > +- l4_csum_replace > +- clone_redirect > +- get_cgroup_classid > +- skb_vlan_push > +- skb_vlan_pop > +- skb_change_proto > +- skb_change_type > +- skb_change_tail > +- skb_get_tunnel_key > +- skb_set_tunnel_key > +- skb_get_tunnel_opt > +- skb_set_tunnel_opt > +- redirect > +- get_route_realm > +- get_hash_recalc > +- set_hash_invalid > +- perf_event_output > +- get_smp_processor_id > +- skb_under_cgroup > + > +In a BPF program, it is typically not possible to edit the queue_mapping field > +to direct the packet in the correct queue. That part would be done by chaining a > +``skbedit queue_mapping`` action. > + > +It is not possible either to directly prepend data to a packet (appending works, > +though). > + > +A packet would go through these rules (on the local side of the tap netdevice): > + > ++-----+---------------------------+----------------------------------+----------+ > +|PRIO | Match | Action 1 | Action 2 | > ++=====+===========================+==================================+==========+ > +| 1 | marked? | skbedit queue 'mark' --> DPDK | | > ++-----+---------------------------+----------------------------------+----------+ > +| 2 | marked? | skbedit queue 'mark' --> DPDK | | > ++-----+---------------------------+----------------------------------+----------+ > +| ... | | | | > ++-----+---------------------------+----------------------------------+----------+ > +| x | ANY | BPF: append NULL 32bits for hash | | > +| | | | | > ++-----+---------------------------+----------------------------------+----------+ > +|x + 1| ACTUAL FLOW RULE 1 MATCH | ... | | > +| | | | | > ++-----+---------------------------+----------------------------------+----------+ > +|x + 2| ACTUAL FLOW RULE 2 MATCH | ... | | > +| | | | | > ++-----+---------------------------+----------------------------------+----------+ > +| ... | | | | > ++-----+---------------------------+----------------------------------+----------+ > +| y | FLOW RULE RSS 1 MATCH | BPF compute hash into packet |reclassify| > +| | | tailroom && set queue in skb->cb | | > ++-----+---------------------------+----------------------------------+----------+ > +|y + 1| FLOW RULE RSS 2 MATCH | BPF compute hash into packet |reclassify| > +| | | tailroom && set queue in skb->cb | | > ++-----+---------------------------+----------------------------------+----------+ > +| ... 
| | | | > ++-----+---------------------------+----------------------------------+----------+ > +| z | ANY (default RSS) | BPF compute hash into packet |reclassify| > +| | | tailroom && set queue in skb->cb | | > ++-----+---------------------------+----------------------------------+----------+ > +| z | ANY (isolate mode) | DROP | | > ++-----+---------------------------+----------------------------------+----------+ > + > + > + > +TC kernel action > +---------------- > + > +The latest solution (implementing a TC action) would probably be the most simple > +to implement. It is also very flexible, opening more possibilities for filtering > +and RSS combined. > + > +For this solution, the following parameters could be used to configure RSS in a > +TC netlink message: > + > +``queues`` (u16 \*): > + list of queues to spread incoming traffic on. That's actually the reta. > + **Note:** the queue in an ``skb`` is on 16-bits, hence the type here. > + > +``key`` (u8 \*): > + key to use for the Toeplitz-hash in this flow. > + > +``hash_fields`` (bitfield): > + similar to what's in DPDK, the bitfield should determine what fields in the > + packet header to use for hashing. It is likely another means of configuring > + which fields to pick would be used actually. > + > +``algo`` (unsigned): > + an enum value from the kernel act_rss header can be used to determine which > + algorithm (implemented in the kernel) to use. Possible algos could be > + toeplitz, xor, symmetric hash... > + > +**Note:** The number of queues to use is automatically deduced from the > +``queues`` netlink attribute length. The ``key`` length can be similarly > +obtained. > + > +.. raw:: pdf > + > + PageBreak > + > +Appendix: TC netlink message > +============================ > + > +**Note:** For deterministic behavior, TC queueing disciplines (QDISC), filters > +and classes must be flushed before starting to apply TC rules. There is a little > +bit of boilerplate (with specific netlink messages) to ensure TC rules can be > +applied. Typically, the TC ``ingress`` QDISC must be created first. > + > +For information, netlink messages regarding TC will look like this:: > + > + 0 8 16 24 32 > + +----------+----------+----------+----------+ --- > + 0 | Length | \ > + +---------------------+---------------------+ \ > + 4 | Type | Flags | | > + +----------- ---------+---------------------+ >-- struct > + 8 | Sequence number | | nlmsghdr > + +-------------------------------------------+ / > + 12 | Process Port ID (PID) | / > + +==========+==========+==========+==========+ --- > + 16 | Family | Rsvd1 | Reserved2 | \ > + +----------+----------+---------------------+ \ > + 20 | Interface index | | > + +-------------------------------------------+ | > + 24 | Handle | | > + +-------------------------------------------+ >-- struct > + 28 | Parent handle | | tcmsg > + | MAJOR + MINOR | | > + +-------------------------------------------+ | > + 32 | TCM info | / > + | priority + protocol | / > + +===========================================+ --- > + | | > + | Payload | > + | | > + ........................................ 
> + | | > + | | > + +-------------------------------------------+ > diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile > index 405b49e..9afae5e 100644 > --- a/drivers/net/tap/Makefile > +++ b/drivers/net/tap/Makefile > @@ -39,6 +39,9 @@ EXPORT_MAP := rte_pmd_tap_version.map > > LIBABIVER := 1 > > +# TAP_MAX_QUEUES must be a power of 2 as it will be used for masking */ > +TAP_MAX_QUEUES = 16 > + > CFLAGS += -O3 > CFLAGS += -I$(SRCDIR) > CFLAGS += -I. > @@ -47,6 +50,8 @@ LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring > LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash > LDLIBS += -lrte_bus_vdev > > +CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES) > + > # > # all source are stored in SRCS-y > # > @@ -89,7 +94,6 @@ tap_autoconf.h: tap_autoconf.h.new > mv '$<' '$@' > > $(SRCS-$(CONFIG_RTE_LIBRTE_PMD_TAP):.c=.o): tap_autoconf.h > - > clean_tap: FORCE > $Q rm -f -- tap_autoconf.h tap_autoconf.h.new > > diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h > index 829f32f..01ac153 100644 > --- a/drivers/net/tap/rte_eth_tap.h > +++ b/drivers/net/tap/rte_eth_tap.h > @@ -45,7 +45,7 @@ > #include > > #ifdef IFF_MULTI_QUEUE > -#define RTE_PMD_TAP_MAX_QUEUES 16 > +#define RTE_PMD_TAP_MAX_QUEUES TAP_MAX_QUEUES > #else > #define RTE_PMD_TAP_MAX_QUEUES 1 > #endif > @@ -90,6 +90,11 @@ struct pmd_internals { > int ioctl_sock; /* socket for ioctl calls */ > int nlsk_fd; /* Netlink socket fd */ > int flow_isolate; /* 1 if flow isolation is enabled */ > + int flower_support; /* 1 if kernel supports, else 0 */ > + int flower_vlan_support; /* 1 if kernel supports, else 0 */ > + int rss_enabled; /* 1 if RSS is enabled, else 0 */ > + /* implicit rules set when RSS is enabled */ > + LIST_HEAD(tap_rss_flows, rte_flow) rss_flows; > LIST_HEAD(tap_flows, rte_flow) flows; /* rte_flow rules */ > /* implicit rte_flow rules set when a remote device is active */ > LIST_HEAD(tap_implicit_flows, rte_flow) implicit_flows; > diff --git a/drivers/net/tap/tap_bpf_elf.h b/drivers/net/tap/tap_bpf_elf.h > new file mode 100644 > index 0000000..f3db1bf > --- /dev/null > +++ b/drivers/net/tap/tap_bpf_elf.h > @@ -0,0 +1,56 @@ > +/******************************************************************************* > + > + Copyright (C) 2015 Daniel Borkmann > + > + Copied from iproute2's include/bpf_elf.h, available at: > + https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git > + > + This file is licensed under GNU General Public License (GPL) v2. > + > + The full GNU General Public License is included in this distribution in > + the file called "LICENSE.GPL". > + > +*******************************************************************************/ > + > + > +#ifndef __BPF_ELF__ > +#define __BPF_ELF__ > + > +#include > + > +/* Note: > + * > + * Below ELF section names and bpf_elf_map structure definition > + * are not (!) kernel ABI. It's rather a "contract" between the > + * application and the BPF loader in tc. For compatibility, the > + * section names should stay as-is. Introduction of aliases, if > + * needed, are a possibility, though. 
> + */ > + > +/* ELF section names, etc */ > +#define ELF_SECTION_LICENSE "license" > +#define ELF_SECTION_MAPS "maps" > +#define ELF_SECTION_PROG "prog" > +#define ELF_SECTION_CLASSIFIER "classifier" > +#define ELF_SECTION_ACTION "action" > + > +#define ELF_MAX_MAPS 64 > +#define ELF_MAX_LICENSE_LEN 128 > + > +/* Object pinning settings */ > +#define PIN_NONE 0 > +#define PIN_OBJECT_NS 1 > +#define PIN_GLOBAL_NS 2 > + > +/* ELF map definition */ > +struct bpf_elf_map { > + __u32 type; > + __u32 size_key; > + __u32 size_value; > + __u32 max_elem; > + __u32 flags; > + __u32 id; > + __u32 pinning; > +}; > + > +#endif /* __BPF_ELF__ */ > diff --git a/drivers/net/tap/tap_flow.c b/drivers/net/tap/tap_flow.c > index ffc0b85..43bab7d 100644 > --- a/drivers/net/tap/tap_flow.c > +++ b/drivers/net/tap/tap_flow.c > @@ -43,6 +43,9 @@ > #include > #include > > +#include > +#include > + > #ifndef HAVE_TC_FLOWER > /* > * For kernels < 4.2, this enum is not defined. Runtime checks will be made to > @@ -104,6 +107,23 @@ struct remote_rule { > int mirred; > }; > > +struct action_data { > + char id[16]; > + > + union { > + struct tc_gact gact; > + struct tc_mirred mirred; > + struct skbedit { > + struct tc_skbedit skbedit; > + uint16_t queue; > + } skbedit; > + struct bpf { > + int bpf_fd; > + char *annotation; > + } bpf; > + }; > +}; > + > static int tap_flow_create_eth(const struct rte_flow_item *item, void *data); > static int tap_flow_create_vlan(const struct rte_flow_item *item, void *data); > static int tap_flow_create_ipv4(const struct rte_flow_item *item, void *data); > @@ -134,6 +154,8 @@ struct remote_rule { > int set, > struct rte_flow_error *error); > > +static int rss_enable(struct pmd_internals *pmd); > + > static const struct rte_flow_ops tap_flow_ops = { > .validate = tap_flow_validate, > .create = tap_flow_create, > @@ -816,111 +838,64 @@ struct tap_flow_items { > } > > /** > - * Transform a DROP/PASSTHRU action item in the provided flow for TC. > - * > - * @param[in, out] flow > - * Flow to be filled. > - * @param[in] action > - * Appropriate action to be set in the TCA_GACT_PARMS structure. > - * > - * @return > - * 0 if checks are alright, -1 otherwise. > + * FIXME > */ > static int > -add_action_gact(struct rte_flow *flow, int action) > +add_action(struct rte_flow *flow, size_t *act_index, struct action_data *adata) > { > struct nlmsg *msg = &flow->msg; > - size_t act_index = 1; > - struct tc_gact p = { > - .action = action > - }; > > - if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0) > - return -1; > - if (nlattr_nested_start(msg, act_index++) < 0) > + if (nlattr_nested_start(msg, ++(*act_index)) < 0) > return -1; > - nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("gact"), "gact"); > - if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0) > - return -1; > - nlattr_add(&msg->nh, TCA_GACT_PARMS, sizeof(p), &p); > - nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > - nlattr_nested_finish(msg); /* nested act_index */ > - nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > - return 0; > -} > - > -/** > - * Transform a MIRRED action item in the provided flow for TC. > - * > - * @param[in, out] flow > - * Flow to be filled. > - * @param[in] ifindex > - * Netdevice ifindex, where to mirror/redirect packet to. > - * @param[in] action_type > - * Either TCA_EGRESS_REDIR for redirection or TCA_EGRESS_MIRROR for mirroring. > - * > - * @return > - * 0 if checks are alright, -1 otherwise. 
> - */ > -static int > -add_action_mirred(struct rte_flow *flow, uint16_t ifindex, uint16_t action_type) > -{ > - struct nlmsg *msg = &flow->msg; > - size_t act_index = 1; > - struct tc_mirred p = { > - .eaction = action_type, > - .ifindex = ifindex, > - }; > > - if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0) > - return -1; > - if (nlattr_nested_start(msg, act_index++) < 0) > - return -1; > - nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("mirred"), "mirred"); > + nlattr_add(&msg->nh, TCA_ACT_KIND, strlen(adata->id), adata->id); > if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0) > return -1; > - if (action_type == TCA_EGRESS_MIRROR) > - p.action = TC_ACT_PIPE; > - else /* REDIRECT */ > - p.action = TC_ACT_STOLEN; > - nlattr_add(&msg->nh, TCA_MIRRED_PARMS, sizeof(p), &p); > + if (strcmp("gact", adata->id) == 0) { > + nlattr_add(&msg->nh, TCA_GACT_PARMS, sizeof(adata->gact), > + &adata->gact); > + } else if (strcmp("mirred", adata->id) == 0) { > + if (adata->mirred.eaction == TCA_EGRESS_MIRROR) > + adata->mirred.action = TC_ACT_PIPE; > + else /* REDIRECT */ > + adata->mirred.action = TC_ACT_STOLEN; > + nlattr_add(&msg->nh, TCA_MIRRED_PARMS, sizeof(adata->mirred), > + &adata->mirred); > + } else if (strcmp("skbedit", adata->id) == 0) { > + nlattr_add(&msg->nh, TCA_SKBEDIT_PARMS, > + sizeof(adata->skbedit.skbedit), > + &adata->skbedit.skbedit); > + nlattr_add16(&msg->nh, TCA_SKBEDIT_QUEUE_MAPPING, > + adata->skbedit.queue); > + } else if (strcmp("bpf", adata->id) == 0) { > + nlattr_add32(&msg->nh, TCA_ACT_BPF_FD, adata->bpf.bpf_fd); > + nlattr_add(&msg->nh, TCA_ACT_BPF_NAME, > + strlen(adata->bpf.annotation), > + adata->bpf.annotation); > + } else { > + return -1; > + } > nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > nlattr_nested_finish(msg); /* nested act_index */ > - nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > return 0; > } > > /** > - * Transform a QUEUE action item in the provided flow for TC. > - * > - * @param[in, out] flow > - * Flow to be filled. > - * @param[in] queue > - * Queue id to use. > - * > - * @return > - * 0 if checks are alright, -1 otherwise. 
> + * FIXME > */ > static int > -add_action_skbedit(struct rte_flow *flow, uint16_t queue) > +add_actions(struct rte_flow *flow, int nb_actions, struct action_data *data, > + int classifier_action) > { > struct nlmsg *msg = &flow->msg; > - size_t act_index = 1; > - struct tc_skbedit p = { > - .action = TC_ACT_PIPE > - }; > + size_t act_index = 0; > + int i; > > - if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0) > - return -1; > - if (nlattr_nested_start(msg, act_index++) < 0) > + if (nlattr_nested_start(msg, classifier_action) < 0) > return -1; > - nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("skbedit"), "skbedit"); > - if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0) > - return -1; > - nlattr_add(&msg->nh, TCA_SKBEDIT_PARMS, sizeof(p), &p); > - nlattr_add16(&msg->nh, TCA_SKBEDIT_QUEUE_MAPPING, queue); > - nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > - nlattr_nested_finish(msg); /* nested act_index */ > + for (i = 0; i < nb_actions; i++) > + if (add_action(flow, &act_index, data + i) < 0) > + return -1; > nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > return 0; > } > @@ -1053,7 +1028,12 @@ struct tap_flow_items { > } > } > if (mirred && flow) { > - uint16_t if_index = pmd->if_index; > + struct action_data adata = { > + .id = "mirred", > + .mirred = { > + .eaction = mirred, > + }, > + }; > > /* > * If attr->egress && mirred, then this is a special > @@ -1061,9 +1041,13 @@ struct tap_flow_items { > * redirect packets coming from the DPDK App, out > * through the remote netdevice. > */ > - if (attr->egress) > - if_index = pmd->remote_if_index; > - if (add_action_mirred(flow, if_index, mirred) < 0) > + adata.mirred.ifindex = attr->ingress ? pmd->if_index : > + pmd->remote_if_index; > + if (mirred == TCA_EGRESS_MIRROR) > + adata.mirred.action = TC_ACT_PIPE; > + else > + adata.mirred.action = TC_ACT_STOLEN; > + if (add_actions(flow, 1, &adata, TCA_FLOWER_ACT) < 0) > goto exit_action_not_supported; > else > goto end; > @@ -1077,14 +1061,33 @@ struct tap_flow_items { > if (action) > goto exit_action_not_supported; > action = 1; > - if (flow) > - err = add_action_gact(flow, TC_ACT_SHOT); > + if (flow) { > + struct action_data adata = { > + .id = "gact", > + .gact = { > + .action = TC_ACT_SHOT, > + }, > + }; > + > + err = add_actions(flow, 1, &adata, > + TCA_FLOWER_ACT); > + } > } else if (actions->type == RTE_FLOW_ACTION_TYPE_PASSTHRU) { > if (action) > goto exit_action_not_supported; > action = 1; > - if (flow) > - err = add_action_gact(flow, TC_ACT_UNSPEC); > + if (flow) { > + struct action_data adata = { > + .id = "gact", > + .gact = { > + /* continue */ > + .action = TC_ACT_UNSPEC, > + }, > + }; > + > + err = add_actions(flow, 1, &adata, > + TCA_FLOWER_ACT); > + } > } else if (actions->type == RTE_FLOW_ACTION_TYPE_QUEUE) { > const struct rte_flow_action_queue *queue = > (const struct rte_flow_action_queue *) > @@ -1096,22 +1099,30 @@ struct tap_flow_items { > if (!queue || > (queue->index > pmd->dev->data->nb_rx_queues - 1)) > goto exit_action_not_supported; > - if (flow) > - err = add_action_skbedit(flow, queue->index); > + if (flow) { > + struct action_data adata = { > + .id = "skbedit", > + .skbedit = { > + .skbedit = { > + .action = TC_ACT_PIPE, > + }, > + .queue = queue->index, > + }, > + }; > + > + err = add_actions(flow, 1, &adata, > + TCA_FLOWER_ACT); > + } > } else if (actions->type == RTE_FLOW_ACTION_TYPE_RSS) { > - /* Fake RSS support. 
*/ > const struct rte_flow_action_rss *rss = > (const struct rte_flow_action_rss *) > actions->conf; > > - if (action) > - goto exit_action_not_supported; > - action = 1; > - if (!rss || rss->num < 1 || > - (rss->queue[0] > pmd->dev->data->nb_rx_queues - 1)) > + if (action++) > goto exit_action_not_supported; > - if (flow) > - err = add_action_skbedit(flow, rss->queue[0]); > + if (!pmd->rss_enabled) > + err = rss_enable(pmd); > + (void)rss; > } else { > goto exit_action_not_supported; > } > @@ -1632,6 +1643,127 @@ int tap_flow_implicit_destroy(struct pmd_internals *pmd, > return 0; > } > > +#define BPF_PROGRAM "tap_bpf_program.o" > + > +/** > + * Enable RSS on tap: create leading TC rules for queuing. > + */ > +static int rss_enable(struct pmd_internals *pmd) > +{ > + struct rte_flow *rss_flow = NULL; > + char section[64]; > + struct nlmsg *msg = NULL; > + /* 4096 is the maximum number of instructions for a BPF program */ > + char annotation[256]; > + int bpf_fd; > + int i; > + > + /* > + * Add a rule per queue to match reclassified packets and direct them to > + * the correct queue. > + */ > + for (i = 0; i < pmd->dev->data->nb_rx_queues; i++) { > + struct action_data adata = { > + .id = "skbedit", > + .skbedit = { > + .skbedit = { > + .action = TC_ACT_PIPE, > + }, > + .queue = i, > + }, > + }; > + > + bpf_fd = 0; > + > + rss_flow = rte_malloc(__func__, sizeof(struct rte_flow), 0); > + if (!rss_flow) { > + RTE_LOG(ERR, PMD, > + "Cannot allocate memory for rte_flow"); > + return -1; > + } > + msg = &rss_flow->msg; > + tc_init_msg(msg, pmd->if_index, RTM_NEWTFILTER, NLM_F_REQUEST | > + NLM_F_ACK | NLM_F_EXCL | NLM_F_CREATE); > + msg->t.tcm_info = TC_H_MAKE((i + PRIORITY_OFFSET) << 16, > + htons(ETH_P_ALL)); > + msg->t.tcm_parent = TC_H_MAKE(MULTIQ_MAJOR_HANDLE, 0); > + tap_flow_set_handle(rss_flow); > + nlattr_add(&msg->nh, TCA_KIND, sizeof("bpf"), "bpf"); > + if (nlattr_nested_start(msg, TCA_OPTIONS) < 0) > + return -1; > + nlattr_add32(&msg->nh, TCA_BPF_FD, bpf_fd); > + snprintf(annotation, sizeof(annotation), "%s:[%s]", > + BPF_PROGRAM, section); > + nlattr_add(&msg->nh, TCA_BPF_NAME, strlen(annotation), > + annotation); > + > + if (add_actions(rss_flow, 1, &adata, TCA_BPF_ACT) < 0) > + return -1; > + nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */ > + /* Netlink message is now ready to be sent */ > + if (nl_send(pmd->nlsk_fd, &msg->nh) < 0) > + return -1; > + if (nl_recv_ack(pmd->nlsk_fd) < 0) > + return -1; > + LIST_INSERT_HEAD(&pmd->rss_flows, rss_flow, next); > + } > + > + snprintf(annotation, sizeof(annotation), "%s:[%s]", BPF_PROGRAM, > + section); > + rss_flow = rte_malloc(__func__, sizeof(struct rte_flow), 0); > + if (!rss_flow) { > + RTE_LOG(ERR, PMD, > + "Cannot allocate memory for rte_flow"); > + return -1; > + } > + msg = &rss_flow->msg; > + tc_init_msg(msg, pmd->if_index, RTM_NEWTFILTER, > + NLM_F_REQUEST | NLM_F_ACK | NLM_F_EXCL | NLM_F_CREATE); > + msg->t.tcm_info = > + TC_H_MAKE((RTE_PMD_TAP_MAX_QUEUES + PRIORITY_OFFSET) << 16, > + htons(ETH_P_ALL)); > + msg->t.tcm_parent = TC_H_MAKE(MULTIQ_MAJOR_HANDLE, 0); > + tap_flow_set_handle(rss_flow); > + nlattr_add(&msg->nh, TCA_KIND, sizeof("flower"), "flower"); > + if (nlattr_nested_start(msg, TCA_OPTIONS) < 0) > + return -1; > + > + /* no fields for matching: all packets must match */ > + { > + /* Actions */ > + struct action_data data[2] = { > + [0] = { > + .id = "bpf", > + .bpf = { > + .bpf_fd = bpf_fd, > + .annotation = annotation, > + }, > + }, > + [1] = { > + .id = "gact", > + .gact = { > + /* continue */ > + 
.action = TC_ACT_UNSPEC, > + }, > + }, > + }; > + > + if (add_actions(rss_flow, 2, data, TCA_FLOWER_ACT) < 0) > + return -1; > + } > + nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */ > + nlattr_nested_finish(msg); /* nested TCA_OPTIONS */ > + /* Netlink message is now ready to be sent */ > + if (nl_send(pmd->nlsk_fd, &msg->nh) < 0) > + return -1; > + if (nl_recv_ack(pmd->nlsk_fd) < 0) > + return -1; > + LIST_INSERT_HEAD(&pmd->rss_flows, rss_flow, next); > + > + pmd->rss_enabled = 1; > + return 0; > +} > + > /** > * Manage filter operations. > *
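
Since tap_bpf_program.c itself is not part of this RFC, here is a minimal
sketch of what such a cls_bpf program could look like, so readers can follow
the design (section and identifier names are illustrative, not from your
patch; the "classifier" section name matches ELF_SECTION_CLASSIFIER from
tap_bpf_elf.h):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    #define __section(x) __attribute__((section(x), used))
    #endif

    #ifndef TAP_MAX_QUEUES
    #define TAP_MAX_QUEUES 16  /* kept in sync with the PMD's Makefile */
    #endif

    /* Loaded by the PMD on the RSS rules: compute a hash over the
     * packet, select a queue, and let TC reclassify so the skbedit
     * rules can redirect to that queue. */
    __section("classifier") int
    tap_rss_classify(struct __sk_buff *skb)
    {
            __u32 hash = 0;  /* placeholder: the real program computes
                              * the Toeplitz hash over the 5-tuple here */
            __u32 queue = hash % TAP_MAX_QUEUES;

            skb->cb[0] = queue;     /* picked up after reclassification */
            return TC_ACT_UNSPEC;   /* continue through the TC rules */
    }

    char _license[] __section("license") = "GPL";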