DPDK patches and discussions
From: Ivan Malov <Ivan.Malov@oktetlabs.ru>
To: Ilya Maximets <i.maximets@ovn.org>,
	Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
	dev@dpdk.org
Cc: Eli Britstein <elibr@nvidia.com>,
	Smadar Fuks <smadarf@marvell.com>,
	Hyong Youb Kim <hyonkim@cisco.com>, Ori Kam <orika@nvidia.com>,
	Jerin Jacob <jerinj@marvell.com>, John Daley <johndale@cisco.com>,
	Thomas Monjalon <thomas@monjalon.net>,
	Ferruh Yigit <ferruh.yigit@intel.com>
Subject: Re: [dpdk-dev] [RFC PATCH] ethdev: clarify flow action PORT ID semantics
Date: Wed, 2 Jun 2021 22:35:09 +0300	[thread overview]
Message-ID: <c0acb703-1f66-ba01-4653-d00a76bf7f4e@oktetlabs.ru> (raw)
In-Reply-To: <87d9d57c-b41c-b594-b80e-aa8af3deeefe@ovn.org>

On 02/06/2021 20:35, Ilya Maximets wrote:
> (Dropped Broadcom folks from CC.  Mail server refuses to accept their
> emails for some reason: "Recipient address rejected: Domain not found."
> Please, try to add them back on reply.)
> 
> On 6/2/21 6:26 PM, Andrew Rybchenko wrote:
>> On 6/2/21 3:46 PM, Ilya Maximets wrote:
>>> On 6/1/21 4:28 PM, Ivan Malov wrote:
>>>> Hi Ilya,
>>>>
>>>> Thank you for reviewing the proposal at such short notice. I'm afraid that prior discussions overlook the simple fact that the whole problem is not limited to just VF representors. Action PORT_ID is also used with respect to the admin PF's ethdev, which "represents itself" (and by no means represents the underlying physical/network port). In this case, one cannot state that the application treats it as a physical port, the way one states that the application perceives representors as the VFs themselves.
>>>
>>>
>>> I don't think that it was overlooked.  If a device is in switchdev mode, then
>>> there are a PF representor and VF representors.  The application typically
>>> works only with representors in this case, as it doesn't make much sense to
>>> have a representor and the upstream port attached to the same application at
>>> the same time.  Configuration applied by the application to the representor
>>> (PF or VF, it doesn't matter) applies to the corresponding upstream port
>>> (actual PF or VF) by default.
>>
>> A PF is not necessarily associated with a network port.
>> There could be many PFs and just one network port on the
>> NIC. Extra PFs are like VFs in this case. These PFs may be
>> passed to a VM in a similar way. So, we can have PF
>> representors similar to VF representors. I.e., it is
>> incorrect to say that a PF in the switchdev case is
>> a representor of a network port.
>>
>> If we prefer to talk in representor terminology, we
>> need 4 types of representors:
>>   - PF representor for a PCIe physical function
>>   - VF representor for a PCIe virtual function
>>   - SF representor for a PCIe sub-function (PASID)
>>   - network port representor
>> In fact, the above is PCIe-oriented, but there are
>> other buses and ways to deliver traffic to applications.
>> Basically, there is a representor for any virtual port in a
>> virtual switch which the DPDK app can control using transfer rules.
>>
>>> Exactly the same thing here with the PORT_ID action.  You have a packet and
>>> an action to send it to the port, but it's not specified whether HW needs to
>>> send it to the representor or the upstream port (again, VF or PF, it doesn't
>>> matter).  Since there is no extra information, HW should send it to the
>>> upstream port by default, the same as configuration applies by default to
>>> the upstream port.
>>>
>>> Let's look at some workflow examples:
>>>
>>>        DPDK Application
>>>          |         |
>>>          |         |
>>>     +--PF-rep------VF-rep---+
>>>     |                       |
>>>     |    NIC (switchdev)    |
>>>     |                       |
>>>     +---PF---------VF-------+
>>>         |          |
>>>         |          |
>>>     External       VM or whatever
>>>     Network
>>
>> See above. PF <-> External Network is incorrect above
>> since it is not always the case. It should be
>> "NP <-> External network" and "NP-rep" above (NP -
>> network port). Sometimes a PF is an NP-rep, but sometimes
>> it is not. It is just a question of default rules in
>> switchdev on what to do with traffic incoming from the
>> network port.
>>
>> A bit more complicated picture is:
>>
>>      +----------------------------------------+
>>      |            DPDK Application            |
>>      +----+---------+---------+---------+-----+
>>           |PF0      |PF1      |         |
>>           |         |         |         |
>>      +--NP1-rep---NP2-rep---PF2-rep---VF-rep--+
>>      |                                        |
>>      |             NIC (switchdev)            |
>>      |                                        |
>>      +---NP1-------NP2-------PF2--------VF----+
>>           |         |         |         |
>>           |         |         |         |
>>       External   External    VM or     VM or
>>      Network 1  Network 2  whatever   whatever
>>
>> So, sometimes a PF plays the network port representor role (PF0,
>> PF1), and sometimes it requires a representor itself (PF2).
>> What to do if PF2 itself is attached to the application?
>> Can we route traffic to it using the PORT_ID action?
>> It has a DPDK ethdev port. It is one of the arguments why
>> plain PORT_ID should route traffic to the DPDK application.
> 
> OK.  This is not very different from my understanding.  The key
> is that there is a pair of interfaces, one is more visible than
> the other one.
> 
>>
>> Of course, some applications would like to see it as
>> (simpler is better):
>>
>>      +----------------------------------------+
>>      |            DPDK Application            |
>>      |                                        |
>>      +---PF0-------PF1------PF2-rep---VF-rep--+
>>           |         |         |         |
>>           |         |         |         |
>>       External   External    VM or     VM or
>>      Network 1  Network 2  whatever   whatever
>>
>> but some, I believe, require the full picture. For example,
>> I'd really like to know how much traffic goes via all 8
>> switchdev ports, and when running rte_eth_stats_get(0, ...)
>> (i.e. DPDK port 0 attached to PF0) I'd like to get
>> NP1-rep stats (not NP1 stats). It will match exactly
>> what I see in the DPDK application. It is an argument why
>> plain PORT_ID should be treated as a DPDK ethdev port,
>> not as a represented (upstream) entity.
> 
> The point is that if an application doesn't require the full picture,
> it should not care.  If an application requires the full picture,
> it could take extra steps by setting extra bits.  I don't
> understand why we would need to force all applications to care about
> the full picture if we can avoid that.
> 
>>
>>> a. Workflow for "DPDK Application" to set MAC to VF:
>>>
>>> 1. "DPDK Application" calls rte_set_etheraddr("VF-rep", new_mac);
>>> 2.  DPDK sets MAC for "VF".
>>>
>>> b. Workflow for "DPDK Application" to set MAC to PF:
>>>
>>> 1. "DPDK Application" calls rte_set_etheraddr("PF-rep", new_mac);
>>> 2.  DPDK sets MAC for "PF".
>>>
>>> c. Workflow for "DPDK Application" to send packet to the external network:
>>>
>>> 1. "DPDK Application" calls rte_eth_tx_burst("PF-rep", packet);
>>> 2. NIC receives the packet from "PF-rep" and sends it to "PF".
>>> 3. packet egresses to the external network from "PF".
>>>
>>> d. Workflow for "DPDK Application" to send packet to the "VM or whatever":
>>>
>>> 1. "DPDK Application" calls rte_eth_tx_burst("VF-rep", packet);
>>> 2. NIC receives the packet from "VF-rep" and sends it to "VF".
>>> 3. "VM or whatever" receives the packet from "VF".
>>>
>>> In the two workflows above there is no rte_flow processing on step 2, i.e.,
>>> the NIC does not perform any lookups/matches/actions, because it's not
>>> possible to configure actions for packets received from "PF-rep" or
>>> "VF-rep": these ports don't own a port ID, and all the configuration
>>> and rte_flow actions are translated and applied to the devices that these
>>> ports represent ("PF" and "VF") and not to the representors themselves
>>> ("PF-rep" or "VF-rep").
>>>
>>> e. Workflow for the packet received on PF and PORT_ID action:
>>>
>>> 1. "DPDK Application" configures rte_flow for all packets from "PF-rep"
>>>     to execute PORT_ID "VF-rep".
>>> 2. NIC receives packet on "PF".
>>> 3. NIC executes 'PORT_ID "VF-rep"' action by sending packet to "VF".
>>> 4. "VM or whatever" receives the packet from "VF".
>>>
>>> f. Workflow for the packet received on VF and PORT_ID action:
>>>
>>> 1. "DPDK Application" configures rte_flow for all packets from "VF-rep"
>>>     to execute 'PORT_ID "PF-rep"'.
>>> 2. NIC receives packet on "VF".
>>> 3. NIC executes 'PORT_ID "PF-rep"' action by sending packet to "PF".
>>> 4. Packet egresses from the "PF" to the external network.
>>>
>>> Above is what, IMHO, the logic should look like and this matches with
>>> the overall switchdev design in kernel.
>>>
>>> I understand that this logic could seem flipped-over from the HW point
>>> of view, but it's perfectly logical from the user's perspective, because
>>> the user should not care whether the application works with representors
>>> or some real devices.  If the application configures that all packets from
>>> port A should be sent to port B, the user will expect these packets to
>>> egress from port B once received on port A.  It would be highly
>>> inconvenient if the packet ingressed from port B back to the
>>> application instead.
>>>
>>>        DPDK Application
>>>          |          |
>>>          |          |
>>>       port A     port B
>>>          |          |
>>>        *****MAGIC*****
>>>          |          |
>>>     External      Another Network
>>>     Network       or VM or whatever
>>>
>>> It should not matter if there is an extra layer between ports A and B
>>> and the external network and VM.  Everything should work in exactly the
>>> same way, transparently for the application.
>>>
>>> The point of hardware offloading, and therefore of the rte_flow API, is to
>>> take what the user does in software and make this "magically" work in
>>> hardware in exactly the same way.  And this will be broken if the user has
>>> to use different logic based on the mode the hardware works in, i.e. based
>>> on whether the application works with ports or their representors.
>>>
>>> If some specific use case requires the application to know whether it's an
>>> upstream port or the representor and to demystify the internals of the
>>> switchdev NIC, there should be a different port ID for the representor
>>> itself that could be used in all DPDK APIs including the rte_flow API, or
>>> a special bit for that matter.  IIRC, there was an idea to add a bit
>>> directly to the port_id for that purpose that would flip over the behavior
>>> in all the workflow scenarios I described above.
>>
>> As I understand it, we're basically on the same page, just
>> fighting over the defaults in DPDK.
> 
> Yep.
> 
>>
>>>>
>>>> Given these facts, it would not be quite right to just align the documentation with the de-facto action meaning assumed by OvS.
>>>
>>> It's not a "meaning assumed by OvS", it's the original design and the
>>> main idea of a switchdev based on a common sense.
>>
>> If so, common sense is not that common :)
>> My "common sense" says me that PORT_ID action
>> should route traffic to DPDK ethdev port to be
>> received by the DPDK application.
> 
> By this logic, rte_eth_tx_burst("VF-rep", packet) should send a packet
> to "VF-rep", i.e. this packet would be received back by the application
> on this same interface.  But that is counter-intuitive, and this is not
> how it works in the Linux kernel if you open a socket and send a
> packet to the "VF-rep" network interface.
> 
> And if rte_eth_tx_burst("VF-rep", packet) sends a packet to "VF" and not
> to "VF-rep", then I don't understand why the PORT_ID action should work
> in the opposite way.

There's no contradiction here.

In the rte_eth_tx_burst(X, packet) example, "X" is the port the
application sits on and from which it sends the packet. In other words,
it's the point where the packet originates, not where it goes.
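
For illustration, a minimal sketch (the port ID, queue number and mbuf
handling are assumed here, not taken from this thread): the first
argument of rte_eth_tx_burst() names the ethdev the application
transmits *through*, so the packet leaves via that port towards whatever
sits at the other end of the wire.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* 'rep_port' is assumed to be the DPDK port ID of some representor ethdev.
 * It names the port the packet originates from, not a destination selector:
 * the packet exits through this ethdev towards its peer. */
static uint16_t
send_one(uint16_t rep_port, struct rte_mbuf *pkt)
{
	return rte_eth_tx_burst(rep_port, 0 /* queue */, &pkt, 1);
}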

At the same time, flow *action* PORT_ID (ID = "X") is clearly the
opposite: it specifies where the packet will go. A port ID is a
characteristic of a DPDK ethdev. So the packet goes *to* the ethdev with
the given ID ("X").

Perhaps consider action PHY_PORT: the index is a characteristic of the
network port. The packet goes *to* the network through this NP, and not
the opposite way. Hopefully, nobody is going to claim that action
PHY_PORT should mean re-injecting the packet back into the HW flow
engine "as if it had just come from the network port". Then why does one
try to skew the PORT_ID meaning this way? PORT_ID points to an ethdev;
the packet goes *to* the ethdev. Isn't that simple?
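
To make the claimed semantics concrete, here is a minimal, hypothetical
sketch (the port IDs, the catch-all pattern and the choice to create the
rule on src_port are illustrative assumptions, not something prescribed
here): the rule delivers matching packets *to* the ethdev dst_port, i.e.
the application would then see them in rte_eth_rx_burst() on that port.

#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_flow.h>

/* Redirect everything seen on 'src_port' to the ethdev 'dst_port' under
 * the literal PORT_ID reading ("go *to* the ethdev with this ID"). */
static int
redirect_to_ethdev(uint16_t src_port, uint16_t dst_port)
{
	struct rte_flow_attr attr = { .transfer = 1 };
	struct rte_flow_item_port_id match_port = { .id = src_port };
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_PORT_ID, .spec = &match_port },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	/* Action PORT_ID names a DPDK ethdev; under the plain reading the
	 * packet is delivered *to* that ethdev, not to the wire behind it. */
	struct rte_flow_action_port_id to_port = { .id = dst_port };
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_PORT_ID, .conf = &to_port },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};
	struct rte_flow_error error;

	if (rte_flow_create(src_port, &attr, pattern, actions, &error) == NULL)
		return -rte_errno;
	return 0;
}

Under the opposite (OvS-style) reading, the very same rule would instead
push the packet out of the "wire" behind dst_port.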

> 
> Application receives a packet from port A and puts it to the port B.
> TC rule to forward packets from port A to port B will provide same result.
> So, why the similar rte_flow should do the opposite and send the packet
> back to the application?

Please see above. Action VF sends the packet *to* the VF and *not* to
the upstream entity which this VF is connected to. Action PHY_PORT sends
the packet *to* the network and does *not* make it appear as if it
entered the NIC from the network side. Action QUEUE sends the packet
*to* the Rx queue and does *not* make it appear as if it had just
egressed from the Tx queue with the same index. Action PORT_ID sends the
packet *to* the ethdev with the given ID and *not* to the upstream
entity which this ethdev is connected to. It's just that transparent.
It's just "do what the name suggests".

Yes, an application (say, OvS) might have a high-level design which
perceives the "high-level" ports plugged into it as a "patch panel" of
sorts. Yes, when the high-level mechanism/logic of such an application
invokes a *datapath-unaware* wrapper to offload a rule and requests that
the packet be delivered to the given "high-level" port, it therefore
requests that the packet be delivered to the opposite end of the wire.
But then the lower-level, datapath-specific (DPDK) handler kicks in.
Since it's DPDK-specific, it knows *everything* about the underlying
flow library it works with. In particular, it knows that action PORT_ID
delivers the packet to an *ethdev*; at the same time, it knows that the
upper caller (high-level logic) for sure wants the opposite, so it (the
lower-level DPDK component) sets the "upstream" bit when translating the
higher-level port action to an RTE action "PORT_ID". Then the resulting
action is correct, and the packet indeed doesn't end up in the ethdev
but goes to the opposite end of the wire. That's it.
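
As a sketch only (the structure and field names below are hypothetical;
the RFC may spell the proposed bit differently), that DPDK-specific
layer could look roughly like this:

#include <stdint.h>

/* Hypothetical shape of the extended action configuration proposed here;
 * the actual field name/layout may differ. */
struct port_id_conf_sketch {
	uint32_t id;           /* DPDK ethdev port ID */
	uint32_t upstream:1;   /* deliver to the entity the ethdev represents */
};

/* DPDK-specific datapath handler: the generic layer of the application
 * asks to "output to high-level port X", which in its model means the
 * opposite end of the wire.  Knowing that action PORT_ID targets the
 * ethdev itself, this layer sets the hypothetical "upstream" bit while
 * translating the request into the RTE flow action. */
static void
translate_output_action(uint16_t dpdk_port, struct port_id_conf_sketch *conf)
{
	conf->id = dpdk_port;
	conf->upstream = 1;
}

Only this DPDK-aware layer needs to know about the bit; the generic
layer above it keeps thinking in terms of its "high-level" ports.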

I have an impression that for some reason people are tempted to ignore
the two nominal "layers" in such applications (the generic, or
high-level, one and the DPDK-specific one), thus trying to align DPDK
logic with the high-level logic of the applications. That's simply not
right. What I'm trying to point out is that it *is* the true job of the
DPDK-specific datapath handler in such an application to properly
translate generic flow actions to DPDK-specific ones. It's the duty of
the DPDK component in such applications to be aware of the genuine
meaning of action PORT_ID.

This way, mixing up the two meanings is ruled out.

> 
>>
>>>>
>>>> On 01/06/2021 15:10, Ilya Maximets wrote:
>>>>> On 6/1/21 1:14 PM, Ivan Malov wrote:
>>>>>> By its very name, action PORT_ID means that packets hit an ethdev with the
>>>>>> given DPDK port ID. At least the current comments don't state the opposite.
>>>>>> That said, since port representors were adopted, applications like OvS
>>>>>> have been misusing the action. They misread its purpose as sending packets
>>>>>> to the opposite end of the "wire" plugged to the given ethdev, for example,
>>>>>> redirecting packets to the VF itself rather than to its representor ethdev.
>>>>>> Another example: OvS relies on this action with the admin PF's ethdev port
>>>>>> ID specified in it in order to send offloaded packets to the physical port.
>>>>>>
>>>>>> Since there might be applications which use this action in its valid sense,
>>>>>> one can't just change the documentation to greenlight the opposite meaning.
>>>>>> This patch adds an explicit bit to the action configuration which will let
>>>>>> applications, depending on their needs, leverage the two meanings properly.
>>>>>> Applications like OvS, as well as PMDs, will have to be corrected when the
>>>>>> patch has been applied. But the improved clarity of the action is worth it.
>>>>>>
>>>>>> The proposed change is not the only option. One could avoid changes in OvS
>>>>>> and PMDs if the new configuration field had the opposite meaning, with the
>>>>>> action itself meaning delivery to the represented port and not to the DPDK one.
>>>>>> Alternatively, one could define a brand new action with the said behaviour.
>>>>>
>>>>> We have already had very similar discussions regarding the understanding of
>>>>> what the representor really is from the DPDK API's point of view, and the
>>>>> last time, IIUC, it was concluded by the tech board that a representor
>>>>> should be a "ghost of a VF", i.e. DPDK APIs should apply configuration by
>>>>> default to the VF and not to the representor device:
>>>>>     https://patches.dpdk.org/project/dpdk/cover/20191029185051.32203-1-thomas@monjalon.net/#104376
>>>>> This wasn't enforced for existing code, though, IIUC, and the semantics are
>>>>> still mixed.
>>>>>
>>>>> I still think that configuration should be applied to the VF, and the same
>>>>> applies to the rte_flow API.  IMHO, an average application should not care
>>>>> whether a device is a VF itself or its representor.  Everything should work
>>>>> exactly the same.  I think this matches the original idea/design of the
>>>>> switchdev functionality in the Linux kernel and also matches how the
>>>>> average user thinks about representor devices.
>>>>>
>>>>> If some specific use case requires distinguishing the VF from the
>>>>> representor, there should probably be a separate special API/flag for that.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>
>>

-- 
Ivan M

Thread overview: 40+ messages
2021-06-01 11:14 Ivan Malov
2021-06-01 12:10 ` Ilya Maximets
2021-06-01 13:24   ` Eli Britstein
2021-06-01 14:35     ` Andrew Rybchenko
2021-06-01 14:44       ` Eli Britstein
2021-06-01 14:50         ` Ivan Malov
2021-06-01 14:53         ` Andrew Rybchenko
2021-06-02  9:57           ` Eli Britstein
2021-06-02 10:50             ` Andrew Rybchenko
2021-06-02 11:21               ` Eli Britstein
2021-06-02 11:57                 ` Andrew Rybchenko
2021-06-02 12:36                 ` Ivan Malov
2021-06-03  9:18                   ` Ori Kam
2021-06-03  9:55                     ` Andrew Rybchenko
2021-06-07  8:28                       ` Thomas Monjalon
2021-06-07  9:42                         ` Andrew Rybchenko
2021-06-07 12:08                           ` Ori Kam
2021-06-07 13:21                             ` Ilya Maximets
2021-06-07 16:07                               ` Thomas Monjalon
2021-06-08 16:13                                 ` Thomas Monjalon
2021-06-08 16:32                                   ` Andrew Rybchenko
2021-06-08 18:49                                     ` Thomas Monjalon
2021-06-09 14:31                                       ` Andrew Rybchenko
2021-06-01 14:49     ` Ivan Malov
2021-06-01 14:28   ` Ivan Malov
2021-06-02 12:46     ` Ilya Maximets
2021-06-02 16:26       ` Andrew Rybchenko
2021-06-02 17:35         ` Ilya Maximets
2021-06-02 19:35           ` Ivan Malov [this message]
2021-06-03  9:29             ` Ilya Maximets
2021-06-03 10:33               ` Andrew Rybchenko
2021-06-03 11:05                 ` Ilya Maximets
2021-06-03 11:29               ` Ivan Malov
2021-06-07 19:27                 ` Ilya Maximets
2021-06-07 20:39                   ` Ivan Malov
2021-06-25 13:04       ` Ferruh Yigit
2021-06-02 12:16   ` Thomas Monjalon
2021-06-02 12:53     ` Ilya Maximets
2021-06-02 13:10     ` Andrew Rybchenko
2021-09-03  7:46 ` [dpdk-dev] [PATCH v1] " Andrew Rybchenko
