From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chas Williams <3chas3@gmail.com>
Date: Mon, 26 Sep 2022 12:36:56 -0400
Subject: Re: [PATCH v2 1/3] net/bonding: support Tx prepare
To: Konstantin Ananyev, Fengchengwen, Ferruh Yigit, "thomas@monjalon.net",
 "andrew.rybchenko@oktetlabs.ru", "konstantin.ananyev@intel.com"
Cc: "dev@dpdk.org", "chas3@att.com", "humin (Q)"
List-Id: DPDK patches and discussions

On 9/26/22 06:18, Konstantin Ananyev wrote:
>
> Hi everyone,
>
> Sorry
for late reply.
>
>>>>>>> The main problem is that it is hard to design a tx_prepare for a bonding device:
>>>>>>> 1. as Chas Williams said, there may be two hash calculations to get the
>>>>>>>    target slave devices.
>>>>>>> 2. also more important, if the slave devices change (e.g. a slave device
>>>>>>>    link goes down or is removed), and the change happens between bond-tx-prepare
>>>>>>>    and bond-tx-burst, the output slave will change, and this may lead to checksum
>>>>>>>    failures. (Note: a bond device's slave devices may be from different vendors,
>>>>>>>    and slave devices may have different requirements, e.g. slave-A supports
>>>>>>>    calculating the IPv4 pseudo-header automatically (no need for the driver to
>>>>>>>    pre-calculate), but slave-B needs the driver to pre-calculate).
>>>>>>>
>>>>>>> The current design covers the above two scenarios by using in-place tx-prepare.
>>>>>>> In addition, bond devices are not transparent to applications, so I think this
>>>>>>> is a practical way to provide tx-prepare support.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (just add tx-prepare by default), but the
>>>>> community was concerned about impacting performance.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>>> performance adversely, that is not a bonding problem. All applications
>>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMDs that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in the 802.3ad support. We have been
>>>>>> using the following patch locally, which I never got around to submitting.
>>>>>
>>>>> You are right, I will send a V3 to fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>> From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams"
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams
>>>>>
>>>>> ...
>>>>>
>>>>>>               * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>      /* Transmit burst on each active slave */
>>>>>>      for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                      bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> The tx-prepare may edit packet data, and the broadcast mode will send a
>>>>> packet to all slaves; the packet data is sent and edited at the same time.
>>>>> Is this likely to cause problems?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm, interesting....
>>> My understanding was quite the opposite - tx_burst() can't modify packet data
>>> and metadata (except when refcnt==1 and tx_burst() is going to free the mbuf
>>> and put it back into the mempool). While tx_prepare() can - actually, as I
>>> remember, that was one of the reasons why a separate routine was introduced.
>>
>> Is that documented anywhere?
>
> I looked through, but couldn't find much except what was already mentioned by Fengcheng:
> rte_eth_tx_prepare() notes:
> * Since this function can modify packet data, provided mbufs must be safely
> * writable (e.g. modified data cannot be in shared segment).
> Probably that's not explicit enough, as it doesn't clearly forbid modifying
> packets in tx_burst.

This certainly seems like one of those gray areas in the DPDK APIs. It
should be made clear what is expected as far as behavior.
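For reference, the prepare-then-burst pattern being discussed looks roughly like the sketch below. This is only an illustration of the ethdev calling convention, not the bonding driver's actual code; the drop-on-failure policy here is an assumption (a real driver might instead queue the leftovers for retry, as the broadcast path does with its failure ring).

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Sketch: run rte_eth_tx_prepare before rte_eth_tx_burst. If prepare
 * rejects a packet (nb_prep < nb_pkts), rte_errno describes the problem
 * for bufs[nb_prep]; here we simply free everything that was neither
 * prepared nor transmitted. */
static uint16_t
tx_prepare_and_burst(uint16_t port_id, uint16_t queue_id,
		     struct rte_mbuf **bufs, uint16_t nb_pkts)
{
	uint16_t nb_prep, nb_tx, i;

	nb_prep = rte_eth_tx_prepare(port_id, queue_id, bufs, nb_pkts);
	nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, nb_prep);

	/* Free the packets that prepare rejected or burst didn't take. */
	for (i = nb_tx; i < nb_pkts; i++)
		rte_pktmbuf_free(bufs[i]);

	return nb_tx;
}
```

Note that rte_eth_tx_prepare is safe to call even for a device with no tx_pkt_prepare callback; in that case it returns nb_pkts unchanged, which is why the extra branch lives in ethdev rather than in bonding.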
>
>> It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself. Currently,
>> the af_packet, dpaa2, and vhost drivers call rte_vlan_insert. Before 2019,
>> the virtio driver also used to call rte_vlan_insert during its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header.
>
> Interesting. Usually apps that try to use zero-copy multicast TX have the
> packet-header portion in a separate segment, so it might even keep working.
> But it definitely doesn't look right to me: if mbuf->refcnt > 1, I think it
> should be treated as read-only.

rte_vlan_insert might be a problem with broadcast mode. If the refcnt is > 1,
rte_vlan_insert is going to fail. So, the current broadcast mode implementation
probably doesn't work if any PMD uses rte_vlan_insert. So again, a solution is
to call tx_pkt_prepare once, then increment the reference count, and send to
all the members. That works if your PMD correctly implements tx_pkt_prepare.
If it doesn't, and calls rte_vlan_insert in the transmit routine, that PMD will
need to be fixed to work with bonding.

>> Regardless, it looks like rte_eth_dev_tx_prepare should always be
>> called.
>
> Again, as I remember, the initial agreement was: if any TX offload is enabled,
> tx_prepare() needs to be called (or the user has to implement similar stuff on
> his own). If no TX offload flags were specified for the packet, tx_prepare()
> is not necessary.

For the bonding driver, we potentially have a mix of PMDs for the members.
It's difficult to know in advance if your packets will have TX offload flags
or not. If you have a tx_pkt_prepare stub, there's a good chance that your
packets will have some TX offload flags. So, calling tx_pkt_prepare is likely
the "best" intermediate solution.

>> Handling that correctly in broadcast mode probably means always
>> making a deep copy of the packet, or checking to see if all the members are
>> the same PMD type.
>> If so, you can just call prepare once. You could track
>> the mismatched nature during addition/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
>>
>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD.
>
> Usually you don't need to clone the whole packet. In many cases it is enough
> to just attach the l2/l3/l4 header portion of the packet as the first segment.
> At least that's how the ip_multicast sample works.

Yes, that's what I meant by deep copying the packet headers. You just copy
enough to modify what you need and keep the bulk of the packet otherwise.

>>>> Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_dev_tx_prepare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>          if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>              tx_failed_flag = 1;
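To make the header-copy idea concrete, here is a rough sketch along the lines of the ipv4_multicast sample's output-packet construction. The two mempools and hdr_len are illustrative assumptions, and a real implementation would also propagate fields like port and tx_offload; this is only meant to show the shape of the technique, not a drop-in bonding fix.

```c
#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Sketch: build a per-member copy of a packet where the first hdr_len bytes
 * (the l2/l3/l4 headers) live in a freshly copied direct segment, and the
 * payload is attached as an indirect, refcounted clone. Each member can then
 * run tx_prepare (or rte_vlan_insert) on its private header segment without
 * touching the shared payload. */
static struct rte_mbuf *
dup_for_member(struct rte_mbuf *pkt, struct rte_mempool *hdr_pool,
	       struct rte_mempool *clone_pool, uint16_t hdr_len)
{
	struct rte_mbuf *hdr, *clone;

	/* Refcounted clone of the original packet (shares the payload). */
	clone = rte_pktmbuf_clone(pkt, clone_pool);
	if (clone == NULL)
		return NULL;

	/* Private, writable copy of the headers in a new direct segment. */
	hdr = rte_pktmbuf_alloc(hdr_pool);
	if (hdr == NULL) {
		rte_pktmbuf_free(clone);
		return NULL;
	}
	rte_memcpy(rte_pktmbuf_mtod(hdr, void *),
		   rte_pktmbuf_mtod(pkt, void *), hdr_len);
	hdr->data_len = hdr_len;

	/* Drop the (shared) header bytes from the clone and chain it
	 * behind the private header segment. */
	rte_pktmbuf_adj(clone, hdr_len);
	hdr->next = clone;
	hdr->pkt_len = (uint16_t)(hdr_len + clone->pkt_len);
	hdr->nb_segs = (uint16_t)(clone->nb_segs + 1);
	hdr->ol_flags = pkt->ol_flags;
	return hdr;
}
```

With something like this, broadcast mode could call the member's tx_prepare on each private copy, sidestepping both the shared-mbuf writability problem and the mixed-vendor requirements discussed above, at the cost of one small allocation and header copy per member.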