Subject: Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
From: Andrew Rybchenko
To: Chengchang Tang, dev@dpdk.org
Cc: linuxarm@huawei.com, chas3@att.com, humin29@huawei.com, ferruh.yigit@intel.com, konstantin.ananyev@intel.com
Date: Mon, 14 Jun 2021 17:16:53 +0300
List-Id: DPDK patches and discussions

On 6/10/21 10:32 AM, Chengchang Tang wrote:
> On 2021/6/9 17:35, Andrew Rybchenko wrote:
>> On 6/9/21 9:42 AM, Chengchang Tang wrote:
>>> Hi, Andrew and Ferruh
>>>
>>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>>>
>>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>>> To use the HW offload capabilities (e.g. checksum and TSO) in the Tx
>>>>> direction, upper-layer users need to call rte_eth_tx_prepare() to make
>>>>> some adjustments to the packets before sending them (e.g. processing
>>>>> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
>>>>> callback of the bonding driver is not implemented. Therefore, the related
>>>>> offloads cannot be used unless upper-layer users process the packets
>>>>> properly in their own applications, which hurts portability.
>>>>>
>>>>> However, it is difficult to design the tx_prepare callback for the bonding
>>>>> driver, because when a bonded device sends packets, it distributes the
>>>>> packets to different slave devices based on the real-time link status and
>>>>> bonding mode. That is, it is very hard for the bonding device to determine
>>>>> which slave device's prepare function should be invoked. In addition, if
>>>>> the link status changes after the packets are prepared, the packets may
>>>>> fail to be sent because the packet distribution may change.
>>>>>
>>>>> So, in this patch, the tx_prepare callback of the bonding driver is not
>>>>> implemented.
Instead, rte_eth_tx_prepare() will be called for
>>>>> all fast-path packets in modes 0, 1, 2, 4, 5 and 6. In this way, all
>>>>> Tx offloads can be processed correctly for all NIC devices in these modes.
>>>>> If tx_prepare is not required in some cases, the slave PMD's tx_prepare
>>>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>>>>> In these cases, the impact on performance will be very limited. It is
>>>>> the responsibility of the slave PMDs to decide when the real tx_prepare
>>>>> needs to be used. The information from dev_configure/queue_setup is
>>>>> sufficient for them to make these decisions.
>>>>>
>>>>> Note:
>>>>> rte_eth_tx_prepare() is not added to bond mode 3 (Broadcast). This is
>>>>> because in broadcast mode a packet needs to be sent by all slave ports.
>>>>> Different PMDs process the packets differently in tx_prepare. As a
>>>>> result, the sent packet may be incorrect.
>>>>>
>>>>> Signed-off-by: Chengchang Tang
>>>>> ---
>>>>>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>>>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>>>>  2 files changed, 24 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>>>>> index 874aa91..1e6cc6d 100644
>>>>> --- a/drivers/net/bonding/rte_eth_bond.h
>>>>> +++ b/drivers/net/bonding/rte_eth_bond.h
>>>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>>>>  int
>>>>>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>>>>
>>>>> -
>>>>>  #ifdef __cplusplus
>>>>>  }
>>>>>  #endif
>>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> index 2e9cea5..84af348 100644
>>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>>  	/* Send packet burst on each slave device */
>>>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>>>  		if (slave_nb_pkts[i] > 0) {
>>>>> +			int nb_prep_pkts;
>>>>> +
>>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>>> +					slave_nb_pkts[i]);
>>>>> +
>>>>
>>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>>> It would allow to decrease performance degradation if no
>>>> Tx offloads are enabled. Same in all cases below.
>>>
>>> Regarding this point, it has been discussed in the previous RFC:
>>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>>
>>> Based on the TX_OFFLOAD status of the current device, PMDs can determine
>>> whether tx_prepare is currently needed. If it is not needed, they set
>>> pkt_tx_prepare to NULL, so that the actual tx_prepare processing is
>>> skipped directly in rte_eth_tx_prepare().
>>
>> I still think that the following is right:
>> no Tx offloads at all => Tx prepare is not necessary.
>>
>> Am I wrong?
>>
>
> Letting PMDs determine whether tx_prepare() needs to be done could reduce the
> performance loss in more scenarios. For example, some offloads do not need a
> Tx prepare, and PMDs could set tx_prepare to NULL in that scenario. Even if
> rte_eth_tx_prepare() is called, it will not invoke the tx_prepare callback
> and will just return. In this case, there is only one branch. If we also
> check whether tx_offloads is non-zero, one more branch is added.

I'll wait for the net/bonding maintainers' decision here. IMHO all the
above assumptions should be proven by performance measurements.

> Of course, some PMDs currently do not optimize tx_prepare, which may have a
> performance impact. We hope to force them to optimize tx_prepare in this
> way, just like they optimize tx_burst. This makes it easier for users to use
> tx_prepare(), and they no longer need to worry that using tx_prepare() will
> introduce unnecessary performance degradation.
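To make the trade-off above concrete, here is a minimal, self-contained sketch of the two guarding strategies being debated. The struct and function names are hypothetical stand-ins, not the real DPDK structures: an `int` plays the role of an mbuf, and `stub_dev` collapses an ethdev down to the two fields that matter for this discussion.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical miniature of an ethdev (not the real DPDK layout). */
struct stub_dev {
	uint64_t tx_offloads;                          /* from dev_configure() */
	uint16_t (*tx_prepare)(int *pkts, uint16_t n); /* may be NULL */
};

/* Example prepare callback that accepts every packet. */
static uint16_t
stub_prepare_all(int *pkts, uint16_t n)
{
	(void)pkts;
	return n;
}

/* Caller-side guard (the review's suggestion): the bonding datapath
 * skips the prepare call entirely when no Tx offloads are enabled,
 * at the cost of an extra branch when they are. */
static uint16_t
tx_prepare_guarded(struct stub_dev *dev, int *pkts, uint16_t n)
{
	if (dev->tx_offloads == 0)
		return n;	/* nothing to fix up, no call at all */
	if (dev->tx_prepare == NULL)
		return n;	/* PMD needs no prepare step */
	return dev->tx_prepare(pkts, n);
}

/* Callee-side guard (what the patch relies on): always call, and let a
 * NULL callback degenerate into a cheap no-op, mirroring how
 * rte_eth_tx_prepare() behaves when a PMD sets no prepare callback. */
static uint16_t
tx_prepare_nullable(struct stub_dev *dev, int *pkts, uint16_t n)
{
	if (dev->tx_prepare == NULL)
		return n;	/* single branch on the fast path */
	return dev->tx_prepare(pkts, n);
}
```

Both variants return the full burst untouched when no preparation is needed; the question in the thread is only whether the offload check belongs in the bonding driver or inside each slave PMD's decision to publish a callback.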
>
> IMHO tx_prepare() should be extended to all scenarios, and the impact on
> performance should be optimized by the PMDs. If the application has to
> decide when tx_prepare() should and should not be used, in many cases it
> will not be used, and that introduces problems.
>
>>>>
>>>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>>> +					slave_bufs[i], nb_prep_pkts);
>>>>
>>>> In fact there is a problem here, and a really big one.
>>>> Tx prepare may fail and return fewer packets. Tx prepare
>>>> of some packet may always fail. If the application tries to
>>>> send packets in a loop until success, it will loop forever
>>>> here. Since the application calls Tx burst, it is 100% legal
>>>> behaviour of the function to return 0 if the Tx ring is full.
>>>> That is not an error indication. However, in the case of Tx
>>>> prepare it is an error indication.
>>>>
>>>> Should we change the Tx burst description and force callers
>>>> to check rte_errno? It sounds like a major change...
>>>>
>>>
>>> I agree that if the failure is caused by the Tx ring being full, it is
>>> legal behaviour. But what about failures caused by other reasons? At
>>> present, it is possible for some PMDs to fail during tx_burst for other
>>> reasons. In that case, repeated attempts to send will also fail.
>>
>> If so, the packet should simply be dropped by Tx burst and Tx burst
>> should move on. If a packet cannot be transmitted, it must be
>> dropped (counted) and Tx burst should move to the next packet.
>>
>>> I'm not sure if all PMDs need to support the behavior of sending packets
>>> in a loop until it succeeds. If not, I think the current problem can be
>>> pointed out to the user by adding a note to the bonding documentation.
>>> If it is necessary, I think the description of tx_burst should also add
>>> related instructions, so that PMD developers can better understand how
>>> tx_burst should be designed, such as putting all hardware-related
>>> constraint checks into tx_prepare. Another prerequisite for the above
>>> behavior is that the packets must be prepared (i.e. checked by
>>> rte_eth_tx_prepare()). Otherwise, sending may also fail. This means that
>>> we have to use rte_eth_tx_prepare() in more scenarios.
>>
>> IMHO any PMD-specific behaviour is a nightmare for application
>> developers and must be avoided. Ideally an application should not
>> care whether it is running on top of tap, virtio, failsafe or
>> bonding. It should talk to ethdev in terms of the ethdev API, that's
>> it. I know that net/bonding is designed so that the application should
>> know about it, but IMHO the places where that knowledge is required
>> must be minimized to make applications more portable
>> across various PMDs/HW.
>>
>> I think that the only sensible solution for the above problem is
>> to skip a packet which prepare dislikes, count it as dropped,
>> and try to prepare/transmit subsequent packets.
>
> Agree, I will fix this in the next version.
>
>>
>> It is an interesting effect of doing Tx prepare just before
>> Tx burst inside the bonding PMD: if Tx burst fails to send
>> something because the ring is full, a number of packets will
>> be processed by Tx prepare again and again. I guess it is
>> unavoidable.
>>
>>> What's Ferruh's opinion on this?
>>>
>>>> [snip]
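The skip-and-count approach suggested above can be sketched as follows. This is a simplified, self-contained model, not the real rte_eth_* API: an `int` stands in for an mbuf (negative values are "unpreparable" packets), and the stub prepare/burst functions are hypothetical. The point is that a packet rejected by prepare is dropped and counted instead of stalling the burst forever.

```c
#include <stdint.h>

typedef int pkt_t;	/* stand-in for struct rte_mbuf * */

/* Stub "tx_prepare": returns the number of leading packets that are
 * valid, mimicking rte_eth_tx_prepare() stopping at the first packet
 * it cannot prepare. Here, negative packets are always rejected. */
static uint16_t
stub_tx_prepare(pkt_t *pkts, uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++)
		if (pkts[i] < 0)
			break;
	return i;
}

/* Stub "tx_burst": pretend the Tx ring always has room. */
static uint16_t
stub_tx_burst(pkt_t *pkts, uint16_t n)
{
	(void)pkts;
	return n;
}

/* Skip-and-count strategy: when prepare rejects a packet, drop just
 * that packet (a real driver would also free the mbuf and bump a
 * drop counter), then keep preparing/transmitting the rest, so a
 * permanently bad packet cannot cause an endless retry loop. */
static uint16_t
tx_with_prepare(pkt_t *pkts, uint16_t n, uint16_t *dropped)
{
	uint16_t sent = 0, pos = 0;

	*dropped = 0;
	while (pos < n) {
		uint16_t ok = stub_tx_prepare(pkts + pos, n - pos);

		sent += stub_tx_burst(pkts + pos, ok);
		pos += ok;
		if (pos < n) {	/* pkts[pos] was rejected: drop it, move on */
			(*dropped)++;
			pos++;
		}
	}
	return sent;
}
```

With a burst of {1, 2, -1, 3, -1, 4}, four packets are transmitted and two are counted as dropped; the caller is never asked to retry a packet that prepare will always refuse.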