Date: Sun, 25 Sep 2022 06:32:12 -0400
Subject: Re: [PATCH v2 1/3] net/bonding: support Tx prepare
To: fengchengwen, Konstantin Ananyev, Ferruh Yigit, "thomas@monjalon.net", "andrew.rybchenko@oktetlabs.ru", Ferruh Yigit
Cc: "dev@dpdk.org", "chas3@att.com", "humin (Q)"
References: <1619171202-28486-2-git-send-email-tangchengchang@huawei.com> <20220725040842.35027-1-fengchengwen@huawei.com> <20220725040842.35027-2-fengchengwen@huawei.com> <495fb2f0-60c2-f1c9-2985-0d08bb463ad0@xilinx.com> <4b4af3e8-710a-ae75-8171-331ebfe4e4f7@huawei.com> <6c91f993-b11d-987c-6d20-38ee11f9f9db@gmail.com> <509a1984-841a-e42c-05c1-707b024ef7a8@huawei.com> <863016bd-a20b-8a9f-8edc-cfddc0593546@gmail.com> <4a978f38-4f38-4630-ca91-fb96a5789d6f@gmail.com> <66c366e5-3634-3ecb-c605-a4c23278ed88@huawei.com>
From: Chas Williams <3chas3@gmail.com>
In-Reply-To: <66c366e5-3634-3ecb-c605-a4c23278ed88@huawei.com>
List-Id: DPDK patches and discussions
Errors-To:
dev-bounces@dpdk.org

On 9/21/22 22:12, fengchengwen wrote:
>
>
> On 2022/9/20 7:02, Chas Williams wrote:
>>
>>
>> On 9/19/22 10:07, Konstantin Ananyev wrote:
>>>
>>>>
>>>> On 9/16/22 22:35, fengchengwen wrote:
>>>>> Hi Chas,
>>>>>
>>>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>>>
>>>>>>> The main problem is that it is hard to design a tx_prepare for a bonding device:
>>>>>>> 1. As Chas Williams said, the hash may be calculated twice to get the target
>>>>>>>       slave devices.
>>>>>>> 2. More importantly, if the slave devices change (e.g. a slave device's link
>>>>>>>       goes down or the device is removed), and the change happens between
>>>>>>>       bond-tx-prepare and bond-tx-burst, the output slave will change, and this
>>>>>>>       may lead to checksum failures. (Note: the slave devices of a bond device
>>>>>>>       may come from different vendors and may have different requirements, e.g.
>>>>>>>       slave-A calculates the IPv4 pseudo-header automatically (no driver
>>>>>>>       pre-calc needed), but slave-B needs the driver to pre-calc it.)
>>>>>>>
>>>>>>> The current design covers the above two scenarios by using in-place tx-prepare. In
>>>>>>> addition, bond devices are not transparent to applications, so I think it's a
>>>>>>> practical method to provide tx-prepare support in this way.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (just add tx-prepare by default), but the community
>>>>> is concerned about the performance impact.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_tx_prepare API affects
>>>> performance adversely, that is not a bonding problem.
>>>> All applications
>>>> should be calling rte_eth_tx_prepare. There's no defined API
>>>> to determine if rte_eth_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMDs that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>>> the following patch locally which I never got around to submitting.
>>>>>
>>>>> You are right, I will send a V3 to fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams"
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams
>>>>>
>>>>> ...
>>>>>
>>>>>>                 * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>        /* Transmit burst on each active slave */
>>>>>>        for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                        bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>>>> so the packet data may be sent and edited at the same time. Is this likely to cause problems?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm interesting....
>>> My understanding was quite the opposite - tx_burst() can't modify packet data and metadata
>>> (except when refcnt==1 and tx_burst() is going to free the mbuf and put it back to the mempool).
>>> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
>>> was introduced.
>>
>> Is that documented anywhere? It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself.  Currently,
>> the af_packet, dpaa2, and vhost drivers call rte_vlan_insert. Before 2019,
>> the virtio driver also used to call rte_vlan_insert during its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header. Regardless, it looks like rte_eth_tx_prepare should always be
>> called.
>> Handling that correctly in broadcast mode probably means always
>> making a deep copy of the packet, or checking to see if all the members are
>> the same PMD type. If so, you can just call prepare once. You could track
>> the mismatched nature during addition/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
>
> rte_eth_tx_prepare has this note:
>   * Since this function can modify packet data, provided mbufs must be safely
>   * writable (e.g. modified data cannot be in shared segment).
> but rte_eth_tx_burst has no such requirement.
>
> Besides the rte_vlan_insert examples above, there are also some PMDs that modify the mbuf's header
> and data, e.g. hns3/ark/bnxt will invoke rte_pktmbuf_append if the pkt-len is too small.
>
> I would prefer that rte_eth_tx_burst add such a restriction: the PMD should not modify the mbuf unless refcnt==1,
> so that applications can rely on an explicit definition.
>
>
> As for this bonding scenario, we have three alternatives:
> 1) As in the patch Chas provided, always do tx-prepare before tx-burst. It is simple, but it
>    may modify the mbuf without the application being able to detect it (unless specially documented).
> 2) My patch: the application can invoke prepare_enable/disable to control whether to do prepare.
> 3) Implement the bonding PMD's tx-prepare, which does tx-prepare for each slave. But there is a problem:
>    if the slave devices change (e.g. a new device is added), some packet errors may occur because we have not
>    done prepare for the newly added device.
>
> note1: 1 and 2 above both violate rte_eth_tx_burst's requirement, so we should document this specially.
> note2: we can do some optimization for 3, e.g. if the same driver name is detected on multiple slave
>   devices, tx-prepare only needs to be performed once. But the problem described above still exists
>   because slave devices can change dynamically at runtime.
>
> Hoping for more discussion.
> @Ferruh @Chas @Humin @Konstantin

I don't think adding an additional API due to concerns about performance
is the solution to the performance problem. If the tx_prepare API is slow,
that's what needs to be fixed. I imagine that more drivers will be using
the tx_prepare API over time, not fewer. It would be a good idea to get
used to calling it.

As for broadcast mode, let's just call tx_prepare once for any given
packet. For now, assume that no one would attempt to bond different PMDs
together. In my experience, that would be unusual. I have never seen anyone
do that in a production context. If a bug report comes in about this
failing for someone, we can fix it then.

>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_tx_prepare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>                tx_failed_flag = 1;
>> .
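
For illustration only, the "call tx_prepare once per packet in broadcast mode" suggestion could look roughly like the sketch below. It is not the bonding driver code. mock_tx_prepare and mock_tx_burst are hypothetical stand-ins that only mirror the argument order of rte_eth_tx_prepare and rte_eth_tx_burst, struct pkt is a hypothetical stand-in for struct rte_mbuf, and broadcast_prepare_once is a made-up name. The refcount handling of the real broadcast path is left out, and the sketch assumes all members use the same PMD, so preparing against the first member is good enough for the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; counts prepare passes. */
struct pkt {
	int prepared;
};

/* Hypothetical stand-in for rte_eth_tx_prepare(): touches every packet. */
static uint16_t
mock_tx_prepare(uint16_t port_id, uint16_t queue_id,
		struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	(void)port_id;
	(void)queue_id;
	for (i = 0; i < nb_pkts; i++)
		pkts[i]->prepared++;
	return nb_pkts;
}

/* Hypothetical stand-in for rte_eth_tx_burst(): accepts every packet. */
static uint16_t
mock_tx_burst(uint16_t port_id, uint16_t queue_id,
	      struct pkt **pkts, uint16_t nb_pkts)
{
	(void)port_id;
	(void)queue_id;
	(void)pkts;
	return nb_pkts;
}

/*
 * Prepare the burst once (against the first member), then transmit the
 * prepared packets on every member. Returns the number of packets that
 * passed prepare; tx_total[i] receives the per-member transmit count.
 */
uint16_t
broadcast_prepare_once(const uint16_t *members, uint16_t nb_members,
		       uint16_t queue_id, struct pkt **pkts,
		       uint16_t nb_pkts, uint16_t *tx_total)
{
	uint16_t nb_prep;
	uint16_t i;

	assert(members != NULL && pkts != NULL && tx_total != NULL);
	if (nb_members == 0)
		return 0;

	/* Single prepare pass, assuming identical PMDs on all members. */
	nb_prep = mock_tx_prepare(members[0], queue_id, pkts, nb_pkts);

	for (i = 0; i < nb_members; i++)
		tx_total[i] = mock_tx_burst(members[i], queue_id,
					    pkts, nb_prep);
	return nb_prep;
}
```

With mismatched member PMDs this would not be enough. As discussed earlier in the thread, that case would need a per-member prepare on a copy of the packet headers.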