From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by dpdk.org (Postfix) with ESMTP id 878EE5A9D for ; Thu, 20 Aug 2015 11:06:54 +0200 (CEST) Received: by wijp15 with SMTP id p15so9992181wij.0 for ; Thu, 20 Aug 2015 02:06:54 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to:content-type :content-transfer-encoding; bh=Pcd6u8aXRFqZutAdfJ6QgFAIxi3ZkLrF1XYvhfwv9zo=; b=dHusaN1KTzD2Fl2mrkX9DMpY9X12QIyoQ4gSTGfbdCf/WQYlOy/cgRUdTV7BA9yyhK OhF/2Iv0I3zdkd2sHjE+KIu6hMGVFmOsVxsJyCNET+huhqzZHnpDzxLMwG1oJv//TJkt zjbXnK3UZzd7aRhtsVYgueB9QcSxQ/7/HWE+ksW61GmyHVE5DG+Ybsjagwub9izQ0OKx G3x8B+oHkoSJllCYpWS/MAKsYU6mM9flKwAk/pjnPVMvKX5VT3R/QlWveCm1oSQKfi0g S5rnzotWiNUr4U6wQP7FTh3w2nUekBJjGh12B322j//UyzXRbgXRwmMebDX8ZeAVRXSL ZzMw== X-Gm-Message-State: ALoCoQmKf4vgNaZj7ujKLg7Y0fXyOxiuqSBzIvfJb9S4JomMkSw6mMoUkWMTZ3bn/M8oIcS7aQHk X-Received: by 10.180.84.196 with SMTP id b4mr15763207wiz.32.1440061614349; Thu, 20 Aug 2015 02:06:54 -0700 (PDT) Received: from [10.0.0.166] ([37.142.229.250]) by smtp.googlemail.com with ESMTPSA id o9sm4601867wja.29.2015.08.20.02.06.52 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 20 Aug 2015 02:06:53 -0700 (PDT) To: "Ananyev, Konstantin" , "Lu, Wenzhuo" References: <1439489195-31553-1-git-send-email-vladz@cloudius-systems.com> <55CD7EA5.6060100@cloudius-systems.com> <6A0DE07E22DDAD4C9103DF62FEBC0909D3E116@shsmsx102.ccr.corp.intel.com> <2601191342CEEE43887BDE71AB97725836A76983@irsmsx105.ger.corp.intel.com> <55D4543E.9000808@cloudius-systems.com> <2601191342CEEE43887BDE71AB97725836A7BEC4@irsmsx105.ger.corp.intel.com> <55D59639.7070206@cloudius-systems.com> <55D59852.30002@cloudius-systems.com> From: Vlad Zolotarov Message-ID: <55D598AA.50209@cloudius-systems.com> Date: Thu, 20 Aug 2015 12:06:50 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0 MIME-Version: 1.0 In-Reply-To: <55D59852.30002@cloudius-systems.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 20 Aug 2015 09:06:54 -0000 On 08/20/15 12:05, Vlad Zolotarov wrote: > > > On 08/20/15 11:56, Vlad Zolotarov wrote: >> >> >> On 08/20/15 11:41, Ananyev, Konstantin wrote: >>> Hi Vlad, >>> >>>> -----Original Message----- >>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com] >>>> Sent: Wednesday, August 19, 2015 11:03 AM >>>> To: Ananyev, Konstantin; Lu, Wenzhuo >>>> Cc: dev@dpdk.org >>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh >>>> above 1 for all NICs but 82598 >>>> >>>> >>>> >>>> On 08/19/15 10:43, Ananyev, Konstantin wrote: >>>>> Hi Vlad, >>>>> Sorry for delay with review, I am OOO till next week. >>>>> Meanwhile, few questions/comments from me. >>>> Hi, Konstantin, long time no see... ;) >>>> >>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a >>>>>> seastar-based >>>>>>>>>> application on x540 NIC. >>>>>>>>> Could you help to share with us how to reproduce the tx hang >>>>>>>>> issue, >>>>>> with using >>>>>>>>> typical DPDK examples? >>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to >>>>>>>> help u >>>>>>>> here. However this is quite irrelevant since without this this >>>>>>>> patch >>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained >>>>>>>> above. >>>>>>>> >>>>>>>> We saw the issue when u stressed the xmit path with a lot of >>>>>>>> highly >>>>>>>> fragmented TCP frames (packets with up to 33 fragments with >>>>>>>> non-headers >>>>>>>> fragments as small as 4 bytes) with all offload features enabled. >>>>> Could you provide us with the pcap file to reproduce the issue? >>>> Well, the thing is it takes some time to reproduce it (a few >>>> minutes of >>>> heavy load) therefore a pcap would be quite large. >>> Probably you can upload it to some place, from which we will be able >>> to download it? >> >> I'll see what I can do but no promises... > > On a second thought pcap file won't help u much since in order to > reproduce the issue u have to reproduce exactly the same structure of > clusters i give to HW and it's not what u see on wire in a TSO case. And not only in a TSO case... ;) > >> >>> Or might be you have some sort of scapy script to generate it? >>> I suppose we'll need something to reproduce the issue and verify the >>> fix. >> >> Since the original code abuses the HW spec u don't have to... ;) >> >>> >>>>> My concern with you approach is that it would affect TX performance. >>>> It certainly will ;) But it seem inevitable. See below. >>>> >>>>> Right now, for simple TX PMD usually reads only >>>>> (nb_tx_desc/tx_rs_thresh) TXDs, >>>>> While with your patch (if I understand it correctly) it has to >>>>> read all TXDs in the HW TX ring. >>>> If by "simple" u refer an always single fragment per Tx packet - >>>> then u >>>> are absolutely correct. >>>> >>>> My initial patch was to only set RS on every EOP descriptor without >>>> changing the rs_thresh value and this patch worked. >>>> However HW spec doesn't ensure in a general case that packets are >>>> always >>>> handled/completion write-back completes in the same order the packets >>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599 >>>> spec for instance). Therefore AFAIU one should not assume that if >>>> packet[x+1] DD bit is set then packet[x] is completed too. >>> From my understanding, TX arbitration controls the order in which >>> TXDs from >>> different queues are fetched/processed. >>> But descriptors from the same TX queue are processed in FIFO order. >>> So, I think that - yes, if TXD[x+1] DD bit is set, then TXD[x] is >>> completed too, >>> and setting RS on every EOP TXD should be enough. >> >> Ok. I'll rework the patch under this assumption then. >> >>> >>>> That's why I changed the patch to be as u see it now. However if I >>>> miss >>>> something here and your HW people ensure the in-order completion >>>> this of >>>> course may be changed back. >>>> >>>>> Even if we really need to setup RS bit in each TXD (I still doubt >>>>> we really do) - , >>>> Well, if u doubt u may ask the guys from the Intel networking division >>>> that wrote the 82599 and x540 HW specs where they clearly state >>>> that. ;) >>> Good point, we'll see what we can do here :) >>> Konstantin >>> >>>>> I think inside PMD it still should be possible to check TX >>>>> completion in chunks. >>>>> Konstantin >>>>> >>>>> >>>>>>>> Thanks, >>>>>>>> vlad >>>>>>>>>> Signed-off-by: Vlad Zolotarov >>>>>>>>>> --- >>>>>>>>>> drivers/net/ixgbe/ixgbe_ethdev.c | 9 +++++++++ >>>>>>>>>> drivers/net/ixgbe/ixgbe_rxtx.c | 23 >>>>>>>>>> ++++++++++++++++++++++- >>>>>>>>>> 2 files changed, 31 insertions(+), 1 deletion(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c >>>>>>>>>> index b8ee1e9..6714fd9 100644 >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c >>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev >>>>>>>>>> *dev, >>>>>>>> struct >>>>>>>>>> rte_eth_dev_info *dev_info) >>>>>>>>>> .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | >>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS, >>>>>>>>>> }; >>>>>>>>>> + >>>>>>>>>> + /* >>>>>>>>>> + * According to 82599 and x540 specifications RS bit >>>>>>>>>> *must* be >>>>>> set on >>>>>>>> the >>>>>>>>>> + * last descriptor of *every* packet. Therefore we will >>>>>>>>>> not allow >>>>>> the >>>>>>>>>> + * tx_rs_thresh above 1 for all NICs newer than 82598. >>>>>>>>>> + */ >>>>>>>>>> + if (hw->mac.type > ixgbe_mac_82598EB) >>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1; >>>>>>>>>> + >>>>>>>>>> dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * >>>>>>>>>> sizeof(uint32_t); >>>>>>>>>> dev_info->reta_size = ETH_RSS_RETA_SIZE_128; >>>>>>>>>> dev_info->flow_type_rss_offloads = >>>>>>>>>> IXGBE_RSS_OFFLOAD_ALL; diff -- >>>>>>>> git >>>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> index >>>>>>>>>> 91023b9..8dbdffc 100644 >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct >>>>>>>>>> rte_eth_dev >>>>>>>>>> *dev, >>>>>>>>>> struct ixgbe_tx_queue *txq; >>>>>>>>>> struct ixgbe_hw *hw; >>>>>>>>>> uint16_t tx_rs_thresh, tx_free_thresh; >>>>>>>>>> + bool rs_deferring_allowed; >>>>>>>>>> >>>>>>>>>> PMD_INIT_FUNC_TRACE(); >>>>>>>>>> hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); >>>>>>>>>> >>>>>>>>>> /* >>>>>>>>>> + * According to 82599 and x540 specifications RS bit >>>>>>>>>> *must* be >>>>>> set on >>>>>>>> the >>>>>>>>>> + * last descriptor of *every* packet. Therefore we will >>>>>>>>>> not allow >>>>>> the >>>>>>>>>> + * tx_rs_thresh above 1 for all NICs newer than 82598. >>>>>>>>>> + */ >>>>>>>>>> + rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB); >>>>>>>>>> + >>>>>>>>>> + /* >>>>>>>>>> * Validate number of transmit descriptors. >>>>>>>>>> * It must not exceed hardware maximum, and must be >>>>>>>>>> multiple >>>>>>>>>> * of IXGBE_ALIGN. >>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct >>>>>>>>>> rte_eth_dev >>>>>>>> *dev, >>>>>>>>>> * to transmit a packet is greater than the number of >>>>>>>>>> free TX >>>>>>>>>> * descriptors. >>>>>>>>>> * The following constraints must be satisfied: >>>>>>>>>> + * tx_rs_thresh must be less than 2 for NICs for which RS >>>>>> deferring is >>>>>>>>>> + * forbidden (all but 82598). >>>>>>>>>> * tx_rs_thresh must be greater than 0. >>>>>>>>>> * tx_rs_thresh must be less than the size of the ring >>>>>>>>>> minus 2. >>>>>>>>>> * tx_rs_thresh must be less than or equal to >>>>>>>>>> tx_free_thresh. >>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct >>>>>>>>>> rte_eth_dev >>>>>>>> *dev, >>>>>>>>>> * When set to zero use default values. >>>>>>>>>> */ >>>>>>>>>> tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ? >>>>>>>>>> - tx_conf->tx_rs_thresh : >>>>>>>>>> DEFAULT_TX_RS_THRESH); >>>>>>>>>> + tx_conf->tx_rs_thresh : >>>>>>>>>> + (rs_deferring_allowed ? >>>>>>>>>> DEFAULT_TX_RS_THRESH : >>>>>> 1)); >>>>>>>>>> tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ? >>>>>>>>>> tx_conf->tx_free_thresh : >>>>>>>>>> DEFAULT_TX_FREE_THRESH); >>>>>>>>>> + >>>>>>>>>> + if (!rs_deferring_allowed && tx_rs_thresh > 1) { >>>>>>>>>> + PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than >>>>>>>>>> 2 since >>>>>> RS >>>>>>>> " >>>>>>>>>> + "must be set for every packet for this >>>>>> HW. " >>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)", >>>>>>>>>> + (unsigned int)tx_rs_thresh, >>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx); >>>>>>>>>> + return -(EINVAL); >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> if (tx_rs_thresh >= (nb_desc - 2)) { >>>>>>>>>> PMD_INIT_LOG(ERR, "tx_rs_thresh must be less >>>>>>>>>> than the >>>>>>>> number " >>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u >>>>>> " >>>>>>>>>> -- >>>>>>>>>> 2.1.0 >> >