DPDK patches and discussions
From: Vlad Zolotarov <vladz@cloudius-systems.com>
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
	"Lu, Wenzhuo" <wenzhuo.lu@intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
Date: Thu, 20 Aug 2015 12:05:22 +0300	[thread overview]
Message-ID: <55D59852.30002@cloudius-systems.com> (raw)
In-Reply-To: <55D59639.7070206@cloudius-systems.com>



On 08/20/15 11:56, Vlad Zolotarov wrote:
>
>
> On 08/20/15 11:41, Ananyev, Konstantin wrote:
>> Hi Vlad,
>>
>>> -----Original Message-----
>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>>> Sent: Wednesday, August 19, 2015 11:03 AM
>>> To: Ananyev, Konstantin; Lu, Wenzhuo
>>> Cc: dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
>>> above 1 for all NICs but 82598
>>>
>>>
>>>
>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>>> Hi Vlad,
>>>> Sorry for the delay with the review, I am OOO till next week.
>>>> Meanwhile, a few questions/comments from me.
>>> Hi, Konstantin, long time no see... ;)
>>>
>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>>>>>>> seastar-based application on x540 NIC.
>>>>>>>> Could you help to share with us how to reproduce the Tx hang issue
>>>>>>>> using the typical DPDK examples?
>>>>>>> Sorry, I'm not very familiar with the typical DPDK examples, so I can't
>>>>>>> help you here. However, this is quite irrelevant, since without this
>>>>>>> patch the ixgbe PMD obviously violates the HW spec, as has been
>>>>>>> explained above.
>>>>>>>
>>>>>>> We saw the issue when we stressed the xmit path with a lot of highly
>>>>>>> fragmented TCP frames (packets with up to 33 fragments, with non-header
>>>>>>> fragments as small as 4 bytes) and all offload features enabled.
>>>> Could you provide us with the pcap file to reproduce the issue?
>>> Well, the thing is that it takes some time to reproduce (a few minutes of
>>> heavy load), so the pcap would be quite large.
>> Perhaps you can upload it to some place from which we will be able to
>> download it?
>
> I'll see what I can do but no promises...

On second thought, a pcap file won't help you much, since in order to 
reproduce the issue you would have to reproduce exactly the same structure of 
clusters I hand to the HW, and that is not what you see on the wire in the 
TSO case.
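
To give a feeling of the shape of those clusters, here is a rough sketch of
how such a packet could be built with the mbuf API (illustrative only, not the
actual seastar code; the helper name, mempool and segment sizes are made up):

/*
 * Illustrative sketch: one TSO packet whose payload is spread over many
 * tiny segments, similar in shape to the clusters described above.
 */
#include <rte_mbuf.h>

static struct rte_mbuf *
build_fragmented_tso_pkt(struct rte_mempool *mp)
{
	struct rte_mbuf *head = rte_pktmbuf_alloc(mp);
	struct rte_mbuf *prev = head;
	int i;

	if (head == NULL)
		return NULL;

	/* First segment: room for Ethernet + IPv4 + TCP headers (filled elsewhere). */
	rte_pktmbuf_append(head, 14 + 20 + 20);

	/* Chain 32 payload segments of only 4 bytes each (33 segments total). */
	for (i = 0; i < 32; i++) {
		struct rte_mbuf *seg = rte_pktmbuf_alloc(mp);

		if (seg == NULL) {
			rte_pktmbuf_free(head);
			return NULL;
		}
		rte_pktmbuf_append(seg, 4);
		prev->next = seg;
		prev = seg;
		head->nb_segs++;
		head->pkt_len += 4;
	}

	/* Ask for TSO + checksum offloads, as in the setup that hit the hang. */
	head->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
	head->l2_len = 14;
	head->l3_len = 20;
	head->l4_len = 20;
	head->tso_segsz = 1448;

	return head;
}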

>
>> Or maybe you have some sort of scapy script to generate it?
>> I suppose we'll need something to reproduce the issue and verify the fix.
>
> Since the original code violates the HW spec, you don't have to... ;)
>
>>
>>>> My concern with your approach is that it would affect TX performance.
>>> It certainly will ;) But it seems inevitable. See below.
>>>
>>>> Right now, for simple TX the PMD usually reads only
>>>> (nb_tx_desc/tx_rs_thresh) TXDs, while with your patch (if I understand it
>>>> correctly) it has to read all the TXDs in the HW TX ring.
>>> If by "simple" you mean always a single fragment per Tx packet, then you
>>> are absolutely correct.
>>>
>>> My initial patch only set RS on every EOP descriptor, without changing the
>>> rs_thresh value, and that patch worked.
>>> However, the HW spec doesn't guarantee in the general case that packets are
>>> handled / completion write-back completes in the same order the packets are
>>> placed on the ring (see the "Tx arbitration schemes" chapter in the 82599
>>> spec, for instance). Therefore, AFAIU, one should not assume that if
>>> packet[x+1]'s DD bit is set then packet[x] is completed too.
>> From my understanding, TX arbitration controls the order in which TXDs from
>> different queues are fetched/processed, but descriptors from the same TX
>> queue are processed in FIFO order.
>> So I think that, yes, if TXD[x+1]'s DD bit is set, then TXD[x] is completed
>> too, and setting RS on every EOP TXD should be enough.
>
> Ok. I'll rework the patch under this assumption then.
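
To illustrate the idea, a toy model of such a scheme (just a sketch, not the
ixgbe code: RS is requested on every EOP descriptor, while SW still polls the
DD bit only once per chunk; the structure and names below are made up):

/*
 * Toy model only: with FIFO completion, DD set on the last descriptor of
 * a chunk implies the whole chunk is done, so SW reads one DD bit per
 * 'free_thresh' descriptors even though RS was set on every EOP TXD.
 */
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 512

struct toy_txd {
	bool dd;          /* stands in for IXGBE_ADVTXD_STAT_DD */
	void *buf;        /* buffer to release once the TXD completes */
};

struct toy_txq {
	struct toy_txd ring[RING_SIZE];
	uint16_t next_dd;      /* index of the last TXD of the next chunk */
	uint16_t free_thresh;  /* chunk size, e.g. 32 */
};

/* Returns the number of descriptors reclaimed (0 if the chunk isn't done). */
static uint16_t
toy_tx_free_chunk(struct toy_txq *q)
{
	uint16_t i, first;

	/* Only one descriptor is examined per call... */
	if (!q->ring[q->next_dd].dd)
		return 0;

	/* ...and FIFO completion implies all earlier ones are done as well. */
	first = (uint16_t)((q->next_dd + RING_SIZE - q->free_thresh + 1) % RING_SIZE);
	for (i = 0; i < q->free_thresh; i++) {
		struct toy_txd *txd = &q->ring[(first + i) % RING_SIZE];

		txd->buf = NULL;   /* release the buffer here */
		txd->dd = false;
	}

	q->next_dd = (uint16_t)((q->next_dd + q->free_thresh) % RING_SIZE);
	return q->free_thresh;
}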
>
>>
>>> That's why I changed the patch to be as you see it now. However, if I'm
>>> missing something here and your HW people confirm in-order completion, this
>>> can of course be changed back.
>>>
>>>> Even if we really need to set the RS bit in each TXD (I still doubt we
>>>> really do) -
>>> Well, if you doubt it, you may ask the guys from the Intel networking
>>> division who wrote the 82599 and x540 HW specs, where they clearly state
>>> that. ;)
>> Good point, we'll see what we can do here :)
>> Konstantin
>>
>>>> I think inside the PMD it should still be possible to check TX completion
>>>> in chunks.
>>>> Konstantin
>>>>
>>>>
>>>>>>> Thanks,
>>>>>>> vlad
>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>>>>>>>>  		.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>>  	};
>>>>>>>>> +
>>>>>>>>> +	/*
>>>>>>>>> +	 * According to 82599 and x540 specifications RS bit *must* be set on the
>>>>>>>>> +	 * last descriptor of *every* packet. Therefore we will not allow the
>>>>>>>>> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>> +	 */
>>>>>>>>> +	if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>>> +		dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>>> +
>>>>>>>>>  	dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>>>>>>>>  	dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>>  	dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;
>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>> index 91023b9..8dbdffc 100644
>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>>  	struct ixgbe_tx_queue *txq;
>>>>>>>>>  	struct ixgbe_hw     *hw;
>>>>>>>>>  	uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>>> +	bool rs_deferring_allowed;
>>>>>>>>>
>>>>>>>>>  	PMD_INIT_FUNC_TRACE();
>>>>>>>>>  	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>>
>>>>>>>>>  	/*
>>>>>>>>> +	 * According to 82599 and x540 specifications RS bit *must* be set on the
>>>>>>>>> +	 * last descriptor of *every* packet. Therefore we will not allow the
>>>>>>>>> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>> +	 */
>>>>>>>>> +	rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>>> +
>>>>>>>>> +	/*
>>>>>>>>>  	 * Validate number of transmit descriptors.
>>>>>>>>>  	 * It must not exceed hardware maximum, and must be multiple
>>>>>>>>>  	 * of IXGBE_ALIGN.
>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>>  	 * to transmit a packet is greater than the number of free TX
>>>>>>>>>  	 * descriptors.
>>>>>>>>>  	 * The following constraints must be satisfied:
>>>>>>>>> +	 *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is
>>>>>>>>> +	 *  forbidden (all but 82598).
>>>>>>>>>  	 *  tx_rs_thresh must be greater than 0.
>>>>>>>>>  	 *  tx_rs_thresh must be less than the size of the ring minus 2.
>>>>>>>>>  	 *  tx_rs_thresh must be less than or equal to tx_free_thresh.
>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>>  	 * When set to zero use default values.
>>>>>>>>>  	 */
>>>>>>>>>  	tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>>> -			tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>>>>> +			tx_conf->tx_rs_thresh :
>>>>>>>>> +			(rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
>>>>>>>>>  	tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>>  			tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>>>>>>>>> +
>>>>>>>>> +	if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>>> +		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS "
>>>>>>>>> +				  "must be set for every packet for this HW. "
>>>>>>>>> +				  "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>>> +			     (unsigned int)tx_rs_thresh,
>>>>>>>>> +			     (int)dev->data->port_id, (int)queue_idx);
>>>>>>>>> +		return -(EINVAL);
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>>  	if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>>  		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the number "
>>>>>>>>>  			     "of TX descriptors minus 2. (tx_rs_thresh=%u "
>>>>>>>>> -- 
>>>>>>>>> 2.1.0
>
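
For completeness, on the application side the lowered default would be picked
up roughly like this (an illustrative sketch only; error handling is omitted
and the queue/descriptor numbers are arbitrary):

/*
 * Illustrative only: start the Tx queue configuration from the defaults
 * advertised by the PMD, so a default_txconf.tx_rs_thresh of 1 (as set by
 * the patch above for 82599/x540) is applied automatically.
 */
#include <rte_ethdev.h>

static int
setup_tx_queue(uint8_t port_id, uint16_t queue_id, unsigned int socket_id)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_txconf txconf;

	rte_eth_dev_info_get(port_id, &dev_info);

	/* Use the PMD defaults; on 82599/x540 tx_rs_thresh is already 1 here. */
	txconf = dev_info.default_txconf;

	return rte_eth_tx_queue_setup(port_id, queue_id, 512, socket_id, &txconf);
}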

  reply	other threads:[~2015-08-20  9:05 UTC|newest]

Thread overview: 48+ messages
2015-08-13 18:06 Vlad Zolotarov
2015-08-13 20:28 ` Zhang, Helin
2015-08-14  5:37   ` Vlad Zolotarov
2015-08-19  0:42     ` Lu, Wenzhuo
2015-08-19  4:55       ` Vladislav Zolotarov
2015-08-19  7:43         ` Ananyev, Konstantin
2015-08-19 10:02           ` Vlad Zolotarov
2015-08-20  8:41             ` Ananyev, Konstantin
2015-08-20  8:56               ` Vlad Zolotarov
2015-08-20  9:05                 ` Vlad Zolotarov [this message]
2015-08-20  9:06                   ` Vlad Zolotarov
2015-08-25 17:33                     ` Ananyev, Konstantin
2015-08-25 17:39                       ` Avi Kivity
2015-08-19 17:29         ` Zhang, Helin
2015-08-25 18:13           ` Zhang, Helin
2015-08-25 18:33             ` Vladislav Zolotarov
2015-08-25 18:43               ` Zhang, Helin
2015-08-25 18:52                 ` Vlad Zolotarov
2015-08-25 19:16                   ` Zhang, Helin
2015-08-25 19:23                     ` Avi Kivity
2015-08-25 19:30                     ` Vladislav Zolotarov
2015-08-25 20:07                       ` Vlad Zolotarov
2015-08-25 20:13                       ` Zhang, Helin
2015-09-09 12:18                         ` Thomas Monjalon
2015-09-09 13:19                           ` Ananyev, Konstantin
2015-09-11 15:17                             ` Vladislav Zolotarov
2015-09-11 14:25                   ` didier.pallard
2015-09-11 14:47                     ` Avi Kivity
2015-09-11 14:55                       ` Thomas Monjalon
2015-09-11 15:12                         ` Vladislav Zolotarov
2015-09-11 15:43                           ` Avi Kivity
2015-09-11 16:04                             ` Vladislav Zolotarov
2015-09-11 16:07                               ` Richardson, Bruce
2015-09-11 16:14                                 ` Vladislav Zolotarov
2015-09-11 17:44                                 ` Avi Kivity
2015-09-11 16:08                             ` Thomas Monjalon
2015-09-11 16:18                               ` Vladislav Zolotarov
2015-09-11 17:17                                 ` Matthew Hall
2015-09-11 17:42                                   ` Ananyev, Konstantin
2015-09-11 17:58                                     ` Matthew Hall
2015-09-11 17:48                               ` Avi Kivity
2015-09-13 11:47                                 ` Ananyev, Konstantin
2015-09-13 12:24                                   ` Vlad Zolotarov
2015-09-13 12:32                                   ` Avi Kivity
2015-09-13 15:54                                     ` Ananyev, Konstantin
2015-09-13 16:01                                       ` Avi Kivity
2015-09-11 16:00                           ` Richardson, Bruce
2015-09-11 16:13                             ` Vladislav Zolotarov
