DPDK patches and discussions
* [dpdk-dev] RFC: i40e xmit path HW limitation
@ 2015-07-30 14:57 Vlad Zolotarov
  2015-07-30 16:10 ` [dpdk-dev] " Zhang, Helin
  2015-07-30 16:17 ` [dpdk-dev] RFC: " Stephen Hemminger
  0 siblings, 2 replies; 13+ messages in thread
From: Vlad Zolotarov @ 2015-07-30 14:57 UTC (permalink / raw)
  To: dev, Ananyev, Konstantin, Helin Zhang

Hi, Konstantin, Helin,
there is a documented limitation of the xl710 controllers (i40e driver) 
which is not handled in any way by the DPDK driver.
From the datasheet, chapter 8.4.1:

"• A single transmit packet may span up to 8 buffers (up to 8 data descriptors per packet including
both the header and payload buffers).
• The total number of data descriptors for the whole TSO (explained later on in this chapter) is
unlimited as long as each segment within the TSO obeys the previous rule (up to 8 data descriptors
per segment for both the TSO header and the segment payload buffers)."

This means that, for instance, a long cluster with small fragments has to 
be linearized before it may be placed on the HW ring.
In more standard environments like Linux or FreeBSD drivers the solution 
is straightforward - call skb_linearize()/m_collapse() respectively.
In a non-conformist environment like DPDK, life is not that easy - there 
is no easy way to collapse the cluster into a linear buffer from inside 
the device driver, since the device driver doesn't allocate memory on the 
fast path and uses only the user-allocated pools.

Here are two proposals for a solution:

 1. We may provide a callback that would return TRUE to the user if a given
    cluster has to be linearized, and it should always be called before
    rte_eth_tx_burst(). Alternatively it may be called from inside
    rte_eth_tx_burst(), and rte_eth_tx_burst() is changed to return an
    error code for the case when one of the clusters it is given has to be
    linearized (a rough sketch of such a check is given below).
 2. Another option is to allocate a mempool in the driver with the
    elements consuming a single page each (standard 2KB buffers would
    do). The number of elements in the pool should be the Tx ring length
    multiplied by "64KB/(linear data length of the buffer in the pool
    above)". Here I use 64KB as the maximum packet length, not taking
    into account esoteric things like the "Giant" TSO mentioned in the
    spec above. Then we may actually go and linearize the cluster if
    needed on top of the buffers from the pool above, post the buffer
    from the mempool above on the HW ring, link the original cluster to
    that new cluster (using the private data) and release it when the
    send is done.


The first is an API change and would require some additional handling 
(linearization) from the application. The second would require some 
additional memory but would keep all the dirty details inside the driver 
and leave the rest of the code intact.
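
To make the first proposal more concrete, here is a minimal sketch of the
non-TSO check such a callback could perform. The function name and the
constant are illustrative assumptions derived from the datasheet quote
above, not an existing DPDK API:

#include <rte_mbuf.h>

/* Assumed from the datasheet: at most 8 data descriptors per packet. */
#define I40E_TX_MAX_SEG_PER_PKT 8

/* Returns non-zero if a non-TSO mbuf chain would have to be linearized
 * before it may be posted on the xl710 Tx ring. */
static inline int
i40e_tx_pkt_needs_linearize(const struct rte_mbuf *pkt)
{
	return pkt->nb_segs > I40E_TX_MAX_SEG_PER_PKT;
}

The application would call it per cluster before rte_eth_tx_burst() and
linearize (or drop) the offending ones. The TSO case is more involved,
since there the 8-descriptor rule applies per MSS-sized segment rather
than per packet.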

Pls., comment.

thanks,
vlad

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] i40e xmit path HW limitation
  2015-07-30 14:57 [dpdk-dev] RFC: i40e xmit path HW limitation Vlad Zolotarov
@ 2015-07-30 16:10 ` Zhang, Helin
  2015-07-30 16:44   ` Vlad Zolotarov
  2015-07-30 16:17 ` [dpdk-dev] RFC: " Stephen Hemminger
  1 sibling, 1 reply; 13+ messages in thread
From: Zhang, Helin @ 2015-07-30 16:10 UTC (permalink / raw)
  To: Vlad Zolotarov, Ananyev, Konstantin; +Cc: dev



> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Thursday, July 30, 2015 7:58 AM
> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
> Subject: RFC: i40e xmit path HW limitation
> 
> Hi, Konstantin, Helin,
> there is a documented limitation of xl710 controllers (i40e driver) which is not
> handled in any way by a DPDK driver.
>  From the datasheet chapter 8.4.1:
> 
> "• A single transmit packet may span up to 8 buffers (up to 8 data descriptors per
> packet including both the header and payload buffers).
> • The total number of data descriptors for the whole TSO (explained later on in
> this chapter) is unlimited as long as each segment within the TSO obeys the
> previous rule (up to 8 data descriptors per segment for both the TSO header and
> the segment payload buffers)."
Yes, I remember the RX side supports just 5 segments per received packet.
But what's the possible issue you are thinking about?

> 
> This means that, for instance, long cluster with small fragments has to be
> linearized before it may be placed on the HW ring.
What size are the small fragments? Basically 2KB is the default mbuf size in most
example applications. 2KB x 8 is bigger than 1.5KB, so it is enough for the maximum
packet size we support.
If a 1KB mbuf is used, don't expect it to transmit packets larger than 8KB.

> In more standard environments like Linux or FreeBSD drivers the solution is
> straight forward - call skb_linearize()/m_collapse() corresponding.
> In the non-conformist environment like DPDK life is not that easy - there is no
> easy way to collapse the cluster into a linear buffer from inside the device driver
> since device driver doesn't allocate memory in a fast path and utilizes the user
> allocated pools only.

> 
> Here are two proposals for a solution:
> 
>  1. We may provide a callback that would return a user TRUE if a give
>     cluster has to be linearized and it should always be called before
>     rte_eth_tx_burst(). Alternatively it may be called from inside the
>     rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some
>     error code for a case when one of the clusters it's given has to be
>     linearized.
>  2. Another option is to allocate a mempool in the driver with the
>     elements consuming a single page each (standard 2KB buffers would
>     do). Number of elements in the pool should be as Tx ring length
>     multiplied by "64KB/(linear data length of the buffer in the pool
>     above)". Here I use 64KB as a maximum packet length and not taking
>     into an account esoteric things like "Giant" TSO mentioned in the
>     spec above. Then we may actually go and linearize the cluster if
>     needed on top of the buffers from the pool above, post the buffer
>     from the mempool above on the HW ring, link the original cluster to
>     that new cluster (using the private data) and release it when the
>     send is done.
> 
> 
> The first is a change in the API and would require from the application some
> additional handling (linearization). The second would require some additional
> memory but would keep all dirty details inside the driver and would leave the
> rest of the code intact.
> 
> Pls., comment.
> 
> thanks,
> vlad
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] RFC: i40e xmit path HW limitation
  2015-07-30 14:57 [dpdk-dev] RFC: i40e xmit path HW limitation Vlad Zolotarov
  2015-07-30 16:10 ` [dpdk-dev] " Zhang, Helin
@ 2015-07-30 16:17 ` Stephen Hemminger
  2015-07-30 16:20   ` Avi Kivity
  1 sibling, 1 reply; 13+ messages in thread
From: Stephen Hemminger @ 2015-07-30 16:17 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Thu, 30 Jul 2015 17:57:33 +0300
Vlad Zolotarov <vladz@cloudius-systems.com> wrote:

> Hi, Konstantin, Helin,
> there is a documented limitation of xl710 controllers (i40e driver) 
> which is not handled in any way by a DPDK driver.
>  From the datasheet chapter 8.4.1:
> 
> "• A single transmit packet may span up to 8 buffers (up to 8 data descriptors per packet including
> both the header and payload buffers).
> • The total number of data descriptors for the whole TSO (explained later on in this chapter) is
> unlimited as long as each segment within the TSO obeys the previous rule (up to 8 data descriptors
> per segment for both the TSO header and the segment payload buffers)."
> 
> This means that, for instance, long cluster with small fragments has to 
> be linearized before it may be placed on the HW ring.
> In more standard environments like Linux or FreeBSD drivers the solution 
> is straight forward - call skb_linearize()/m_collapse() corresponding.
> In the non-conformist environment like DPDK life is not that easy - 
> there is no easy way to collapse the cluster into a linear buffer from 
> inside the device driver
> since device driver doesn't allocate memory in a fast path and utilizes 
> the user allocated pools only.
> 
> Here are two proposals for a solution:
> 
>  1. We may provide a callback that would return a user TRUE if a give
>     cluster has to be linearized and it should always be called before
>     rte_eth_tx_burst(). Alternatively it may be called from inside the
>     rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some
>     error code for a case when one of the clusters it's given has to be
>     linearized.
>  2. Another option is to allocate a mempool in the driver with the
>     elements consuming a single page each (standard 2KB buffers would
>     do). Number of elements in the pool should be as Tx ring length
>     multiplied by "64KB/(linear data length of the buffer in the pool
>     above)". Here I use 64KB as a maximum packet length and not taking
>     into an account esoteric things like "Giant" TSO mentioned in the
>     spec above. Then we may actually go and linearize the cluster if
>     needed on top of the buffers from the pool above, post the buffer
>     from the mempool above on the HW ring, link the original cluster to
>     that new cluster (using the private data) and release it when the
>     send is done.

Or just silently drop heavily scattered packets (and increment oerrors)
with a PMD_TX_LOG debug message.

I think a DPDK driver doesn't have to accept all possible mbufs and do
extra work. It seems reasonable to expect the caller to be well behaved
in this restricted ecosystem.
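
As a rough sketch of that suggestion inside the i40e transmit loop
(illustrative only - the I40E_TX_MAX_SEG constant, the txq counter and the
surrounding variable names are assumptions, not code from the real PMD):

	/* Hypothetical drop path for over-scattered non-TSO packets. */
	if (tx_pkt->nb_segs > I40E_TX_MAX_SEG) {
		PMD_TX_LOG(DEBUG, "dropping packet with %d segments",
			   tx_pkt->nb_segs);
		rte_pktmbuf_free(tx_pkt);
		txq->tx_dropped++;	/* surfaced to the app as oerrors */
		continue;
	}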

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] RFC: i40e xmit path HW limitation
  2015-07-30 16:17 ` [dpdk-dev] RFC: " Stephen Hemminger
@ 2015-07-30 16:20   ` Avi Kivity
  2015-07-30 16:50     ` Vlad Zolotarov
  0 siblings, 1 reply; 13+ messages in thread
From: Avi Kivity @ 2015-07-30 16:20 UTC (permalink / raw)
  To: Stephen Hemminger, Vlad Zolotarov; +Cc: dev



On 07/30/2015 07:17 PM, Stephen Hemminger wrote:
> On Thu, 30 Jul 2015 17:57:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>> Hi, Konstantin, Helin,
>> there is a documented limitation of xl710 controllers (i40e driver)
>> which is not handled in any way by a DPDK driver.
>>   From the datasheet chapter 8.4.1:
>>
>> "• A single transmit packet may span up to 8 buffers (up to 8 data descriptors per packet including
>> both the header and payload buffers).
>> • The total number of data descriptors for the whole TSO (explained later on in this chapter) is
>> unlimited as long as each segment within the TSO obeys the previous rule (up to 8 data descriptors
>> per segment for both the TSO header and the segment payload buffers)."
>>
>> This means that, for instance, long cluster with small fragments has to
>> be linearized before it may be placed on the HW ring.
>> In more standard environments like Linux or FreeBSD drivers the solution
>> is straight forward - call skb_linearize()/m_collapse() corresponding.
>> In the non-conformist environment like DPDK life is not that easy -
>> there is no easy way to collapse the cluster into a linear buffer from
>> inside the device driver
>> since device driver doesn't allocate memory in a fast path and utilizes
>> the user allocated pools only.
>>
>> Here are two proposals for a solution:
>>
>>   1. We may provide a callback that would return a user TRUE if a give
>>      cluster has to be linearized and it should always be called before
>>      rte_eth_tx_burst(). Alternatively it may be called from inside the
>>      rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some
>>      error code for a case when one of the clusters it's given has to be
>>      linearized.
>>   2. Another option is to allocate a mempool in the driver with the
>>      elements consuming a single page each (standard 2KB buffers would
>>      do). Number of elements in the pool should be as Tx ring length
>>      multiplied by "64KB/(linear data length of the buffer in the pool
>>      above)". Here I use 64KB as a maximum packet length and not taking
>>      into an account esoteric things like "Giant" TSO mentioned in the
>>      spec above. Then we may actually go and linearize the cluster if
>>      needed on top of the buffers from the pool above, post the buffer
>>      from the mempool above on the HW ring, link the original cluster to
>>      that new cluster (using the private data) and release it when the
>>      send is done.
> Or just silently drop heavily scattered packets (and increment oerrors)
> with a PMD_TX_LOG debug message.
>
> I think a DPDK driver doesn't have to accept all possible mbufs and do
> extra work. It seems reasonable to expect caller to be well behaved
> in this restricted ecosystem.
>

How can the caller know what's well behaved?  It's device dependent.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] i40e xmit path HW limitation
  2015-07-30 16:10 ` [dpdk-dev] " Zhang, Helin
@ 2015-07-30 16:44   ` Vlad Zolotarov
  2015-07-30 17:33     ` Zhang, Helin
  0 siblings, 1 reply; 13+ messages in thread
From: Vlad Zolotarov @ 2015-07-30 16:44 UTC (permalink / raw)
  To: Zhang, Helin, Ananyev, Konstantin; +Cc: dev



On 07/30/15 19:10, Zhang, Helin wrote:
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Thursday, July 30, 2015 7:58 AM
>> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
>> Subject: RFC: i40e xmit path HW limitation
>>
>> Hi, Konstantin, Helin,
>> there is a documented limitation of xl710 controllers (i40e driver) which is not
>> handled in any way by a DPDK driver.
>>   From the datasheet chapter 8.4.1:
>>
>> "• A single transmit packet may span up to 8 buffers (up to 8 data descriptors per
>> packet including both the header and payload buffers).
>> • The total number of data descriptors for the whole TSO (explained later on in
>> this chapter) is unlimited as long as each segment within the TSO obeys the
>> previous rule (up to 8 data descriptors per segment for both the TSO header and
>> the segment payload buffers)."
> Yes, I remember the RX side just supports 5 segments per packet receiving.
> But what's the possible issue you thought about?
Note that it's the Tx side we are talking about.

See commit 30520831f058cd9d75c0f6b360bc5c5ae49b5f27 in the linux net-next repo.
If such a cluster arrives and you post it on the HW ring - the HW will shut 
this HW ring down permanently. The application will see that its ring is 
stuck.

>
>> This means that, for instance, long cluster with small fragments has to be
>> linearized before it may be placed on the HW ring.
> What type of size of the small fragments? Basically 2KB is the default size of mbuf of most
> example applications. 2KB x 8 is bigger than 1.5KB. So it is enough for the maximum
> packet size we supported.
> If 1KB mbuf is used, don't expect it can transmit more than 8KB size of packet.

I'm kind of lost here. Again, we are talking about the Tx side, and the 
buffers are not necessarily completely filled. Namely, there may be a 
cluster with 15 fragments of 100 bytes each.

>
>> In more standard environments like Linux or FreeBSD drivers the solution is
>> straight forward - call skb_linearize()/m_collapse() corresponding.
>> In the non-conformist environment like DPDK life is not that easy - there is no
>> easy way to collapse the cluster into a linear buffer from inside the device driver
>> since device driver doesn't allocate memory in a fast path and utilizes the user
>> allocated pools only.
>> Here are two proposals for a solution:
>>
>>   1. We may provide a callback that would return a user TRUE if a give
>>      cluster has to be linearized and it should always be called before
>>      rte_eth_tx_burst(). Alternatively it may be called from inside the
>>      rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some
>>      error code for a case when one of the clusters it's given has to be
>>      linearized.
>>   2. Another option is to allocate a mempool in the driver with the
>>      elements consuming a single page each (standard 2KB buffers would
>>      do). Number of elements in the pool should be as Tx ring length
>>      multiplied by "64KB/(linear data length of the buffer in the pool
>>      above)". Here I use 64KB as a maximum packet length and not taking
>>      into an account esoteric things like "Giant" TSO mentioned in the
>>      spec above. Then we may actually go and linearize the cluster if
>>      needed on top of the buffers from the pool above, post the buffer
>>      from the mempool above on the HW ring, link the original cluster to
>>      that new cluster (using the private data) and release it when the
>>      send is done.
>>
>>
>> The first is a change in the API and would require from the application some
>> additional handling (linearization). The second would require some additional
>> memory but would keep all dirty details inside the driver and would leave the
>> rest of the code intact.
>>
>> Pls., comment.
>>
>> thanks,
>> vlad
>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] RFC: i40e xmit path HW limitation
  2015-07-30 16:20   ` Avi Kivity
@ 2015-07-30 16:50     ` Vlad Zolotarov
  2015-07-30 17:01       ` Stephen Hemminger
  0 siblings, 1 reply; 13+ messages in thread
From: Vlad Zolotarov @ 2015-07-30 16:50 UTC (permalink / raw)
  To: Avi Kivity, Stephen Hemminger; +Cc: dev



On 07/30/15 19:20, Avi Kivity wrote:
>
>
> On 07/30/2015 07:17 PM, Stephen Hemminger wrote:
>> On Thu, 30 Jul 2015 17:57:33 +0300
>> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>>
>>> Hi, Konstantin, Helin,
>>> there is a documented limitation of xl710 controllers (i40e driver)
>>> which is not handled in any way by a DPDK driver.
>>>   From the datasheet chapter 8.4.1:
>>>
>>> "• A single transmit packet may span up to 8 buffers (up to 8 data 
>>> descriptors per packet including
>>> both the header and payload buffers).
>>> • The total number of data descriptors for the whole TSO (explained 
>>> later on in this chapter) is
>>> unlimited as long as each segment within the TSO obeys the previous 
>>> rule (up to 8 data descriptors
>>> per segment for both the TSO header and the segment payload buffers)."
>>>
>>> This means that, for instance, long cluster with small fragments has to
>>> be linearized before it may be placed on the HW ring.
>>> In more standard environments like Linux or FreeBSD drivers the 
>>> solution
>>> is straight forward - call skb_linearize()/m_collapse() corresponding.
>>> In the non-conformist environment like DPDK life is not that easy -
>>> there is no easy way to collapse the cluster into a linear buffer from
>>> inside the device driver
>>> since device driver doesn't allocate memory in a fast path and utilizes
>>> the user allocated pools only.
>>>
>>> Here are two proposals for a solution:
>>>
>>>   1. We may provide a callback that would return a user TRUE if a give
>>>      cluster has to be linearized and it should always be called before
>>>      rte_eth_tx_burst(). Alternatively it may be called from inside the
>>>      rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return 
>>> some
>>>      error code for a case when one of the clusters it's given has 
>>> to be
>>>      linearized.
>>>   2. Another option is to allocate a mempool in the driver with the
>>>      elements consuming a single page each (standard 2KB buffers would
>>>      do). Number of elements in the pool should be as Tx ring length
>>>      multiplied by "64KB/(linear data length of the buffer in the pool
>>>      above)". Here I use 64KB as a maximum packet length and not taking
>>>      into an account esoteric things like "Giant" TSO mentioned in the
>>>      spec above. Then we may actually go and linearize the cluster if
>>>      needed on top of the buffers from the pool above, post the buffer
>>>      from the mempool above on the HW ring, link the original 
>>> cluster to
>>>      that new cluster (using the private data) and release it when the
>>>      send is done.
>> Or just silently drop heavily scattered packets (and increment oerrors)
>> with a PMD_TX_LOG debug message.
>>
>> I think a DPDK driver doesn't have to accept all possible mbufs and do
>> extra work. It seems reasonable to expect caller to be well behaved
>> in this restricted ecosystem.
>>
>
> How can the caller know what's well behaved?  It's device dependent.

+1

Stephen, how do you imagine this well-behaved application? Having a 
switch-case on the underlying device type and then "well-behaving" 
correspondingly?
Not to mention that to "well-behave" the application writer has to read 
the HW specs and understand them, which would limit the number of DPDK 
developers to a very small group of people... ;) Not to mention that 
the above-mentioned switch-case would be a super ugly thing to find in 
an application and would raise a big question about the justification 
of DPDK's existence as an SDK providing a device driver interface. ;)

>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] RFC: i40e xmit path HW limitation
  2015-07-30 16:50     ` Vlad Zolotarov
@ 2015-07-30 17:01       ` Stephen Hemminger
  2015-07-30 17:14         ` Vlad Zolotarov
  2015-07-30 17:22         ` Avi Kivity
  0 siblings, 2 replies; 13+ messages in thread
From: Stephen Hemminger @ 2015-07-30 17:01 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Thu, 30 Jul 2015 19:50:27 +0300
Vlad Zolotarov <vladz@cloudius-systems.com> wrote:

> 
> 
> On 07/30/15 19:20, Avi Kivity wrote:
> >
> >
> > On 07/30/2015 07:17 PM, Stephen Hemminger wrote:
> >> On Thu, 30 Jul 2015 17:57:33 +0300
> >> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> >>
> >>> Hi, Konstantin, Helin,
> >>> there is a documented limitation of xl710 controllers (i40e driver)
> >>> which is not handled in any way by a DPDK driver.
> >>>   From the datasheet chapter 8.4.1:
> >>>
> >>> "• A single transmit packet may span up to 8 buffers (up to 8 data 
> >>> descriptors per packet including
> >>> both the header and payload buffers).
> >>> • The total number of data descriptors for the whole TSO (explained 
> >>> later on in this chapter) is
> >>> unlimited as long as each segment within the TSO obeys the previous 
> >>> rule (up to 8 data descriptors
> >>> per segment for both the TSO header and the segment payload buffers)."
> >>>
> >>> This means that, for instance, long cluster with small fragments has to
> >>> be linearized before it may be placed on the HW ring.
> >>> In more standard environments like Linux or FreeBSD drivers the 
> >>> solution
> >>> is straight forward - call skb_linearize()/m_collapse() corresponding.
> >>> In the non-conformist environment like DPDK life is not that easy -
> >>> there is no easy way to collapse the cluster into a linear buffer from
> >>> inside the device driver
> >>> since device driver doesn't allocate memory in a fast path and utilizes
> >>> the user allocated pools only.
> >>>
> >>> Here are two proposals for a solution:
> >>>
> >>>   1. We may provide a callback that would return a user TRUE if a give
> >>>      cluster has to be linearized and it should always be called before
> >>>      rte_eth_tx_burst(). Alternatively it may be called from inside the
> >>>      rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return 
> >>> some
> >>>      error code for a case when one of the clusters it's given has 
> >>> to be
> >>>      linearized.
> >>>   2. Another option is to allocate a mempool in the driver with the
> >>>      elements consuming a single page each (standard 2KB buffers would
> >>>      do). Number of elements in the pool should be as Tx ring length
> >>>      multiplied by "64KB/(linear data length of the buffer in the pool
> >>>      above)". Here I use 64KB as a maximum packet length and not taking
> >>>      into an account esoteric things like "Giant" TSO mentioned in the
> >>>      spec above. Then we may actually go and linearize the cluster if
> >>>      needed on top of the buffers from the pool above, post the buffer
> >>>      from the mempool above on the HW ring, link the original 
> >>> cluster to
> >>>      that new cluster (using the private data) and release it when the
> >>>      send is done.
> >> Or just silently drop heavily scattered packets (and increment oerrors)
> >> with a PMD_TX_LOG debug message.
> >>
> >> I think a DPDK driver doesn't have to accept all possible mbufs and do
> >> extra work. It seems reasonable to expect caller to be well behaved
> >> in this restricted ecosystem.
> >>
> >
> > How can the caller know what's well behaved?  It's device dependent.
> 
> +1
> 
> Stephen, how do you imagine this well-behaved application? Having switch 
> case by an underlying device type and then "well-behaving" correspondingly?
> Not to mention that to "well-behave" the application writer has to read 
> HW specs and understand them, which would limit the amount of DPDK 
> developers to a very small amount of people... ;) Not to mention that 
> the mentioned above switch-case would be a super ugly thing to be found 
> in an application that would raise a big question about the 
> justification of a DPDK existence as as SDK providing device drivers 
> interface. ;)

Either have a RTE_MAX_MBUF_SEGMENTS that is global, or
an mbuf_linearize function?  The driver can already stash the
mbuf pool used for Rx and reuse it for the transient Tx buffers.
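
A minimal sketch of such an mbuf_linearize() helper, assuming a pool whose
data room is large enough for the whole packet is available (no such
function existed in DPDK at the time; everything below is illustrative):

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Copy a multi-segment mbuf chain into a single buffer taken from "pool".
 * Offload metadata handling is omitted for brevity. */
static struct rte_mbuf *
mbuf_linearize(struct rte_mbuf *pkt, struct rte_mempool *pool)
{
	struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
	const struct rte_mbuf *seg;

	if (m == NULL)
		return NULL;
	if (rte_pktmbuf_tailroom(m) < pkt->pkt_len) {
		rte_pktmbuf_free(m);
		return NULL;
	}

	for (seg = pkt; seg != NULL; seg = seg->next)
		rte_memcpy(rte_pktmbuf_append(m, seg->data_len),
			   rte_pktmbuf_mtod(seg, const void *),
			   seg->data_len);

	m->ol_flags = pkt->ol_flags;
	rte_pktmbuf_free(pkt);
	return m;
}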

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] RFC: i40e xmit path HW limitation
  2015-07-30 17:01       ` Stephen Hemminger
@ 2015-07-30 17:14         ` Vlad Zolotarov
  2015-07-30 17:22         ` Avi Kivity
  1 sibling, 0 replies; 13+ messages in thread
From: Vlad Zolotarov @ 2015-07-30 17:14 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev



On 07/30/15 20:01, Stephen Hemminger wrote:
> On Thu, 30 Jul 2015 19:50:27 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 07/30/15 19:20, Avi Kivity wrote:
>>>
>>> On 07/30/2015 07:17 PM, Stephen Hemminger wrote:
>>>> On Thu, 30 Jul 2015 17:57:33 +0300
>>>> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>>>>
>>>>> Hi, Konstantin, Helin,
>>>>> there is a documented limitation of xl710 controllers (i40e driver)
>>>>> which is not handled in any way by a DPDK driver.
>>>>>    From the datasheet chapter 8.4.1:
>>>>>
>>>>> "• A single transmit packet may span up to 8 buffers (up to 8 data
>>>>> descriptors per packet including
>>>>> both the header and payload buffers).
>>>>> • The total number of data descriptors for the whole TSO (explained
>>>>> later on in this chapter) is
>>>>> unlimited as long as each segment within the TSO obeys the previous
>>>>> rule (up to 8 data descriptors
>>>>> per segment for both the TSO header and the segment payload buffers)."
>>>>>
>>>>> This means that, for instance, long cluster with small fragments has to
>>>>> be linearized before it may be placed on the HW ring.
>>>>> In more standard environments like Linux or FreeBSD drivers the
>>>>> solution
>>>>> is straight forward - call skb_linearize()/m_collapse() corresponding.
>>>>> In the non-conformist environment like DPDK life is not that easy -
>>>>> there is no easy way to collapse the cluster into a linear buffer from
>>>>> inside the device driver
>>>>> since device driver doesn't allocate memory in a fast path and utilizes
>>>>> the user allocated pools only.
>>>>>
>>>>> Here are two proposals for a solution:
>>>>>
>>>>>    1. We may provide a callback that would return a user TRUE if a give
>>>>>       cluster has to be linearized and it should always be called before
>>>>>       rte_eth_tx_burst(). Alternatively it may be called from inside the
>>>>>       rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return
>>>>> some
>>>>>       error code for a case when one of the clusters it's given has
>>>>> to be
>>>>>       linearized.
>>>>>    2. Another option is to allocate a mempool in the driver with the
>>>>>       elements consuming a single page each (standard 2KB buffers would
>>>>>       do). Number of elements in the pool should be as Tx ring length
>>>>>       multiplied by "64KB/(linear data length of the buffer in the pool
>>>>>       above)". Here I use 64KB as a maximum packet length and not taking
>>>>>       into an account esoteric things like "Giant" TSO mentioned in the
>>>>>       spec above. Then we may actually go and linearize the cluster if
>>>>>       needed on top of the buffers from the pool above, post the buffer
>>>>>       from the mempool above on the HW ring, link the original
>>>>> cluster to
>>>>>       that new cluster (using the private data) and release it when the
>>>>>       send is done.
>>>> Or just silently drop heavily scattered packets (and increment oerrors)
>>>> with a PMD_TX_LOG debug message.
>>>>
>>>> I think a DPDK driver doesn't have to accept all possible mbufs and do
>>>> extra work. It seems reasonable to expect caller to be well behaved
>>>> in this restricted ecosystem.
>>>>
>>> How can the caller know what's well behaved?  It's device dependent.
>> +1
>>
>> Stephen, how do you imagine this well-behaved application? Having switch
>> case by an underlying device type and then "well-behaving" correspondingly?
>> Not to mention that to "well-behave" the application writer has to read
>> HW specs and understand them, which would limit the amount of DPDK
>> developers to a very small amount of people... ;) Not to mention that
>> the mentioned above switch-case would be a super ugly thing to be found
>> in an application that would raise a big question about the
>> justification of a DPDK existence as as SDK providing device drivers
>> interface. ;)
> Either have a RTE_MAX_MBUF_SEGMENTS

And what would it be in our case? 8? This would limit the maximum TSO 
packet to 16KB for 2KB buffers.

> that is global or
> a mbuf_linearize function?  Driver already can stash the
> mbuf pool used for Rx and reuse it for the transient Tx buffers.
First of all, who can guarantee that this pool would meet our needs - 
namely, have large enough buffers?
Secondly, using the user's Rx mempool for that would be really not nice 
(read - dirty) towards the user, who may have allocated a specific 
number of buffers in it according to some calculations that didn't 
include the usage from the Tx flow.

And lastly, and most importantly, this would require using atomic 
operations when accessing the Rx mempool, which would both require a 
specific mempool initialization and significantly hit the performance.
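
For completeness, a dedicated per-Tx-queue pool along the lines of the
second proposal side-steps all three concerns. A rough sizing sketch
(the names, the cache size and the nb_tx_desc variable are assumptions;
a real implementation would also want single-producer/single-consumer
semantics to avoid the atomics mentioned above):

#include <rte_mbuf.h>
#include <rte_lcore.h>

/* Worst case from the RFC: a ring full of 64KB packets, each linearized
 * into standard 2KB buffers. */
#define LIN_BUF_LEN	2048
#define MAX_TSO_LEN	(64 * 1024)

unsigned int n_elts = nb_tx_desc * (MAX_TSO_LEN / LIN_BUF_LEN);
struct rte_mempool *lin_pool = rte_pktmbuf_pool_create(
	"i40e_tx_lin_pool", n_elts, 256 /* cache */, 0,
	LIN_BUF_LEN + RTE_PKTMBUF_HEADROOM, rte_socket_id());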


>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] RFC: i40e xmit path HW limitation
  2015-07-30 17:01       ` Stephen Hemminger
  2015-07-30 17:14         ` Vlad Zolotarov
@ 2015-07-30 17:22         ` Avi Kivity
  1 sibling, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2015-07-30 17:22 UTC (permalink / raw)
  To: Stephen Hemminger, Vlad Zolotarov; +Cc: dev

On 07/30/2015 08:01 PM, Stephen Hemminger wrote:
> On Thu, 30 Jul 2015 19:50:27 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 07/30/15 19:20, Avi Kivity wrote:
>>>
>>> On 07/30/2015 07:17 PM, Stephen Hemminger wrote:
>>>> On Thu, 30 Jul 2015 17:57:33 +0300
>>>> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>>>>
>>>>> Hi, Konstantin, Helin,
>>>>> there is a documented limitation of xl710 controllers (i40e driver)
>>>>> which is not handled in any way by a DPDK driver.
>>>>>    From the datasheet chapter 8.4.1:
>>>>>
>>>>> "• A single transmit packet may span up to 8 buffers (up to 8 data
>>>>> descriptors per packet including
>>>>> both the header and payload buffers).
>>>>> • The total number of data descriptors for the whole TSO (explained
>>>>> later on in this chapter) is
>>>>> unlimited as long as each segment within the TSO obeys the previous
>>>>> rule (up to 8 data descriptors
>>>>> per segment for both the TSO header and the segment payload buffers)."
>>>>>
>>>>> This means that, for instance, long cluster with small fragments has to
>>>>> be linearized before it may be placed on the HW ring.
>>>>> In more standard environments like Linux or FreeBSD drivers the
>>>>> solution
>>>>> is straight forward - call skb_linearize()/m_collapse() corresponding.
>>>>> In the non-conformist environment like DPDK life is not that easy -
>>>>> there is no easy way to collapse the cluster into a linear buffer from
>>>>> inside the device driver
>>>>> since device driver doesn't allocate memory in a fast path and utilizes
>>>>> the user allocated pools only.
>>>>>
>>>>> Here are two proposals for a solution:
>>>>>
>>>>>    1. We may provide a callback that would return a user TRUE if a give
>>>>>       cluster has to be linearized and it should always be called before
>>>>>       rte_eth_tx_burst(). Alternatively it may be called from inside the
>>>>>       rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return
>>>>> some
>>>>>       error code for a case when one of the clusters it's given has
>>>>> to be
>>>>>       linearized.
>>>>>    2. Another option is to allocate a mempool in the driver with the
>>>>>       elements consuming a single page each (standard 2KB buffers would
>>>>>       do). Number of elements in the pool should be as Tx ring length
>>>>>       multiplied by "64KB/(linear data length of the buffer in the pool
>>>>>       above)". Here I use 64KB as a maximum packet length and not taking
>>>>>       into an account esoteric things like "Giant" TSO mentioned in the
>>>>>       spec above. Then we may actually go and linearize the cluster if
>>>>>       needed on top of the buffers from the pool above, post the buffer
>>>>>       from the mempool above on the HW ring, link the original
>>>>> cluster to
>>>>>       that new cluster (using the private data) and release it when the
>>>>>       send is done.
>>>> Or just silently drop heavily scattered packets (and increment oerrors)
>>>> with a PMD_TX_LOG debug message.
>>>>
>>>> I think a DPDK driver doesn't have to accept all possible mbufs and do
>>>> extra work. It seems reasonable to expect caller to be well behaved
>>>> in this restricted ecosystem.
>>>>
>>> How can the caller know what's well behaved?  It's device dependent.
>> +1
>>
>> Stephen, how do you imagine this well-behaved application? Having switch
>> case by an underlying device type and then "well-behaving" correspondingly?
>> Not to mention that to "well-behave" the application writer has to read
>> HW specs and understand them, which would limit the amount of DPDK
>> developers to a very small amount of people... ;) Not to mention that
>> the mentioned above switch-case would be a super ugly thing to be found
>> in an application that would raise a big question about the
>> justification of a DPDK existence as as SDK providing device drivers
>> interface. ;)
> Either have a RTE_MAX_MBUF_SEGMENTS that is global or
> a mbuf_linearize function?  Driver already can stash the
> mbuf pool used for Rx and reuse it for the transient Tx buffers.
>

The pass/fail criteria are much more complicated than that.  You might 
have a packet with 340 fragments transmitted successfully (64k/1500*8), 
or a packet with 9 fragments fail.

What's wrong with exposing the pass/fail criteria as a driver-supplied 
function?  If the application is sure that its mbufs pass, it can choose 
not to call it.  A less constrained application will call it, and 
linearize the packet itself if it fails the test.
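
For illustration, a driver-supplied check for the xl710 rule could look
roughly like the sketch below. It is deliberately approximate: the non-TSO
branch follows the datasheet quote, while the TSO branch merely demands
that every window of 8 consecutive buffers carries at least one MSS of
data, which catches the "many tiny fragments" case but is not the exact
criterion - which is precisely why it belongs inside the driver:

#include <stdbool.h>
#include <rte_mbuf.h>

#define XL710_TX_MAX_SEG 8

/* Illustrative only; neither the name nor the exact TSO criterion below
 * is taken from the DPDK tree. */
static bool
xl710_tx_pkt_fits(const struct rte_mbuf *pkt)
{
	uint16_t len[XL710_TX_MAX_SEG] = { 0 };
	uint32_t window = 0;	/* bytes held by the last 8 buffers */
	unsigned int i = 0;
	const struct rte_mbuf *seg;

	if (!(pkt->ol_flags & PKT_TX_TCP_SEG))
		return pkt->nb_segs <= XL710_TX_MAX_SEG;

	for (seg = pkt; seg != NULL; seg = seg->next, i++) {
		window += seg->data_len - len[i % XL710_TX_MAX_SEG];
		len[i % XL710_TX_MAX_SEG] = seg->data_len;
		if (i >= XL710_TX_MAX_SEG - 1 && window < pkt->tso_segsz)
			return false;	/* 8 buffers hold less than 1 MSS */
	}
	return true;
}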

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] i40e xmit path HW limitation
  2015-07-30 16:44   ` Vlad Zolotarov
@ 2015-07-30 17:33     ` Zhang, Helin
  2015-07-30 17:56       ` Vlad Zolotarov
  0 siblings, 1 reply; 13+ messages in thread
From: Zhang, Helin @ 2015-07-30 17:33 UTC (permalink / raw)
  To: Vlad Zolotarov, Ananyev, Konstantin; +Cc: dev



> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Thursday, July 30, 2015 9:44 AM
> To: Zhang, Helin; Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: i40e xmit path HW limitation
> 
> 
> 
> On 07/30/15 19:10, Zhang, Helin wrote:
> >
> >> -----Original Message-----
> >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> >> Sent: Thursday, July 30, 2015 7:58 AM
> >> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
> >> Subject: RFC: i40e xmit path HW limitation
> >>
> >> Hi, Konstantin, Helin,
> >> there is a documented limitation of xl710 controllers (i40e driver)
> >> which is not handled in any way by a DPDK driver.
> >>   From the datasheet chapter 8.4.1:
> >>
> >> "• A single transmit packet may span up to 8 buffers (up to 8 data
> >> descriptors per packet including both the header and payload buffers).
> >> • The total number of data descriptors for the whole TSO (explained
> >> later on in this chapter) is unlimited as long as each segment within
> >> the TSO obeys the previous rule (up to 8 data descriptors per segment
> >> for both the TSO header and the segment payload buffers)."
> > Yes, I remember the RX side just supports 5 segments per packet receiving.
> > But what's the possible issue you thought about?
> Note that it's a Tx size we are talking about.
> 
> See 30520831f058cd9d75c0f6b360bc5c5ae49b5f27 commit in linux net-next repo.
> If such a cluster arrives and you post it on the HW ring - HW will shut this HW ring
> down permanently. The application will see that it's ring is stuck.
That issue was caused by using more than 8 descriptors for a packet for TSO.

> 
> >
> >> This means that, for instance, long cluster with small fragments has to be
> >> linearized before it may be placed on the HW ring.
> > What type of size of the small fragments? Basically 2KB is the default size of
> mbuf of most
> > example applications. 2KB x 8 is bigger than 1.5KB. So it is enough for the
> maximum
> > packet size we supported.
> > If 1KB mbuf is used, don't expect it can transmit more than 8KB size of packet.
> 
> I kinda lost u here. Again, we talk about the Tx side here and buffers
> are not obligatory completely filled. Namely there may be a cluster with
> 15 fragments 100 bytes each.
The root cause is using more than 8 descriptors for a packet. The Linux
driver can help reduce the number of descriptors used by merging small
payload fragments together, right?
It is not for TSO, it is just for packet transmitting. There are 2 options in my mind:
1. The user should ensure it will not use more than 8 descriptors per packet for transmitting.
2. The DPDK driver should try to merge small fragments together in such a case, like the Linux kernel driver does.
I prefer option 1: users should ensure that in the application or upper-layer software,
and keep the PMD driver as simple as possible.

But I have a thought that the maximum number of RX/TX descriptors should be
able to be queried somewhere.
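
One hypothetical way such a limit could be exposed is through
rte_eth_dev_info, for example (the tx_desc_lim.nb_seg_max field shown here
did not exist at the time of this thread; port_id and pkt are assumed
variables, and the snippet is only a sketch of the idea):

#include <rte_ethdev.h>

struct rte_eth_dev_info info;

rte_eth_dev_info_get(port_id, &info);
/* Hypothetical limit: max Tx buffers per packet without linearization. */
if (pkt->nb_segs > info.tx_desc_lim.nb_seg_max)
	/* linearize or drop the packet */;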

Regards,
Helin
> 
> >
> >> In more standard environments like Linux or FreeBSD drivers the solution is
> >> straight forward - call skb_linearize()/m_collapse() corresponding.
> >> In the non-conformist environment like DPDK life is not that easy - there is no
> >> easy way to collapse the cluster into a linear buffer from inside the device
> driver
> >> since device driver doesn't allocate memory in a fast path and utilizes the user
> >> allocated pools only.
> >> Here are two proposals for a solution:
> >>
> >>   1. We may provide a callback that would return a user TRUE if a give
> >>      cluster has to be linearized and it should always be called before
> >>      rte_eth_tx_burst(). Alternatively it may be called from inside the
> >>      rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some
> >>      error code for a case when one of the clusters it's given has to be
> >>      linearized.
> >>   2. Another option is to allocate a mempool in the driver with the
> >>      elements consuming a single page each (standard 2KB buffers would
> >>      do). Number of elements in the pool should be as Tx ring length
> >>      multiplied by "64KB/(linear data length of the buffer in the pool
> >>      above)". Here I use 64KB as a maximum packet length and not taking
> >>      into an account esoteric things like "Giant" TSO mentioned in the
> >>      spec above. Then we may actually go and linearize the cluster if
> >>      needed on top of the buffers from the pool above, post the buffer
> >>      from the mempool above on the HW ring, link the original cluster to
> >>      that new cluster (using the private data) and release it when the
> >>      send is done.
> >>
> >>
> >> The first is a change in the API and would require from the application some
> >> additional handling (linearization). The second would require some additional
> >> memory but would keep all dirty details inside the driver and would leave the
> >> rest of the code intact.
> >>
> >> Pls., comment.
> >>
> >> thanks,
> >> vlad
> >>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] i40e xmit path HW limitation
  2015-07-30 17:33     ` Zhang, Helin
@ 2015-07-30 17:56       ` Vlad Zolotarov
  2015-07-30 19:00         ` Zhang, Helin
  0 siblings, 1 reply; 13+ messages in thread
From: Vlad Zolotarov @ 2015-07-30 17:56 UTC (permalink / raw)
  To: Zhang, Helin, Ananyev, Konstantin; +Cc: dev



On 07/30/15 20:33, Zhang, Helin wrote:
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Thursday, July 30, 2015 9:44 AM
>> To: Zhang, Helin; Ananyev, Konstantin
>> Cc: dev@dpdk.org
>> Subject: Re: i40e xmit path HW limitation
>>
>>
>>
>> On 07/30/15 19:10, Zhang, Helin wrote:
>>>> -----Original Message-----
>>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>>>> Sent: Thursday, July 30, 2015 7:58 AM
>>>> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
>>>> Subject: RFC: i40e xmit path HW limitation
>>>>
>>>> Hi, Konstantin, Helin,
>>>> there is a documented limitation of xl710 controllers (i40e driver)
>>>> which is not handled in any way by a DPDK driver.
>>>>    From the datasheet chapter 8.4.1:
>>>>
>>>> "• A single transmit packet may span up to 8 buffers (up to 8 data
>>>> descriptors per packet including both the header and payload buffers).
>>>> • The total number of data descriptors for the whole TSO (explained
>>>> later on in this chapter) is unlimited as long as each segment within
>>>> the TSO obeys the previous rule (up to 8 data descriptors per segment
>>>> for both the TSO header and the segment payload buffers)."
>>> Yes, I remember the RX side just supports 5 segments per packet receiving.
>>> But what's the possible issue you thought about?
>> Note that it's a Tx size we are talking about.
>>
>> See 30520831f058cd9d75c0f6b360bc5c5ae49b5f27 commit in linux net-next repo.
>> If such a cluster arrives and you post it on the HW ring - HW will shut this HW ring
>> down permanently. The application will see that it's ring is stuck.
> That issue was because of using more than 8 descriptors for a packet for TSO.

There is no problem in transmitting a TSO packet with more than 8 
fragments.
On the contrary - one can't transmit a non-TSO packet with more than 8 
fragments.
One also can't transmit a TSO packet that would contain more than 8 
fragments in a single TSO segment, including the TSO headers.

Pls., read the HW spec quoted above for more details.

>
>>>> This means that, for instance, long cluster with small fragments has to be
>>>> linearized before it may be placed on the HW ring.
>>> What type of size of the small fragments? Basically 2KB is the default size of
>> mbuf of most
>>> example applications. 2KB x 8 is bigger than 1.5KB. So it is enough for the
>> maximum
>>> packet size we supported.
>>> If 1KB mbuf is used, don't expect it can transmit more than 8KB size of packet.
>> I kinda lost u here. Again, we talk about the Tx side here and buffers
>> are not obligatory completely filled. Namely there may be a cluster with
>> 15 fragments 100 bytes each.
> The root cause is using more than 8 descriptors for a packet.

That would be the case if you wanted to SUPER simplify the HW limitation 
above. In that case you would significantly limit the variety of packets 
that may be sent without linearization.

> Linux driver can help
> on reducing number of descriptors to be used by merging small size of payload
> together, right?
> It is not for TSO, it is just for packet transmitting. 2 options in my mind:
> 1. Use should ensure it will not use more than 8 descriptors per packet for transmitting.

This requirement is too restrictive. Pls., see above.

> 2. DPDK driver should try to merge small packet together for such case, like Linux kernel driver.
> I prefer to use option 1, users should ensure that in the application or up layer software,
> and keep the PMD driver as simple as possible.

The above statement is super confusing: on the one hand you suggest that 
the DPDK driver merge the small packets (fragments?) together (how?), and 
then you immediately propose that the user application do that. Could 
you, pls., clarify what exactly you suggest here?
If it is to be left to the application - note that it would demand 
patching all existing DPDK applications that send TCP packets.

>
> But I have a thought that the maximum number of RX/TX descriptor should be able to be
> queried somewhere.

There is no such thing as a maximum number of Tx fragments in the TSO 
case. It's limited only by the Tx ring size.

>
> Regards,
> Helin
>>>> In more standard environments like Linux or FreeBSD drivers the solution is
>>>> straight forward - call skb_linearize()/m_collapse() corresponding.
>>>> In the non-conformist environment like DPDK life is not that easy - there is no
>>>> easy way to collapse the cluster into a linear buffer from inside the device
>> driver
>>>> since device driver doesn't allocate memory in a fast path and utilizes the user
>>>> allocated pools only.
>>>> Here are two proposals for a solution:
>>>>
>>>>    1. We may provide a callback that would return a user TRUE if a give
>>>>       cluster has to be linearized and it should always be called before
>>>>       rte_eth_tx_burst(). Alternatively it may be called from inside the
>>>>       rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return some
>>>>       error code for a case when one of the clusters it's given has to be
>>>>       linearized.
>>>>    2. Another option is to allocate a mempool in the driver with the
>>>>       elements consuming a single page each (standard 2KB buffers would
>>>>       do). Number of elements in the pool should be as Tx ring length
>>>>       multiplied by "64KB/(linear data length of the buffer in the pool
>>>>       above)". Here I use 64KB as a maximum packet length and not taking
>>>>       into an account esoteric things like "Giant" TSO mentioned in the
>>>>       spec above. Then we may actually go and linearize the cluster if
>>>>       needed on top of the buffers from the pool above, post the buffer
>>>>       from the mempool above on the HW ring, link the original cluster to
>>>>       that new cluster (using the private data) and release it when the
>>>>       send is done.
>>>>
>>>>
>>>> The first is a change in the API and would require from the application some
>>>> additional handling (linearization). The second would require some additional
>>>> memory but would keep all dirty details inside the driver and would leave the
>>>> rest of the code intact.
>>>>
>>>> Pls., comment.
>>>>
>>>> thanks,
>>>> vlad
>>>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] i40e xmit path HW limitation
  2015-07-30 17:56       ` Vlad Zolotarov
@ 2015-07-30 19:00         ` Zhang, Helin
  2015-07-30 19:25           ` Vladislav Zolotarov
  0 siblings, 1 reply; 13+ messages in thread
From: Zhang, Helin @ 2015-07-30 19:00 UTC (permalink / raw)
  To: Vlad Zolotarov, Ananyev, Konstantin; +Cc: dev



> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Thursday, July 30, 2015 10:56 AM
> To: Zhang, Helin; Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: i40e xmit path HW limitation
> 
> 
> 
> On 07/30/15 20:33, Zhang, Helin wrote:
> >
> >> -----Original Message-----
> >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> >> Sent: Thursday, July 30, 2015 9:44 AM
> >> To: Zhang, Helin; Ananyev, Konstantin
> >> Cc: dev@dpdk.org
> >> Subject: Re: i40e xmit path HW limitation
> >>
> >>
> >>
> >> On 07/30/15 19:10, Zhang, Helin wrote:
> >>>> -----Original Message-----
> >>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> >>>> Sent: Thursday, July 30, 2015 7:58 AM
> >>>> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
> >>>> Subject: RFC: i40e xmit path HW limitation
> >>>>
> >>>> Hi, Konstantin, Helin,
> >>>> there is a documented limitation of xl710 controllers (i40e driver)
> >>>> which is not handled in any way by a DPDK driver.
> >>>>    From the datasheet chapter 8.4.1:
> >>>>
> >>>> "• A single transmit packet may span up to 8 buffers (up to 8 data
> >>>> descriptors per packet including both the header and payload buffers).
> >>>> • The total number of data descriptors for the whole TSO (explained
> >>>> later on in this chapter) is unlimited as long as each segment
> >>>> within the TSO obeys the previous rule (up to 8 data descriptors
> >>>> per segment for both the TSO header and the segment payload buffers)."
> >>> Yes, I remember the RX side just supports 5 segments per packet receiving.
> >>> But what's the possible issue you thought about?
> >> Note that it's a Tx size we are talking about.
> >>
> >> See 30520831f058cd9d75c0f6b360bc5c5ae49b5f27 commit in linux net-next
> repo.
> >> If such a cluster arrives and you post it on the HW ring - HW will
> >> shut this HW ring down permanently. The application will see that it's ring is
> stuck.
> > That issue was because of using more than 8 descriptors for a packet for TSO.
> 
> There is no problem in transmitting the TSO packet with more than 8 fragments.
> On the opposite - one can't transmit a non-TSO packet with more than 8
> fragments.
> One also can't transmit the TSO packet that would contain more than 8 fragments
> in a single TSO segment including the TSO headers.
> 
> Pls., read the HW spec as I quoted above for more details.
I meant a packet to be transmitted by the hardware, not the TSO packet in memory.
It could be a segment of a TSO packet in memory.
The linearize check in the kernel driver is not for TSO only; it is for both TSO and
non-TSO cases.

> 
> >
> >>>> This means that, for instance, long cluster with small fragments
> >>>> has to be linearized before it may be placed on the HW ring.
> >>> What type of size of the small fragments? Basically 2KB is the
> >>> default size of
> >> mbuf of most
> >>> example applications. 2KB x 8 is bigger than 1.5KB. So it is enough
> >>> for the
> >> maximum
> >>> packet size we supported.
> >>> If 1KB mbuf is used, don't expect it can transmit more than 8KB size of
> packet.
> >> I kinda lost u here. Again, we talk about the Tx side here and
> >> buffers are not obligatory completely filled. Namely there may be a
> >> cluster with
> >> 15 fragments 100 bytes each.
> > The root cause is using more than 8 descriptors for a packet.
> 
> That would be if u would like to SUPER simplify the HW limitation above.
> In that case u would significantly limit the different packets that may be sent
> without the linearization.
> 
> > Linux driver can help
> > on reducing number of descriptors to be used by merging small size of
> > payload together, right?
> > It is not for TSO, it is just for packet transmitting. 2 options in my mind:
> > 1. Use should ensure it will not use more than 8 descriptors per packet for
> transmitting.
> 
> This requirement is too restricting. Pls., see above.
> 
> > 2. DPDK driver should try to merge small packet together for such case, like
> Linux kernel driver.
> > I prefer to use option 1, users should ensure that in the application
> > or up layer software, and keep the PMD driver as simple as possible.
> 
> The above statement is super confusing: on the one hand u suggest the DPDK
> driver to merge the small packet (fragments?) together (how?) and then u
> immediately propose the user application to do that. Could u, pls., clarify what
> exactly u suggest here?
> If that's to leave it to the application - note that it would demand patching all
> existing DPDK applications that send TCP packets.
Those are two obvious options. One is to do it in the PMD, the other is to do
it in the upper layer. I did not mean that both need to be done!


> 
> >
> > But I have a thought that the maximum number of RX/TX descriptor
> > should be able to be queried somewhere.
> 
> There is no such thing as maximum number of Tx fragments in a TSO case.
> It's only limited by the Tx ring size.
Again, it is not for the TSO case only. Are you talking about how to implement it?
Anything missing can be added, as long as it is reasonable.

Regards,
Helin

> 
> >
> > Regards,
> > Helin
> >>>> In more standard environments like Linux or FreeBSD drivers the
> >>>> solution is straight forward - call skb_linearize()/m_collapse()
> corresponding.
> >>>> In the non-conformist environment like DPDK life is not that easy -
> >>>> there is no easy way to collapse the cluster into a linear buffer
> >>>> from inside the device
> >> driver
> >>>> since device driver doesn't allocate memory in a fast path and
> >>>> utilizes the user allocated pools only.
> >>>> Here are two proposals for a solution:
> >>>>
> >>>>    1. We may provide a callback that would return a user TRUE if a give
> >>>>       cluster has to be linearized and it should always be called before
> >>>>       rte_eth_tx_burst(). Alternatively it may be called from inside the
> >>>>       rte_eth_tx_burst() and rte_eth_tx_burst() is changed to return
> some
> >>>>       error code for a case when one of the clusters it's given has to be
> >>>>       linearized.
> >>>>    2. Another option is to allocate a mempool in the driver with the
> >>>>       elements consuming a single page each (standard 2KB buffers
> would
> >>>>       do). Number of elements in the pool should be as Tx ring length
> >>>>       multiplied by "64KB/(linear data length of the buffer in the pool
> >>>>       above)". Here I use 64KB as a maximum packet length and not
> taking
> >>>>       into an account esoteric things like "Giant" TSO mentioned in the
> >>>>       spec above. Then we may actually go and linearize the cluster if
> >>>>       needed on top of the buffers from the pool above, post the buffer
> >>>>       from the mempool above on the HW ring, link the original cluster to
> >>>>       that new cluster (using the private data) and release it when the
> >>>>       send is done.
> >>>>
> >>>>
> >>>> The first is a change in the API and would require from the
> >>>> application some additional handling (linearization). The second
> >>>> would require some additional memory but would keep all dirty
> >>>> details inside the driver and would leave the rest of the code intact.
> >>>>
> >>>> Pls., comment.
> >>>>
> >>>> thanks,
> >>>> vlad
> >>>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] i40e xmit path HW limitation
  2015-07-30 19:00         ` Zhang, Helin
@ 2015-07-30 19:25           ` Vladislav Zolotarov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladislav Zolotarov @ 2015-07-30 19:25 UTC (permalink / raw)
  To: Helin Zhang; +Cc: dev

On Jul 30, 2015 22:00, "Zhang, Helin" <helin.zhang@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > Sent: Thursday, July 30, 2015 10:56 AM
> > To: Zhang, Helin; Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: i40e xmit path HW limitation
> >
> >
> >
> > On 07/30/15 20:33, Zhang, Helin wrote:
> > >
> > >> -----Original Message-----
> > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > >> Sent: Thursday, July 30, 2015 9:44 AM
> > >> To: Zhang, Helin; Ananyev, Konstantin
> > >> Cc: dev@dpdk.org
> > >> Subject: Re: i40e xmit path HW limitation
> > >>
> > >>
> > >>
> > >> On 07/30/15 19:10, Zhang, Helin wrote:
> > >>>> -----Original Message-----
> > >>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > >>>> Sent: Thursday, July 30, 2015 7:58 AM
> > >>>> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
> > >>>> Subject: RFC: i40e xmit path HW limitation
> > >>>>
> > >>>> Hi, Konstantin, Helin,
> > >>>> there is a documented limitation of xl710 controllers (i40e driver)
> > >>>> which is not handled in any way by a DPDK driver.
> > >>>>    From the datasheet chapter 8.4.1:
> > >>>>
> > >>>> "• A single transmit packet may span up to 8 buffers (up to 8 data
> > >>>> descriptors per packet including both the header and payload
buffers).
> > >>>> • The total number of data descriptors for the whole TSO (explained
> > >>>> later on in this chapter) is unlimited as long as each segment
> > >>>> within the TSO obeys the previous rule (up to 8 data descriptors
> > >>>> per segment for both the TSO header and the segment payload
buffers)."
> > >>> Yes, I remember the RX side just supports 5 segments per packet receiving.
> > >>> But what's the possible issue you thought about?
> > >> Note that it's the Tx side we are talking about.
> > >>
> > >> See commit 30520831f058cd9d75c0f6b360bc5c5ae49b5f27 in the linux net-next
> > >> repo. If such a cluster arrives and you post it on the HW ring - HW will
> > >> shut this HW ring down permanently. The application will see that its
> > >> ring is stuck.
> > > That issue was because of using more than 8 descriptors for a packet for TSO.
> >
> > There is no problem in transmitting a TSO packet with more than 8 fragments.
> > On the contrary - one can't transmit a non-TSO packet with more than 8
> > fragments.
> > One also can't transmit a TSO packet that would contain more than 8 fragments
> > in a single TSO segment, including the TSO headers.
> >
> > Pls., read the HW spec as I quoted above for more details.
> I meant a packet to be transmitted by the hardware, not the TSO packet in
> memory. It could be a segment of the TSO packet in memory.
> The linearize check in the kernel driver is not for TSO only, it is for both
> TSO and non-TSO cases.

That's what I was trying to tell you. Great, we are on the same page at
last... 😉
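For the record, here is a minimal sketch of the check implied by that rule,
written against a DPDK mbuf chain. It is not the kernel driver's check nor
anything from the i40e PMD - the function name, the 7-buffer payload window
and treating every data_len byte as payload are simplifications made purely
for illustration:

#include <stdbool.h>
#include <rte_mbuf.h>

#define XL710_MAX_DESC_PER_SEG 8                     /* per packet (non-TSO) or per TSO segment */
#define PAYLOAD_WINDOW (XL710_MAX_DESC_PER_SEG - 1)  /* leave one descriptor for the header */

static bool
xl710_needs_linearization(const struct rte_mbuf *m, bool is_tso, uint16_t mss)
{
	const struct rte_mbuf *seg;
	uint32_t win[PAYLOAD_WINDOW] = { 0 };
	uint32_t sum = 0;
	unsigned int idx = 0, seen = 0;

	/* Non-TSO: the whole packet may use at most 8 data descriptors. */
	if (!is_tso)
		return m->nb_segs > XL710_MAX_DESC_PER_SEG;

	/*
	 * TSO (conservative simplification): if any 7 consecutive buffers
	 * hold less than one MSS of data, a single segment could need more
	 * than 7 payload descriptors plus one for the header - flag it.
	 */
	for (seg = m; seg != NULL; seg = seg->next) {
		sum -= win[idx];
		win[idx] = seg->data_len;
		sum += win[idx];
		idx = (idx + 1) % PAYLOAD_WINDOW;
		if (++seen >= PAYLOAD_WINDOW && sum < mss)
			return true;
	}
	return false;
}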

>
> >
> > >
> > >>>> This means that, for instance, long cluster with small fragments
> > >>>> has to be linearized before it may be placed on the HW ring.
> > >>> What size are the small fragments? Basically 2KB is the default mbuf
> > >>> size in most example applications. 2KB x 8 is bigger than 1.5KB, so it
> > >>> is enough for the maximum packet size we support.
> > >>> If a 1KB mbuf is used, don't expect it to transmit packets larger than 8KB.
> > >> I kinda lost u here. Again, we are talking about the Tx side here, and
> > >> buffers are not necessarily completely filled. Namely, there may be a
> > >> cluster with 15 fragments of 100 bytes each.
> > > The root cause is using more than 8 descriptors for a packet.
> >
> > That would be if u would like to SUPER simplify the HW limitation above.
> > In that case u would significantly limit the set of packets that may be sent
> > without linearization.
> >
> > > The Linux driver can help reduce the number of descriptors used by merging
> > > small payload fragments together, right?
> > > It is not for TSO, it is just for packet transmission. 2 options in my mind:
> > > 1. The user should ensure that no more than 8 descriptors per packet are
> > >    used for transmitting.
> >
> > This requirement is too restrictive. Pls., see above.
> >
> > > 2. The DPDK driver should try to merge small packets together for such a
> > >    case, like the Linux kernel driver.
> > > I prefer to use option 1, users should ensure that in the application
> > > or up layer software, and keep the PMD driver as simple as possible.
> >
> > The above statement is super confusing: on the one hand u suggest that the
> > DPDK driver merge the small packets (fragments?) together (how?), and then u
> > immediately propose that the user application do that. Could u, pls., clarify
> > what exactly u suggest here?
> > If that's to leave it to the application - note that it would demand patching
> > all existing DPDK applications that send TCP packets.
> Those are two obvious options. One is to do that in the PMD, the other is to
> do that in the upper layer. I did not mean that both need to be done!

OK. I just didn't understand where the description of option (2) ended. Now I
get you... 😉
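Just to illustrate what the "handle it in the upper layer" option would mean
for an application, here is a rough sketch of a collapse helper an application
could call before rte_eth_tx_burst() on chains that fail the check sketched
earlier. The big_pool of large mbufs, the helper name and the simplified
offload-metadata handling are assumptions for illustration, not an existing
DPDK API:

#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_memcpy.h>

/* Copy a fragmented chain into one mbuf taken from a pool of large buffers. */
static struct rte_mbuf *
collapse_chain(struct rte_mbuf *m, struct rte_mempool *big_pool)
{
	struct rte_mbuf *flat = rte_pktmbuf_alloc(big_pool);
	struct rte_mbuf *seg;
	char *dst;

	if (flat == NULL)
		return NULL;

	for (seg = m; seg != NULL; seg = seg->next) {
		dst = rte_pktmbuf_append(flat, seg->data_len);
		if (dst == NULL) {          /* chain doesn't fit the big buffer */
			rte_pktmbuf_free(flat);
			return NULL;
		}
		rte_memcpy(dst, rte_pktmbuf_mtod(seg, void *), seg->data_len);
	}
	flat->ol_flags = m->ol_flags;       /* offload metadata copy is simplified here */
	rte_pktmbuf_free(m);                /* the original chain is no longer needed */
	return flat;
}

Every application sending TCP traffic would have to carry something like this,
which is exactly the code duplication concern raised below.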
>
>
> >
> > >
> > > But I have a thought that the maximum number of RX/TX descriptors should
> > > be able to be queried somewhere.
> >
> > There is no such thing as maximum number of Tx fragments in a TSO case.
> > It's only limited by the Tx ring size.
> Again, it is not for the TSO case only. Are you talking about how to
> implement it?

I understand that, and what I was trying to say is that any limit we choose
that satisfies the non-TSO case would be too restrictive for the TSO case.
Therefore I'd suggest going with the second option and implementing the merging
in the driver. Not only would it be the cleanest and most robust way, it would
also prevent tremendous code duplication across all applications susceptible to
this HW limitation.
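To make that second option a bit more concrete, here is a rough sketch of how
a driver-private staging pool could be dimensioned, following the "Tx ring
length multiplied by 64KB / linear buffer length" rule from the original
proposal. The function and pool names are made up for illustration and this is
not the i40e PMD implementation; on Tx completion the driver would return the
staging buffers to this pool and release the original cluster linked to them:

#include <stdio.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define STAGING_BUF_LEN  2048u              /* linear data bytes per staging buffer */
#define MAX_TSO_PKT_LEN  (64u * 1024u)      /* ignore the "giant" TSO case */

/* One staging pool per Tx queue, dimensioned so that a ring full of worst-case
 * 64KB packets can always be linearized. */
static struct rte_mempool *
i40e_tx_staging_pool_create(uint16_t port_id, uint16_t queue_id,
			    uint16_t nb_tx_desc, int socket_id)
{
	char name[RTE_MEMPOOL_NAMESIZE];
	unsigned int nb_bufs = (unsigned int)nb_tx_desc *
			       (MAX_TSO_PKT_LEN / STAGING_BUF_LEN);

	snprintf(name, sizeof(name), "i40e_tx_stage_p%u_q%u",
		 (unsigned int)port_id, (unsigned int)queue_id);
	return rte_pktmbuf_pool_create(name, nb_bufs, 0 /* no cache */,
				       0 /* priv size */,
				       STAGING_BUF_LEN + RTE_PKTMBUF_HEADROOM,
				       socket_id);
}

With fully packed 2KB staging buffers, each MSS-sized segment of the rebuilt
chain spans only a few descriptors, so the 8-descriptor-per-segment rule is
comfortably satisfied for any reasonable MSS.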

> Anything missed can be added, as long as it is reasonable.

>
> Regards,
> Helin
>
> >
> > >
> > > Regards,
> > > Helin
> > >>>> In more standard environments like Linux or FreeBSD drivers the
> > >>>> solution is straight forward - call skb_linearize()/m_collapse()
> > corresponding.
> > >>>> In the non-conformist environment like DPDK life is not that easy -
> > >>>> there is no easy way to collapse the cluster into a linear buffer
> > >>>> from inside the device
> > >> driver
> > >>>> since device driver doesn't allocate memory in a fast path and
> > >>>> utilizes the user allocated pools only.
> > >>>> Here are two proposals for a solution:
> > >>>>
> > >>>>    1. We may provide a callback that would return a user TRUE if a
give
> > >>>>       cluster has to be linearized and it should always be called
before
> > >>>>       rte_eth_tx_burst(). Alternatively it may be called from
inside the
> > >>>>       rte_eth_tx_burst() and rte_eth_tx_burst() is changed to
return
> > some
> > >>>>       error code for a case when one of the clusters it's given
has to be
> > >>>>       linearized.
> > >>>>    2. Another option is to allocate a mempool in the driver with
the
> > >>>>       elements consuming a single page each (standard 2KB buffers
> > would
> > >>>>       do). Number of elements in the pool should be as Tx ring
length
> > >>>>       multiplied by "64KB/(linear data length of the buffer in the
pool
> > >>>>       above)". Here I use 64KB as a maximum packet length and not
> > taking
> > >>>>       into an account esoteric things like "Giant" TSO mentioned
in the
> > >>>>       spec above. Then we may actually go and linearize the
cluster if
> > >>>>       needed on top of the buffers from the pool above, post the
buffer
> > >>>>       from the mempool above on the HW ring, link the original
cluster to
> > >>>>       that new cluster (using the private data) and release it
when the
> > >>>>       send is done.
> > >>>>
> > >>>>
> > >>>> The first is a change in the API and would require from the
> > >>>> application some additional handling (linearization). The second
> > >>>> would require some additional memory but would keep all dirty
> > >>>> details inside the driver and would leave the rest of the code
intact.
> > >>>>
> > >>>> Pls., comment.
> > >>>>
> > >>>> thanks,
> > >>>> vlad
> > >>>>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-07-30 19:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-30 14:57 [dpdk-dev] RFC: i40e xmit path HW limitation Vlad Zolotarov
2015-07-30 16:10 ` [dpdk-dev] " Zhang, Helin
2015-07-30 16:44   ` Vlad Zolotarov
2015-07-30 17:33     ` Zhang, Helin
2015-07-30 17:56       ` Vlad Zolotarov
2015-07-30 19:00         ` Zhang, Helin
2015-07-30 19:25           ` Vladislav Zolotarov
2015-07-30 16:17 ` [dpdk-dev] RFC: " Stephen Hemminger
2015-07-30 16:20   ` Avi Kivity
2015-07-30 16:50     ` Vlad Zolotarov
2015-07-30 17:01       ` Stephen Hemminger
2015-07-30 17:14         ` Vlad Zolotarov
2015-07-30 17:22         ` Avi Kivity
