DPDK patches and discussions
* [dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx flow?
@ 2015-02-25  9:40 Vlad Zolotarov
  2015-02-25 11:02 ` Bruce Richardson
  0 siblings, 1 reply; 3+ messages in thread
From: Vlad Zolotarov @ 2015-02-25  9:40 UTC (permalink / raw)
  To: dev

Hi, I have a question about the "scattered Rx" feature: why does enabling
it disable the "bulk allocation" feature?
There is an unclear comment in ixgbe_recv_scattered_pkts():

		/*
		 * Descriptor done.
		 *
		 * Allocate a new mbuf to replenish the RX ring descriptor.
		 * If the allocation fails:
		 *    - arrange for that RX descriptor to be the first one
		 *      being parsed the next time the receive function is
		 *      invoked [on the same queue].
		 *
		 *    - Stop parsing the RX ring and return immediately.
		 *
		 * This policy does not drop the packet received in the RX
		 * descriptor for which the allocation of a new mbuf failed.
		 * Thus, it allows that packet to be later retrieved if
		 * mbuf have been freed in the mean time.
		 * As a side effect, holding RX descriptors instead of
		 * systematically giving them back to the NIC may lead to
		 * RX ring exhaustion situations.
		 * However, the NIC can gracefully prevent such situations
		 * to happen by sending specific "back-pressure" flow control
		 * frames to its peer(s).
		 */

Why can't the same "policy" be applied in the bulk-allocation context?
Don't advance the RDT until you've refilled the ring. What am I missing
here?
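
To illustrate what I mean, here is a rough sketch (the rx_queue fields and
the scan_ring()/refill_batch()/write_rdt() helpers are made up; this is
not the actual ixgbe code):

	uint16_t
	recv_pkts_bulk(struct rx_queue *rxq, struct rte_mbuf **pkts,
		       uint16_t n)
	{
		/* Collect packets from descriptors the NIC marked done. */
		uint16_t nb_rx = scan_ring(rxq, pkts, n);

		/* Try to replenish the consumed descriptors in one batch. */
		if (refill_batch(rxq, nb_rx) != 0) {
			/*
			 * Allocation failed: keep the RX descriptors, do NOT
			 * advance RDT, and retry on the next invocation. The
			 * packets already received are still returned.
			 */
			return nb_rx;
		}

		/* Hand descriptors back to the NIC only once refilled. */
		write_rdt(rxq, rxq->rx_tail);
		return nb_rx;
	}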

Another question is about the LRO feature - is there a reason why it
hasn't been implemented? I've implemented LRO support in the ixgbe PMD as
a starting point - I used the "scattered Rx" code as a template and now
I'm tuning it (things like the issue above).

Is there any philosophical reason why it hasn't been implemented in 
*any* PMD so far? ;)

thanks,
vlad


* Re: [dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx flow?
  2015-02-25  9:40 [dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx flow? Vlad Zolotarov
@ 2015-02-25 11:02 ` Bruce Richardson
  2015-02-25 16:46   ` Vlad Zolotarov
  0 siblings, 1 reply; 3+ messages in thread
From: Bruce Richardson @ 2015-02-25 11:02 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Feb 25, 2015 at 11:40:36AM +0200, Vlad Zolotarov wrote:
> Hi, I have a question about the "scattered Rx" feature: why does enabling
> it disable the "bulk allocation" feature?

The "bulk-allocation" feature is one where a more optimized RX code path is
used. For the sake of performance, certain assumptions were made on that code
path, one of which was that packets would fit inside a single mbuf. Without
this assumption, receiving packets becomes much more complicated and therefore
slower. [For similar reasons, the optimized TX routines, e.g. vector TX, are
only used if it is guaranteed that no hardware offload features are going to
be used.]

Now, it is possible, though challenging, to write optimized code for these more
complicated cases, such as scattered RX, or TX with offloads or scattered
packets. In general, we will always want separate routines for the simple and
the complicated cases, as the cost of checking for offloads or multi-mbuf
packets is significant enough to hurt performance badly when those features are
not needed. In the case of the vector PMD for ixgbe - our highest performance
path right now - we indeed have two receive routines, for the simple and the
scattered cases. For TX, we only have an optimized path for the simple case,
but that is not to say that someone may not provide one for the offload case
too at some point.
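
Roughly, the path selection at setup time looks like the sketch below
(simplified, not the driver code verbatim; the two condition flags stand
in for checks that are spread across the real code):

	if (dev->data->scattered_rx) {
		if (use_vector_rx)
			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
		else
			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
	} else if (rx_bulk_alloc_allowed) {
		/* Fast path: one mbuf per packet, bulk descriptor refills. */
		dev->rx_pkt_burst = ixgbe_recv_pkts_bulk_alloc;
	} else {
		dev->rx_pkt_burst = ixgbe_recv_pkts;
	}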

A final note on scattered packets in particular: if packets are too big to fit
in a single mbuf, then they are not small packets, and the processing time
available per packet is, by definition, larger than for packets that fit in a
single mbuf. For 64-byte packets, one packet arrives every 67ns at 10G, which
is approx 200 cycles at 3GHz. If we assume a standard 2k mbuf, then a packet
which spans two mbufs takes at least 1654ns on the wire, so a 3GHz CPU has
nearly 5000 cycles to process that same packet. Since the processing budget is
so much bigger, the need to optimize is much smaller, and it is therefore more
important to focus on the small-packet case, which is what we have done.
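
If you want to check the arithmetic, here is the back-of-envelope
calculation (assuming 20 bytes of preamble + inter-frame gap per packet
on the wire):

	#include <stdio.h>

	int main(void)
	{
		const double link_bps = 10e9; /* 10 Gbit/s             */
		const double cpu_hz   = 3e9;  /* 3 GHz                 */
		const double overhead = 20;   /* preamble + IFG, bytes */

		/* Wire time of a minimal and a two-mbuf packet. */
		double t_64 = (64 + overhead) * 8 / link_bps;   /* ~67 ns   */
		double t_2k = (2048 + overhead) * 8 / link_bps; /* ~1654 ns */

		printf("64B:    %.1f ns -> %.0f cycles\n", t_64 * 1e9, t_64 * cpu_hz);
		printf("2-mbuf: %.1f ns -> %.0f cycles\n", t_2k * 1e9, t_2k * cpu_hz);
		return 0;
	}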

> There is an unclear comment in ixgbe_recv_scattered_pkts():
> 
> 		/*
> 		 * Descriptor done.
> 		 *
> 		 * Allocate a new mbuf to replenish the RX ring descriptor.
> 		 * If the allocation fails:
> 		 *    - arrange for that RX descriptor to be the first one
> 		 *      being parsed the next time the receive function is
> 		 *      invoked [on the same queue].
> 		 *
> 		 *    - Stop parsing the RX ring and return immediately.
> 		 *
> 		 * This policy does not drop the packet received in the RX
> 		 * descriptor for which the allocation of a new mbuf failed.
> 		 * Thus, it allows that packet to be later retrieved if
> 		 * mbuf have been freed in the mean time.
> 		 * As a side effect, holding RX descriptors instead of
> 		 * systematically giving them back to the NIC may lead to
> 		 * RX ring exhaustion situations.
> 		 * However, the NIC can gracefully prevent such situations
> 		 * to happen by sending specific "back-pressure" flow control
> 		 * frames to its peer(s).
> 		 */
> 
> Why can't the same "policy" be applied in the bulk-allocation context? Don't
> advance the RDT until you've refilled the ring. What am I missing here?

A lot of the optimizations done in other code paths, such as bulk alloc, may
well be applicable here; it's just that the work has not been done yet, as the
focus has been elsewhere. For vector PMD RX, we now have routines that work on
both regular and scattered packets, and both perform much better than the
scalar equivalents. Note also that in every RX (and TX) routine, the NIC tail
pointer update is done just once, at the end of the function.
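
The common shape is roughly as follows (hypothetical helper names, just
to show the pattern):

	uint16_t
	rx_burst(struct rxq *q, struct rte_mbuf **pkts, uint16_t n)
	{
		uint16_t nb_rx = 0;

		/* Per-descriptor work: no MMIO access inside this loop. */
		while (nb_rx < n && descriptor_done(q))
			pkts[nb_rx++] = take_packet(q);

		replenish_descriptors(q, nb_rx);

		/*
		 * A single tail update per burst: the write goes across
		 * the PCI bus, so a per-packet update would be far too
		 * costly.
		 */
		if (nb_rx > 0)
			write_tail(q, q->rx_tail);

		return nb_rx;
	}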

> 
> Another question is about the LRO feature - is there a reason why it hasn't
> been implemented? I've implemented LRO support in the ixgbe PMD as a starting
> point - I used the "scattered Rx" code as a template and now I'm tuning it
> (things like the issue above).
> 
> Is there any philosophical reason why it hasn't been implemented in *any*
> PMD so far? ;)

I'm not aware of any philosophical reasons why it hasn't been done. Patches
are welcome, as always. :-)

/Bruce


* Re: [dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx flow?
  2015-02-25 11:02 ` Bruce Richardson
@ 2015-02-25 16:46   ` Vlad Zolotarov
  0 siblings, 0 replies; 3+ messages in thread
From: Vlad Zolotarov @ 2015-02-25 16:46 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev


On 02/25/15 13:02, Bruce Richardson wrote:
> On Wed, Feb 25, 2015 at 11:40:36AM +0200, Vlad Zolotarov wrote:
>> Hi, I have a question about the "scattered Rx" feature: why does enabling
>> it disable the "bulk allocation" feature?
> The "bulk-allocation" feature is one where a more optimized RX code path is
> used. For the sake of performance, certain assumptions were made on that code
> path, one of which was that packets would fit inside a single mbuf. Without
> this assumption, receiving packets becomes much more complicated and therefore
> slower. [For similar reasons, the optimized TX routines, e.g. vector TX, are
> only used if it is guaranteed that no hardware offload features are going to
> be used.]
>
> Now, it is possible, though challenging, to write optimized code for these more
> complicated cases, such as scattered RX, or TX with offloads or scattered
> packets. In general, we will always want separate routines for the simple and
> the complicated cases, as the cost of checking for offloads or multi-mbuf
> packets is significant enough to hurt performance badly when those features are
> not needed. In the case of the vector PMD for ixgbe - our highest performance
> path right now - we indeed have two receive routines, for the simple and the
> scattered cases. For TX, we only have an optimized path for the simple case,
> but that is not to say that someone may not provide one for the offload case
> too at some point.
>
> A final note on scattered packets in particular: if packets are too big to fit
> in a single mbuf, then they are not small packets, and the processing time
> available per packet is, by definition, larger than for packets that fit in a
> single mbuf. For 64-byte packets, one packet arrives every 67ns at 10G, which
> is approx 200 cycles at 3GHz. If we assume a standard 2k mbuf, then a packet
> which spans two mbufs takes at least 1654ns on the wire, so a 3GHz CPU has
> nearly 5000 cycles to process that same packet. Since the processing budget is
> so much bigger, the need to optimize is much smaller, and it is therefore more
> important to focus on the small-packet case, which is what we have done.

Sure. I'm doing my best not to harm the existing code paths: the RSC
handler is a separate function (I first patched the scalar scattered
function, but now I'm rewriting it as a stand-alone routine), I don't
change igb_rx_entry (I leave it as a single pointer) and I keep the
additional RSC info in a separate ring of descriptors that is not
accessed on the non-RSC flow.
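
For reference, the rough shape of what I have now (the field names are
mine and still in flux, nothing from the tree):

	/* Extra per-descriptor RSC state, kept out of igb_rx_entry. */
	struct igb_rsc_entry {
		struct rte_mbuf *fbuf; /* first mbuf of the aggregation */
	};

	struct igb_rx_queue {
		struct igb_rx_entry  *sw_ring;    /* unchanged            */
		struct igb_rsc_entry *sw_sc_ring; /* RSC state; untouched
		                                   * on the non-RSC path  */
		/* ... */
	};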

>
>> There is an unclear comment in ixgbe_recv_scattered_pkts():
>>
>> 		/*
>> 		 * Descriptor done.
>> 		 *
>> 		 * Allocate a new mbuf to replenish the RX ring descriptor.
>> 		 * If the allocation fails:
>> 		 *    - arrange for that RX descriptor to be the first one
>> 		 *      being parsed the next time the receive function is
>> 		 *      invoked [on the same queue].
>> 		 *
>> 		 *    - Stop parsing the RX ring and return immediately.
>> 		 *
>> 		 * This policy does not drop the packet received in the RX
>> 		 * descriptor for which the allocation of a new mbuf failed.
>> 		 * Thus, it allows that packet to be later retrieved if
>> 		 * mbuf have been freed in the mean time.
>> 		 * As a side effect, holding RX descriptors instead of
>> 		 * systematically giving them back to the NIC may lead to
>> 		 * RX ring exhaustion situations.
>> 		 * However, the NIC can gracefully prevent such situations
>> 		 * to happen by sending specific "back-pressure" flow control
>> 		 * frames to its peer(s).
>> 		 */
>>
>> Why can't the same "policy" be applied in the bulk-allocation context? Don't
>> advance the RDT until you've refilled the ring. What am I missing here?
> A lot of the optimizations done in other code paths, such as bulk alloc, may
> well be applicable here; it's just that the work has not been done yet, as the
> focus has been elsewhere. For vector PMD RX, we now have routines that work on
> both regular and scattered packets, and both perform much better than the
> scalar equivalents. Note also that in every RX (and TX) routine, the NIC tail
> pointer update is done just once, at the end of the function.

I see. Thanks for the detailed clarification. Although I've spent some
time with DPDK, I still sometimes feel that I don't fully understand the
original author's idea, and clarifications like yours really help.
I looked at the vectored receive function (_recv_raw_pkts_vec()) and it
is one cryptic piece of code! ;) Since you've brought it up - could you
direct me to the measurements comparing the vectored and scalar DPDK
data paths, please? I wonder how working without CSUM offload, for
instance, can be faster even for small packets like you mentioned above:
one would have to calculate the checksum in SW in that case, and I'm
puzzled how that can be faster than letting the HW do it...

>
>> Another question is about the LRO feature - is there a reason why it hasn't
>> been implemented? I've implemented LRO support in the ixgbe PMD as a starting
>> point - I used the "scattered Rx" code as a template and now I'm tuning it
>> (things like the issue above).
>>
>> Is there any philosophical reason why it hasn't been implemented in *any*
>> PMD so far? ;)
> I'm not aware of any philosophical reasons why it hasn't been done. Patches
> are welcome, as always. :-)

Great! So, I'll send what I have once it's ready... ;)

Again, thanks for the great clarification.

>
> /Bruce
>

