DPDK patches and discussions
* [dpdk-dev] rte_mbuf.next in 2nd cacheline
@ 2015-06-10 21:47 Damjan Marion (damarion)
  2015-06-15 13:20 ` Olivier MATZ
  0 siblings, 1 reply; 26+ messages in thread
From: Damjan Marion (damarion) @ 2015-06-10 21:47 UTC (permalink / raw)
  To: dev


Hi,

We noticed a 7% performance improvement by simply moving the rte_mbuf.next field to the 1st cache line.

Currently, it falls under /* second cache line - fields only used in slow path or on TX */,
but it is actually used in several places in the RX fast path (e.g. i40e_rx_alloc_bufs() sets that field to NULL).
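
For reference, a trimmed-down sketch of the split being discussed. This is not the exact
struct rte_mbuf definition (many fields are omitted and the layout is only illustrative),
just the first/second cache line shape:

#include <stdint.h>

struct rte_mempool;                   /* opaque here */

struct mbuf_sketch {
	/* first cache line: fields meant for the RX fast path */
	void *buf_addr;               /* segment buffer, virtual address */
	uint64_t buf_physaddr;        /* segment buffer, physical address */
	uint16_t data_off;
	uint16_t refcnt;
	uint8_t nb_segs;              /* number of segments */
	uint8_t port;
	uint64_t ol_flags;            /* offload flags */
	uint32_t packet_type;
	uint32_t pkt_len;
	uint16_t data_len;
	uint16_t vlan_tci;
	uint64_t hash;                /* RSS hash / FDIR / ... */
	uint32_t seqn;

	/* second cache line: "slow path or TX only" fields */
	struct mbuf_sketch *next;     /* next segment; written by some RX
	                               * paths, e.g. i40e_rx_alloc_bufs() */
	struct rte_mempool *pool;
	uint64_t tx_offload;
} __attribute__((__aligned__(64)));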

Is there anything we can do here (stop using next field, or move it to 1st cache line)?

Thanks,

Damjan


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-10 21:47 [dpdk-dev] rte_mbuf.next in 2nd cacheline Damjan Marion (damarion)
@ 2015-06-15 13:20 ` Olivier MATZ
  2015-06-15 13:44   ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Olivier MATZ @ 2015-06-15 13:20 UTC (permalink / raw)
  To: Damjan Marion (damarion), dev

Hi Damjan,

On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> 
> Hi,
> 
> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> 
> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> 
> Is there anything we can do here (stop using next field, or move it to 1st cache line)?

Agree, this is also something I noticed, see:
http://dpdk.org/ml/archives/dev/2015-February/014400.html

I did not have the time to do performance testing, but it's something
I'd like to do as soon as I can. I don't see any obvious reason not to
do it.

It seems we currently have just enough room to do it (8 bytes remain
in the first cache line when compiled for 64-bit).


Regards,
Olivier


> 
> Thanks,
> 
> Damjan
> 
> 
> 
> 


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 13:20 ` Olivier MATZ
@ 2015-06-15 13:44   ` Bruce Richardson
  2015-06-15 13:54     ` Ananyev, Konstantin
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 13:44 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> Hi Damjan,
> 
> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> > 
> > Hi,
> > 
> > We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> > 
> > Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> > but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> > 
> > Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> 
> Agree, this is also something I noticed, see:
> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> 
> I did not have the time to do performance testing, but it's something
> I'd like to do as soon as I can. I don't see any obvious reason not to
> do it.
> 
> It seems we currently just have enough room to do it (8 bytes are
> remaining in the first cache line when compiled in 64 bits).

This, to me, is the obvious reason not to do it! It prevents us from taking
any other RX fast-path offload fields into the mbuf.

That being said, I can see why we might want to look to move it - but from the
work done in the ixgbe driver, I'd be hopeful we can get as much performance with
it on the second cache line for most cases, through judicious use of prefetching,
or otherwise.
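
As a rough illustration of the prefetching idea (a minimal sketch, not taken from any
driver; the prefetch distance of one packet is arbitrary and rte_eth_rx_burst() is just
used as the example burst API):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_memory.h>
#include <rte_prefetch.h>

/* Prefetch the second cache line of the next mbuf in the burst while
 * processing the current one, so that a later access to m->next (or
 * any other cache-line-1 field) does not stall. */
static inline void
process_burst_sketch(uint16_t port, uint16_t queue,
		     struct rte_mbuf **pkts, uint16_t max_pkts)
{
	uint16_t i, nb_rx;

	nb_rx = rte_eth_rx_burst(port, queue, pkts, max_pkts);

	for (i = 0; i < nb_rx; i++) {
		if (i + 1 < nb_rx)
			rte_prefetch0((char *)pkts[i + 1] + RTE_CACHE_LINE_SIZE);

		/* ... per-packet work that may eventually follow m->next ... */
		rte_pktmbuf_free(pkts[i]);
	}
}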

It took a lot of work and investigation to get free space in the mbuf - especially
in cache line 0, and I'd like to avoid just filling the cache line up again as
long as we possibly can!

/Bruce

> 
> 
> Regards,
> Olivier
> 
> 
> > 
> > Thanks,
> > 
> > Damjan
> > 
> > 
> > 
> > 


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 13:44   ` Bruce Richardson
@ 2015-06-15 13:54     ` Ananyev, Konstantin
  2015-06-15 14:05       ` Olivier MATZ
  0 siblings, 1 reply; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-15 13:54 UTC (permalink / raw)
  To: Richardson, Bruce, Olivier MATZ; +Cc: dev, Damjan Marion (damarion)



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Monday, June 15, 2015 2:44 PM
> To: Olivier MATZ
> Cc: dev@dpdk.org; Damjan Marion (damarion)
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> > Hi Damjan,
> >
> > On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> > >
> > > Hi,
> > >
> > > We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> > >
> > > Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> > > but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> > >
> > > Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> >
> > Agree, this is also something I noticed, see:
> > http://dpdk.org/ml/archives/dev/2015-February/014400.html
> >
> > I did not have the time to do performance testing, but it's something
> > I'd like to do as soon as I can. I don't see any obvious reason not to
> > do it.
> >
> > It seems we currently just have enough room to do it (8 bytes are
> > remaining in the first cache line when compiled in 64 bits).
> 
> This, to me, is the obvious reason not to do it! It prevents us from taking in
> any other offload fields in the RX fast-path into the mbuf.
> 
> That being said, I can see why we might want to look to move it - but from the
> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> it on the second cache line for most cases, through judicious use of prefetching,
> or otherwise.
> 
> It took a lot of work and investigation to get free space in the mbuf - especially
> in cache line 0, and I'd like to avoid just filling the cache line up again as
> long as we possibly can!

Yep, agree with Bruce here.
Plus, with packet_type becoming 4B and vlan_tci_outer being added,
we just don't have 8 free bytes left in the first cache line any more.
Konstantin

> 
> /Bruce
> 
> >
> >
> > Regards,
> > Olivier
> >
> >
> > >
> > > Thanks,
> > >
> > > Damjan
> > >
> > >
> > >
> > >


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 13:54     ` Ananyev, Konstantin
@ 2015-06-15 14:05       ` Olivier MATZ
  2015-06-15 14:12         ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Olivier MATZ @ 2015-06-15 14:05 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: dev, Damjan Marion (damarion)

Hi,

On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>> Sent: Monday, June 15, 2015 2:44 PM
>> To: Olivier MATZ
>> Cc: dev@dpdk.org; Damjan Marion (damarion)
>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>>
>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
>>> Hi Damjan,
>>>
>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
>>>>
>>>> Hi,
>>>>
>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
>>>>
>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
>>>>
>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
>>>
>>> Agree, this is also something I noticed, see:
>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
>>>
>>> I did not have the time to do performance testing, but it's something
>>> I'd like to do as soon as I can. I don't see any obvious reason not to
>>> do it.
>>>
>>> It seems we currently just have enough room to do it (8 bytes are
>>> remaining in the first cache line when compiled in 64 bits).
>>
>> This, to me, is the obvious reason not to do it! It prevents us from taking in
>> any other offload fields in the RX fast-path into the mbuf.
>>
>> That being said, I can see why we might want to look to move it - but from the
>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
>> it on the second cache line for most cases, through judicious use of prefetching,
>> or otherwise.
>>
>> It took a lot of work and investigation to get free space in the mbuf - especially
>> in cache line 0, and I'd like to avoid just filling the cache line up again as
>> long as we possibly can!
> 
> Yep, agree with Bruce here.
> Plus, with packet_type going to be 4B and vlan_tci_outer,
> we just don't have 8 free bytes at the first cache line any more.

I don't understand why m->next would not be a better candidate than
RX offload fields to be in the first cache line. For instance, m->next
is mandatory and must be initialized when allocating a mbuf (compare
with m->seqn, for instance, which is also in the first cache line).
So if we want to make some room in the first cache line, I think we can.

To me, the only reason for not doing it now is because we don't
have a full performance test report (several use-cases, drivers, ...)
that shows it's better.

Olivier


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 14:05       ` Olivier MATZ
@ 2015-06-15 14:12         ` Bruce Richardson
  2015-06-15 14:30           ` Olivier MATZ
  2015-06-17 13:55           ` Damjan Marion (damarion)
  0 siblings, 2 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 14:12 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> Hi,
> 
> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> > 
> > 
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >> Sent: Monday, June 15, 2015 2:44 PM
> >> To: Olivier MATZ
> >> Cc: dev@dpdk.org; Damjan Marion (damarion)
> >> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>
> >> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> >>> Hi Damjan,
> >>>
> >>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> >>>>
> >>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> >>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> >>>>
> >>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> >>>
> >>> Agree, this is also something I noticed, see:
> >>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> >>>
> >>> I did not have the time to do performance testing, but it's something
> >>> I'd like to do as soon as I can. I don't see any obvious reason not to
> >>> do it.
> >>>
> >>> It seems we currently just have enough room to do it (8 bytes are
> >>> remaining in the first cache line when compiled in 64 bits).
> >>
> >> This, to me, is the obvious reason not to do it! It prevents us from taking in
> >> any other offload fields in the RX fast-path into the mbuf.
> >>
> >> That being said, I can see why we might want to look to move it - but from the
> >> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> >> it on the second cache line for most cases, through judicious use of prefetching,
> >> or otherwise.
> >>
> >> It took a lot of work and investigation to get free space in the mbuf - especially
> >> in cache line 0, and I'd like to avoid just filling the cache line up again as
> >> long as we possibly can!
> > 
> > Yep, agree with Bruce here.
> > Plus, with packet_type going to be 4B and vlan_tci_outer,
> > we just don't have 8 free bytes at the first cache line any more.
> 
> I don't understand why m->next would not be a better candidate than
> rx offload fields to be in the first cache line. For instance, m->next
> is mandatory and must be initialized when allocating a mbuf (to be
> compared with m->seqn for instance, which is also in the first cache
> line). So if we want to do some room in the first cache line, I
> think we can.
> 
> To me, the only reason for not doing it now is because we don't
> have a full performance test report (several use-cases, drivers, ...)
> that shows it's better.
>
Because the "next" field is not mandatory to be set on initialization. It can
instead be set only when needed, and cleared on free if it is used.

The next pointers always start out as NULL when the mbuf pool is created. The
only time it is set to non-NULL is when we have chained mbufs. If we never have
any chained mbufs, we never need to touch the next field, or even read it - since
we have the num-segments count in the first cache line. If we do have a multi-segment
mbuf, it's likely to be a big packet, so we have more processing time available
and we can then take the hit of setting the next pointer. Whenever we go to
free that mbuf for that packet, the code to do the freeing obviously needs to
read the next pointer so as to free all the buffers in the chain, and so it can
also reset the next pointer to NULL when doing so.

In this way, we can ensure that the next pointer on cache line 1 is not a problem
in our fast path.
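
As a minimal sketch of that scheme (loosely modelled on rte_pktmbuf_free() /
rte_pktmbuf_free_seg(), with refcount handling and indirect-mbuf details left out):

#include <rte_branch_prediction.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Free a (possibly chained) packet while restoring the pool invariant
 * that next == NULL. Refcounts are ignored here for brevity. */
static inline void
free_sketch(struct rte_mbuf *m)
{
	if (likely(m->nb_segs == 1)) {
		/* single segment: next is known to be NULL already (pool
		 * invariant), so there is no need to read or write it */
		rte_mempool_put(m->pool, m);
		return;
	}

	/* multi-segment: we have to walk the chain anyway, so restore
	 * next == NULL on each segment while we are reading it */
	while (m != NULL) {
		struct rte_mbuf *next = m->next;

		m->next = NULL;
		rte_mempool_put(m->pool, m);
		m = next;
	}
}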

/Bruce


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 14:12         ` Bruce Richardson
@ 2015-06-15 14:30           ` Olivier MATZ
  2015-06-15 14:46             ` Bruce Richardson
  2015-06-15 14:52             ` Ananyev, Konstantin
  2015-06-17 13:55           ` Damjan Marion (damarion)
  1 sibling, 2 replies; 26+ messages in thread
From: Olivier MATZ @ 2015-06-15 14:30 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Damjan Marion (damarion)



On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
>> Hi,
>>
>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>>>> Sent: Monday, June 15, 2015 2:44 PM
>>>> To: Olivier MATZ
>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>>>>
>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
>>>>> Hi Damjan,
>>>>>
>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
>>>>>>
>>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
>>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
>>>>>>
>>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
>>>>>
>>>>> Agree, this is also something I noticed, see:
>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
>>>>>
>>>>> I did not have the time to do performance testing, but it's something
>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
>>>>> do it.
>>>>>
>>>>> It seems we currently just have enough room to do it (8 bytes are
>>>>> remaining in the first cache line when compiled in 64 bits).
>>>>
>>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
>>>> any other offload fields in the RX fast-path into the mbuf.
>>>>
>>>> That being said, I can see why we might want to look to move it - but from the
>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
>>>> it on the second cache line for most cases, through judicious use of prefetching,
>>>> or otherwise.
>>>>
>>>> It took a lot of work and investigation to get free space in the mbuf - especially
>>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
>>>> long as we possibly can!
>>>
>>> Yep, agree with Bruce here.
>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
>>> we just don't have 8 free bytes at the first cache line any more.
>>
>> I don't understand why m->next would not be a better candidate than
>> rx offload fields to be in the first cache line. For instance, m->next
>> is mandatory and must be initialized when allocating a mbuf (to be
>> compared with m->seqn for instance, which is also in the first cache
>> line). So if we want to do some room in the first cache line, I
>> think we can.
>>
>> To me, the only reason for not doing it now is because we don't
>> have a full performance test report (several use-cases, drivers, ...)
>> that shows it's better.
>>
> Because the "next" field is not mandatory to be set on initialization. It can
> instead be set only when needed, and cleared on free if it is used.
> 
> The next pointers always start out as NULL when the mbuf pool is created. The
> only time it is set to non-NULL is when we have chained mbufs. If we never have
> any chained mbufs, we never need to touch the next field, or even read it - since
> we have the num-segments count in the first cache line. If we do have a multi-segment
> mbuf, it's likely to be a big packet, so we have more processing time available
> and we can then take the hit of setting the next pointer. Whenever we go to
> free that mbuf for that packet, the code to do the freeing obviously needs to
> read the next pointer so as to free all the buffers in the chain, and so it can
> also reset the next pointer to NULL when doing so.
> 
> In this way, we can ensure that the next pointer on cache line 1 is not a problem
> in our fast path.

This is a good idea, but looking at the drivers, it seems that today
they all set m->next to NULL in the rx function. What you are suggesting
is to remove all of them, and document somewhere that all mbufs in a
pool are supposed to have their m->next set to NULL, correct?
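
For illustration, the kind of per-mbuf store this would remove from an RX alloc/refill
loop (a simplified sketch, not the actual i40e/ixgbe code):

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Simplified RX refill loop: under the proposed convention the
 * "mb->next = NULL" store becomes redundant, because every mbuf coming
 * out of the pool is already guaranteed to have next == NULL. */
static inline int
rx_refill_sketch(struct rte_mempool *mp, struct rte_mbuf **ring, unsigned int n)
{
	unsigned int i;

	if (rte_mempool_get_bulk(mp, (void **)ring, n) != 0)
		return -1;

	for (i = 0; i < n; i++) {
		struct rte_mbuf *mb = ring[i];

		mb->data_off = RTE_PKTMBUF_HEADROOM;
		mb->nb_segs = 1;
		mb->next = NULL;    /* candidate for removal under the new rule */
	}
	return 0;
}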

I think what you are describing could also apply to the reference counter
(set to 1 by default), right?


Olivier


> 
> /Bruce
> 


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 14:30           ` Olivier MATZ
@ 2015-06-15 14:46             ` Bruce Richardson
  2015-06-15 14:52             ` Ananyev, Konstantin
  1 sibling, 0 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 14:46 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 04:30:56PM +0200, Olivier MATZ wrote:
> 
> 
> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> > On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> >> Hi,
> >>
> >> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >>>> Sent: Monday, June 15, 2015 2:44 PM
> >>>> To: Olivier MATZ
> >>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> >>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>>>
> >>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> >>>>> Hi Damjan,
> >>>>>
> >>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> >>>>>>
> >>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> >>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> >>>>>>
> >>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> >>>>>
> >>>>> Agree, this is also something I noticed, see:
> >>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> >>>>>
> >>>>> I did not have the time to do performance testing, but it's something
> >>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> >>>>> do it.
> >>>>>
> >>>>> It seems we currently just have enough room to do it (8 bytes are
> >>>>> remaining in the first cache line when compiled in 64 bits).
> >>>>
> >>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
> >>>> any other offload fields in the RX fast-path into the mbuf.
> >>>>
> >>>> That being said, I can see why we might want to look to move it - but from the
> >>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> >>>> it on the second cache line for most cases, through judicious use of prefetching,
> >>>> or otherwise.
> >>>>
> >>>> It took a lot of work and investigation to get free space in the mbuf - especially
> >>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
> >>>> long as we possibly can!
> >>>
> >>> Yep, agree with Bruce here.
> >>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> >>> we just don't have 8 free bytes at the first cache line any more.
> >>
> >> I don't understand why m->next would not be a better candidate than
> >> rx offload fields to be in the first cache line. For instance, m->next
> >> is mandatory and must be initialized when allocating a mbuf (to be
> >> compared with m->seqn for instance, which is also in the first cache
> >> line). So if we want to do some room in the first cache line, I
> >> think we can.
> >>
> >> To me, the only reason for not doing it now is because we don't
> >> have a full performance test report (several use-cases, drivers, ...)
> >> that shows it's better.
> >>
> > Because the "next" field is not mandatory to be set on initialization. It can
> > instead be set only when needed, and cleared on free if it is used.
> > 
> > The next pointers always start out as NULL when the mbuf pool is created. The
> > only time it is set to non-NULL is when we have chained mbufs. If we never have
> > any chained mbufs, we never need to touch the next field, or even read it - since
> > we have the num-segments count in the first cache line. If we do have a multi-segment
> > mbuf, it's likely to be a big packet, so we have more processing time available
> > and we can then take the hit of setting the next pointer. Whenever we go to
> > free that mbuf for that packet, the code to do the freeing obviously needs to
> > read the next pointer so as to free all the buffers in the chain, and so it can
> > also reset the next pointer to NULL when doing so.
> > 
> > In this way, we can ensure that the next pointer on cache line 1 is not a problem
> > in our fast path.
> 
> This is a good idea, but looking at the drivers, it seems that today
> they all set m->next to NULL in the rx function. What you are suggesting
> is to remove all of them, and document somewhere that all mbufs in a
> pool are supposed to have their m->next set to NULL, correct?

Yes. However, this restriction should not be visible to the applications, as
it's generally necessary to call pktmbuf_free (or a function based off it) to
free an mbuf - whether in user code or a library.

> 
> I think what you are describing could also apply to reference counter
> (set to 1 by default), right?

Yes, except that on free the refcnt is actually set to zero, as we need to
do so because of the atomic nature of a refcnt, e.g. when it's 2 and two
threads decrement it simultaneously. In a case like this, it doesn't matter
whether it's set to 1 again on free or on allocation, as it's on cache line zero.

/Bruce

> 
> 
> Olivier
> 
> 
> > 
> > /Bruce
> > 


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 14:30           ` Olivier MATZ
  2015-06-15 14:46             ` Bruce Richardson
@ 2015-06-15 14:52             ` Ananyev, Konstantin
  2015-06-15 15:19               ` Olivier MATZ
  1 sibling, 1 reply; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-15 14:52 UTC (permalink / raw)
  To: Olivier MATZ, Richardson, Bruce; +Cc: dev, Damjan Marion (damarion)

Hi Olivier,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, June 15, 2015 3:31 PM
> To: Richardson, Bruce
> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> 
> 
> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> > On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> >> Hi,
> >>
> >> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >>>> Sent: Monday, June 15, 2015 2:44 PM
> >>>> To: Olivier MATZ
> >>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> >>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>>>
> >>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> >>>>> Hi Damjan,
> >>>>>
> >>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> >>>>>>
> >>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> >>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> >>>>>>
> >>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> >>>>>
> >>>>> Agree, this is also something I noticed, see:
> >>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> >>>>>
> >>>>> I did not have the time to do performance testing, but it's something
> >>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> >>>>> do it.
> >>>>>
> >>>>> It seems we currently just have enough room to do it (8 bytes are
> >>>>> remaining in the first cache line when compiled in 64 bits).
> >>>>
> >>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
> >>>> any other offload fields in the RX fast-path into the mbuf.
> >>>>
> >>>> That being said, I can see why we might want to look to move it - but from the
> >>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> >>>> it on the second cache line for most cases, through judicious use of prefetching,
> >>>> or otherwise.
> >>>>
> >>>> It took a lot of work and investigation to get free space in the mbuf - especially
> >>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
> >>>> long as we possibly can!
> >>>
> >>> Yep, agree with Bruce here.
> >>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> >>> we just don't have 8 free bytes at the first cache line any more.
> >>
> >> I don't understand why m->next would not be a better candidate than
> >> rx offload fields to be in the first cache line. For instance, m->next
> >> is mandatory and must be initialized when allocating a mbuf (to be
> >> compared with m->seqn for instance, which is also in the first cache
> >> line). So if we want to do some room in the first cache line, I
> >> think we can.
> >>
> >> To me, the only reason for not doing it now is because we don't
> >> have a full performance test report (several use-cases, drivers, ...)
> >> that shows it's better.
> >>
> > Because the "next" field is not mandatory to be set on initialization. It can
> > instead be set only when needed, and cleared on free if it is used.
> >
> > The next pointers always start out as NULL when the mbuf pool is created. The
> > only time it is set to non-NULL is when we have chained mbufs. If we never have
> > any chained mbufs, we never need to touch the next field, or even read it - since
> > we have the num-segments count in the first cache line. If we do have a multi-segment
> > mbuf, it's likely to be a big packet, so we have more processing time available
> > and we can then take the hit of setting the next pointer. Whenever we go to
> > free that mbuf for that packet, the code to do the freeing obviously needs to
> > read the next pointer so as to free all the buffers in the chain, and so it can
> > also reset the next pointer to NULL when doing so.
> >
> > In this way, we can ensure that the next pointer on cache line 1 is not a problem
> > in our fast path.
> 
> This is a good idea, but looking at the drivers, it seems that today
> they all set m->next to NULL in the rx function. What you are suggesting
> is to remove all of them, and document somewhere that all mbufs in a
> pool are supposed to have their m->next set to NULL, correct?
> 
> I think what you are describing could also apply to reference counter
> (set to 1 by default), right?

We could probably reset next to NULL in __rte_pktmbuf_prefree_seg(),
at the same time as we reset refcnt to 0.
Is that what you're suggesting?
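
Something along these lines (a sketch of the idea only, not an actual patch; based
loosely on the current __rte_pktmbuf_prefree_seg() logic, sanity checks and the
indirect-mbuf detach path omitted):

#include <rte_branch_prediction.h>
#include <rte_mbuf.h>

/* Sketch: clear next in the same helper that detects the last
 * reference, so individual drivers no longer have to do it. */
static inline struct rte_mbuf *
prefree_seg_sketch(struct rte_mbuf *m)
{
	if (likely(rte_mbuf_refcnt_update(m, -1) == 0)) {
		/* last reference: refcnt is now 0, as described above;
		 * it is set back to 1 at allocation time */
		if (m->next != NULL)
			m->next = NULL;  /* restore the pool invariant here */
		return m;
	}
	return NULL;                     /* someone else still owns it */
}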

Konstantin

> 
> 
> Olivier
> 
> 
> >
> > /Bruce
> >


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 14:52             ` Ananyev, Konstantin
@ 2015-06-15 15:19               ` Olivier MATZ
  2015-06-15 15:23                 ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Olivier MATZ @ 2015-06-15 15:19 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: dev, Damjan Marion (damarion)



On 06/15/2015 04:52 PM, Ananyev, Konstantin wrote:
> Hi Olivier,
> 
>> -----Original Message-----
>> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
>> Sent: Monday, June 15, 2015 3:31 PM
>> To: Richardson, Bruce
>> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>>
>>
>>
>> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
>>> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
>>>> Hi,
>>>>
>>>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>>>>>> Sent: Monday, June 15, 2015 2:44 PM
>>>>>> To: Olivier MATZ
>>>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
>>>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>>>>>>
>>>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
>>>>>>> Hi Damjan,
>>>>>>>
>>>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
>>>>>>>>
>>>>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
>>>>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
>>>>>>>>
>>>>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
>>>>>>>
>>>>>>> Agree, this is also something I noticed, see:
>>>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
>>>>>>>
>>>>>>> I did not have the time to do performance testing, but it's something
>>>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
>>>>>>> do it.
>>>>>>>
>>>>>>> It seems we currently just have enough room to do it (8 bytes are
>>>>>>> remaining in the first cache line when compiled in 64 bits).
>>>>>>
>>>>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
>>>>>> any other offload fields in the RX fast-path into the mbuf.
>>>>>>
>>>>>> That being said, I can see why we might want to look to move it - but from the
>>>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
>>>>>> it on the second cache line for most cases, through judicious use of prefetching,
>>>>>> or otherwise.
>>>>>>
>>>>>> It took a lot of work and investigation to get free space in the mbuf - especially
>>>>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
>>>>>> long as we possibly can!
>>>>>
>>>>> Yep, agree with Bruce here.
>>>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
>>>>> we just don't have 8 free bytes at the first cache line any more.
>>>>
>>>> I don't understand why m->next would not be a better candidate than
>>>> rx offload fields to be in the first cache line. For instance, m->next
>>>> is mandatory and must be initialized when allocating a mbuf (to be
>>>> compared with m->seqn for instance, which is also in the first cache
>>>> line). So if we want to do some room in the first cache line, I
>>>> think we can.
>>>>
>>>> To me, the only reason for not doing it now is because we don't
>>>> have a full performance test report (several use-cases, drivers, ...)
>>>> that shows it's better.
>>>>
>>> Because the "next" field is not mandatory to be set on initialization. It can
>>> instead be set only when needed, and cleared on free if it is used.
>>>
>>> The next pointers always start out as NULL when the mbuf pool is created. The
>>> only time it is set to non-NULL is when we have chained mbufs. If we never have
>>> any chained mbufs, we never need to touch the next field, or even read it - since
>>> we have the num-segments count in the first cache line. If we do have a multi-segment
>>> mbuf, it's likely to be a big packet, so we have more processing time available
>>> and we can then take the hit of setting the next pointer. Whenever we go to
>>> free that mbuf for that packet, the code to do the freeing obviously needs to
>>> read the next pointer so as to free all the buffers in the chain, and so it can
>>> also reset the next pointer to NULL when doing so.
>>>
>>> In this way, we can ensure that the next pointer on cache line 1 is not a problem
>>> in our fast path.
>>
>> This is a good idea, but looking at the drivers, it seems that today
>> they all set m->next to NULL in the rx function. What you are suggesting
>> is to remove all of them, and document somewhere that all mbufs in a
>> pool are supposed to have their m->next set to NULL, correct?
>>
>> I think what you are describing could also apply to reference counter
>> (set to 1 by default), right?
> 
> We probably can reset next to NULL at __rte_pktmbuf_prefree_seg(),
> at the same time we do reset refcnt to 0.
> Is that what you suggesting?  

Yes, I can give it a try.



> 
> Konstantin
> 
>>
>>
>> Olivier
>>
>>
>>>
>>> /Bruce
>>>


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 15:19               ` Olivier MATZ
@ 2015-06-15 15:23                 ` Bruce Richardson
  2015-06-15 15:28                   ` Ananyev, Konstantin
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 15:23 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 05:19:27PM +0200, Olivier MATZ wrote:
> 
> 
> On 06/15/2015 04:52 PM, Ananyev, Konstantin wrote:
> > Hi Olivier,
> > 
> >> -----Original Message-----
> >> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> >> Sent: Monday, June 15, 2015 3:31 PM
> >> To: Richardson, Bruce
> >> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> >> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>
> >>
> >>
> >> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> >>> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> >>>> Hi,
> >>>>
> >>>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >>>>>> Sent: Monday, June 15, 2015 2:44 PM
> >>>>>> To: Olivier MATZ
> >>>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> >>>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>>>>>
> >>>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> >>>>>>> Hi Damjan,
> >>>>>>>
> >>>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> >>>>>>>>
> >>>>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> >>>>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> >>>>>>>>
> >>>>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> >>>>>>>
> >>>>>>> Agree, this is also something I noticed, see:
> >>>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> >>>>>>>
> >>>>>>> I did not have the time to do performance testing, but it's something
> >>>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> >>>>>>> do it.
> >>>>>>>
> >>>>>>> It seems we currently just have enough room to do it (8 bytes are
> >>>>>>> remaining in the first cache line when compiled in 64 bits).
> >>>>>>
> >>>>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
> >>>>>> any other offload fields in the RX fast-path into the mbuf.
> >>>>>>
> >>>>>> That being said, I can see why we might want to look to move it - but from the
> >>>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> >>>>>> it on the second cache line for most cases, through judicious use of prefetching,
> >>>>>> or otherwise.
> >>>>>>
> >>>>>> It took a lot of work and investigation to get free space in the mbuf - especially
> >>>>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
> >>>>>> long as we possibly can!
> >>>>>
> >>>>> Yep, agree with Bruce here.
> >>>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> >>>>> we just don't have 8 free bytes at the first cache line any more.
> >>>>
> >>>> I don't understand why m->next would not be a better candidate than
> >>>> rx offload fields to be in the first cache line. For instance, m->next
> >>>> is mandatory and must be initialized when allocating a mbuf (to be
> >>>> compared with m->seqn for instance, which is also in the first cache
> >>>> line). So if we want to do some room in the first cache line, I
> >>>> think we can.
> >>>>
> >>>> To me, the only reason for not doing it now is because we don't
> >>>> have a full performance test report (several use-cases, drivers, ...)
> >>>> that shows it's better.
> >>>>
> >>> Because the "next" field is not mandatory to be set on initialization. It can
> >>> instead be set only when needed, and cleared on free if it is used.
> >>>
> >>> The next pointers always start out as NULL when the mbuf pool is created. The
> >>> only time it is set to non-NULL is when we have chained mbufs. If we never have
> >>> any chained mbufs, we never need to touch the next field, or even read it - since
> >>> we have the num-segments count in the first cache line. If we do have a multi-segment
> >>> mbuf, it's likely to be a big packet, so we have more processing time available
> >>> and we can then take the hit of setting the next pointer. Whenever we go to
> >>> free that mbuf for that packet, the code to do the freeing obviously needs to
> >>> read the next pointer so as to free all the buffers in the chain, and so it can
> >>> also reset the next pointer to NULL when doing so.
> >>>
> >>> In this way, we can ensure that the next pointer on cache line 1 is not a problem
> >>> in our fast path.
> >>
> >> This is a good idea, but looking at the drivers, it seems that today
> >> they all set m->next to NULL in the rx function. What you are suggesting
> >> is to remove all of them, and document somewhere that all mbufs in a
> >> pool are supposed to have their m->next set to NULL, correct?
> >>
> >> I think what you are describing could also apply to reference counter
> >> (set to 1 by default), right?
> > 
> > We probably can reset next to NULL at __rte_pktmbuf_prefree_seg(),
> > at the same time we do reset refcnt to 0.
> > Is that what you suggesting?  
> 
> Yes, I can give it a try.
> 
>
Why would we need to change that function? Doesn't the main free_seg function
(which is called from the rte_pktmbuf_free() function) already set the next
pointer to NULL? Is there some edge case in the code now where we are missing
setting the next pointer to NULL on free?

Also, any code that is not using the regular free functions, i.e. mbuf_free or
free_seg - anything not starting with "_" :-) - is itself responsible for
ensuring that it frees things correctly. If it doesn't set next to NULL when it
needs to - it needs to be fixed there, rather than changing the prefree_seg
function.

/Bruce

> 
> > 
> > Konstantin
> > 
> >>
> >>
> >> Olivier
> >>
> >>
> >>>
> >>> /Bruce
> >>>


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 15:23                 ` Bruce Richardson
@ 2015-06-15 15:28                   ` Ananyev, Konstantin
  2015-06-15 15:39                     ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-15 15:28 UTC (permalink / raw)
  To: Richardson, Bruce, Olivier MATZ; +Cc: dev, Damjan Marion (damarion)



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, June 15, 2015 4:24 PM
> To: Olivier MATZ
> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> On Mon, Jun 15, 2015 at 05:19:27PM +0200, Olivier MATZ wrote:
> >
> >
> > On 06/15/2015 04:52 PM, Ananyev, Konstantin wrote:
> > > Hi Olivier,
> > >
> > >> -----Original Message-----
> > >> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> > >> Sent: Monday, June 15, 2015 3:31 PM
> > >> To: Richardson, Bruce
> > >> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> > >> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > >>
> > >>
> > >>
> > >> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> > >>> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > >>>>>> Sent: Monday, June 15, 2015 2:44 PM
> > >>>>>> To: Olivier MATZ
> > >>>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> > >>>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > >>>>>>
> > >>>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> > >>>>>>> Hi Damjan,
> > >>>>>>>
> > >>>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> > >>>>>>>>
> > >>>>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> > >>>>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> > >>>>>>>>
> > >>>>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> > >>>>>>>
> > >>>>>>> Agree, this is also something I noticed, see:
> > >>>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> > >>>>>>>
> > >>>>>>> I did not have the time to do performance testing, but it's something
> > >>>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> > >>>>>>> do it.
> > >>>>>>>
> > >>>>>>> It seems we currently just have enough room to do it (8 bytes are
> > >>>>>>> remaining in the first cache line when compiled in 64 bits).
> > >>>>>>
> > >>>>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
> > >>>>>> any other offload fields in the RX fast-path into the mbuf.
> > >>>>>>
> > >>>>>> That being said, I can see why we might want to look to move it - but from the
> > >>>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> > >>>>>> it on the second cache line for most cases, through judicious use of prefetching,
> > >>>>>> or otherwise.
> > >>>>>>
> > >>>>>> It took a lot of work and investigation to get free space in the mbuf - especially
> > >>>>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
> > >>>>>> long as we possibly can!
> > >>>>>
> > >>>>> Yep, agree with Bruce here.
> > >>>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> > >>>>> we just don't have 8 free bytes at the first cache line any more.
> > >>>>
> > >>>> I don't understand why m->next would not be a better candidate than
> > >>>> rx offload fields to be in the first cache line. For instance, m->next
> > >>>> is mandatory and must be initialized when allocating a mbuf (to be
> > >>>> compared with m->seqn for instance, which is also in the first cache
> > >>>> line). So if we want to do some room in the first cache line, I
> > >>>> think we can.
> > >>>>
> > >>>> To me, the only reason for not doing it now is because we don't
> > >>>> have a full performance test report (several use-cases, drivers, ...)
> > >>>> that shows it's better.
> > >>>>
> > >>> Because the "next" field is not mandatory to be set on initialization. It can
> > >>> instead be set only when needed, and cleared on free if it is used.
> > >>>
> > >>> The next pointers always start out as NULL when the mbuf pool is created. The
> > >>> only time it is set to non-NULL is when we have chained mbufs. If we never have
> > >>> any chained mbufs, we never need to touch the next field, or even read it - since
> > >>> we have the num-segments count in the first cache line. If we do have a multi-segment
> > >>> mbuf, it's likely to be a big packet, so we have more processing time available
> > >>> and we can then take the hit of setting the next pointer. Whenever we go to
> > >>> free that mbuf for that packet, the code to do the freeing obviously needs to
> > >>> read the next pointer so as to free all the buffers in the chain, and so it can
> > >>> also reset the next pointer to NULL when doing so.
> > >>>
> > >>> In this way, we can ensure that the next pointer on cache line 1 is not a problem
> > >>> in our fast path.
> > >>
> > >> This is a good idea, but looking at the drivers, it seems that today
> > >> they all set m->next to NULL in the rx function. What you are suggesting
> > >> is to remove all of them, and document somewhere that all mbufs in a
> > >> pool are supposed to have their m->next set to NULL, correct?
> > >>
> > >> I think what you are describing could also apply to reference counter
> > >> (set to 1 by default), right?
> > >
> > > We probably can reset next to NULL at __rte_pktmbuf_prefree_seg(),
> > > at the same time we do reset refcnt to 0.
> > > Is that what you suggesting?
> >
> > Yes, I can give it a try.
> >
> >
> Why would we need to change that function? The main free_seg function (which
> is called from rte_pktmbuf_free() function) already sets the next pointer to
> NULL? Is there some edge case in the code now where we are missing setting
> the next pointer to NULL on free?

ixgbe_tx_free_bufs() at drivers/net/ixgbe/ixgbe_rxtx_vec.c
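
Roughly this kind of loop (a simplified sketch, not the actual ixgbe code): completed
TX mbufs are handed straight back to the mempool without walking any chain, so nothing
on this path clears next unless the prefree helper does it.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Simplified vector-style TX free: each completed descriptor's mbuf is
 * released individually; the chain is never walked here, so a non-NULL
 * next pointer would survive back into the pool. */
static inline void
tx_free_bufs_sketch(struct rte_mbuf **txep, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		struct rte_mbuf *m = __rte_pktmbuf_prefree_seg(txep[i]);

		if (m != NULL)
			rte_mempool_put(m->pool, m);
	}
}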

> 
> Also, any code that is not using the regular free functions i.e. mbuf_free or
> free_seg - anything not starting with "-" :-) - is responsible itself for
> ensuring that it frees things correctly. If it doesn't set next to NULL when it
> needs to - it needs to be fixed there, rather than changing the prefree_seg
> function.

Why do you think it would be a problem?
It seems like a good idea to hide it inside __rte_pktmbuf_prefree_seg(),
so users don't have to set it to NULL manually each time.

Konstantin

> 
> /Bruce
> 
> >
> > >
> > > Konstantin
> > >
> > >>
> > >>
> > >> Olivier
> > >>
> > >>
> > >>>
> > >>> /Bruce
> > >>>


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 15:28                   ` Ananyev, Konstantin
@ 2015-06-15 15:39                     ` Bruce Richardson
  2015-06-15 15:59                       ` Ananyev, Konstantin
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 15:39 UTC (permalink / raw)
  To: Ananyev, Konstantin, #; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 04:28:55PM +0100, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Monday, June 15, 2015 4:24 PM
> > To: Olivier MATZ
> > Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > 
> > On Mon, Jun 15, 2015 at 05:19:27PM +0200, Olivier MATZ wrote:
> > >
> > >
> > > On 06/15/2015 04:52 PM, Ananyev, Konstantin wrote:
> > > > Hi Olivier,
> > > >
> > > >> -----Original Message-----
> > > >> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> > > >> Sent: Monday, June 15, 2015 3:31 PM
> > > >> To: Richardson, Bruce
> > > >> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> > > >> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > >>
> > > >>
> > > >>
> > > >> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> > > >>> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> > > >>>>>
> > > >>>>>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > >>>>>> Sent: Monday, June 15, 2015 2:44 PM
> > > >>>>>> To: Olivier MATZ
> > > >>>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> > > >>>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > >>>>>>
> > > >>>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> > > >>>>>>> Hi Damjan,
> > > >>>>>>>
> > > >>>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> > > >>>>>>>>
> > > >>>>>>>> Hi,
> > > >>>>>>>>
> > > >>>>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> > > >>>>>>>>
> > > >>>>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> > > >>>>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> > > >>>>>>>>
> > > >>>>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> > > >>>>>>>
> > > >>>>>>> Agree, this is also something I noticed, see:
> > > >>>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> > > >>>>>>>
> > > >>>>>>> I did not have the time to do performance testing, but it's something
> > > >>>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> > > >>>>>>> do it.
> > > >>>>>>>
> > > >>>>>>> It seems we currently just have enough room to do it (8 bytes are
> > > >>>>>>> remaining in the first cache line when compiled in 64 bits).
> > > >>>>>>
> > > >>>>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
> > > >>>>>> any other offload fields in the RX fast-path into the mbuf.
> > > >>>>>>
> > > >>>>>> That being said, I can see why we might want to look to move it - but from the
> > > >>>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> > > >>>>>> it on the second cache line for most cases, through judicious use of prefetching,
> > > >>>>>> or otherwise.
> > > >>>>>>
> > > >>>>>> It took a lot of work and investigation to get free space in the mbuf - especially
> > > >>>>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
> > > >>>>>> long as we possibly can!
> > > >>>>>
> > > >>>>> Yep, agree with Bruce here.
> > > >>>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> > > >>>>> we just don't have 8 free bytes at the first cache line any more.
> > > >>>>
> > > >>>> I don't understand why m->next would not be a better candidate than
> > > >>>> rx offload fields to be in the first cache line. For instance, m->next
> > > >>>> is mandatory and must be initialized when allocating a mbuf (to be
> > > >>>> compared with m->seqn for instance, which is also in the first cache
> > > >>>> line). So if we want to do some room in the first cache line, I
> > > >>>> think we can.
> > > >>>>
> > > >>>> To me, the only reason for not doing it now is because we don't
> > > >>>> have a full performance test report (several use-cases, drivers, ...)
> > > >>>> that shows it's better.
> > > >>>>
> > > >>> Because the "next" field is not mandatory to be set on initialization. It can
> > > >>> instead be set only when needed, and cleared on free if it is used.
> > > >>>
> > > >>> The next pointers always start out as NULL when the mbuf pool is created. The
> > > >>> only time it is set to non-NULL is when we have chained mbufs. If we never have
> > > >>> any chained mbufs, we never need to touch the next field, or even read it - since
> > > >>> we have the num-segments count in the first cache line. If we do have a multi-segment
> > > >>> mbuf, it's likely to be a big packet, so we have more processing time available
> > > >>> and we can then take the hit of setting the next pointer. Whenever we go to
> > > >>> free that mbuf for that packet, the code to do the freeing obviously needs to
> > > >>> read the next pointer so as to free all the buffers in the chain, and so it can
> > > >>> also reset the next pointer to NULL when doing so.
> > > >>>
> > > >>> In this way, we can ensure that the next pointer on cache line 1 is not a problem
> > > >>> in our fast path.
> > > >>
> > > >> This is a good idea, but looking at the drivers, it seems that today
> > > >> they all set m->next to NULL in the rx function. What you are suggesting
> > > >> is to remove all of them, and document somewhere that all mbufs in a
> > > >> pool are supposed to have their m->next set to NULL, correct?
> > > >>
> > > >> I think what you are describing could also apply to reference counter
> > > >> (set to 1 by default), right?
> > > >
> > > > We probably can reset next to NULL at __rte_pktmbuf_prefree_seg(),
> > > > at the same time we do reset refcnt to 0.
> > > > Is that what you suggesting?
> > >
> > > Yes, I can give it a try.
> > >
> > >
> > Why would we need to change that function? The main free_seg function (which
> > is called from rte_pktmbuf_free() function) already sets the next pointer to
> > NULL? Is there some edge case in the code now where we are missing setting
> > the next pointer to NULL on free?
> 
> ixgbe_tx_free_bufs() at drivers/net/ixgbe/ixgbe_rxtx_vec.c
> 

Not a gap - the vector TX functions cannot deal with scattered packets in
general, not just when freeing them :-)

> > 
> > Also, any code that is not using the regular free functions i.e. mbuf_free or
> > free_seg - anything not starting with "-" :-) - is responsible itself for
> > ensuring that it frees things correctly. If it doesn't set next to NULL when it
> > needs to - it needs to be fixed there, rather than changing the prefree_seg
> > function.
> 
> Why do think it would be a problem?
> It seems like a good idea, tohide it inside __rte_pktmbuf_prefree_seg(),
> So users don't have to set it to NULL manually each time.
> 

I'm just nervous about slowing things down. An extra instruction here applies
to every single packet - it's not just per burst. That's why I'd like to be
sure there is a definite problem before adding more instructions here.

/Bruce

> Konstantin
> 
> > 
> > /Bruce
> > 
> > >
> > > >
> > > > Konstantin
> > > >
> > > >>
> > > >>
> > > >> Olivier
> > > >>
> > > >>
> > > >>>
> > > >>> /Bruce
> > > >>>


* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 15:39                     ` Bruce Richardson
@ 2015-06-15 15:59                       ` Ananyev, Konstantin
  2015-06-15 16:02                         ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-15 15:59 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev, Damjan Marion (damarion)



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, June 15, 2015 4:40 PM
> To: Ananyev, Konstantin
> Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> On Mon, Jun 15, 2015 at 04:28:55PM +0100, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Monday, June 15, 2015 4:24 PM
> > > To: Olivier MATZ
> > > Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> > > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > >
> > > On Mon, Jun 15, 2015 at 05:19:27PM +0200, Olivier MATZ wrote:
> > > >
> > > >
> > > > On 06/15/2015 04:52 PM, Ananyev, Konstantin wrote:
> > > > > Hi Olivier,
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> > > > >> Sent: Monday, June 15, 2015 3:31 PM
> > > > >> To: Richardson, Bruce
> > > > >> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> > > > >> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> > > > >>> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> > > > >>>> Hi,
> > > > >>>>
> > > > >>>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>> -----Original Message-----
> > > > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > > >>>>>> Sent: Monday, June 15, 2015 2:44 PM
> > > > >>>>>> To: Olivier MATZ
> > > > >>>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> > > > >>>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > > >>>>>>
> > > > >>>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> > > > >>>>>>> Hi Damjan,
> > > > >>>>>>>
> > > > >>>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> Hi,
> > > > >>>>>>>>
> > > > >>>>>>>> We noticed 7% performance improvement by simply moving rte_mbuf.next field to the 1st cache line.
> > > > >>>>>>>>
> > > > >>>>>>>> Currently, it falls under /* second cache line - fields only used in slow path or on TX */
> > > > >>>>>>>> but it is actually used at several places in rx fast path. (e.g.: i40e_rx_alloc_bufs() is setting that field to NULL).
> > > > >>>>>>>>
> > > > >>>>>>>> Is there anything we can do here (stop using next field, or move it to 1st cache line)?
> > > > >>>>>>>
> > > > >>>>>>> Agree, this is also something I noticed, see:
> > > > >>>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> > > > >>>>>>>
> > > > >>>>>>> I did not have the time to do performance testing, but it's something
> > > > >>>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> > > > >>>>>>> do it.
> > > > >>>>>>>
> > > > >>>>>>> It seems we currently just have enough room to do it (8 bytes are
> > > > >>>>>>> remaining in the first cache line when compiled in 64 bits).
> > > > >>>>>>
> > > > >>>>>> This, to me, is the obvious reason not to do it! It prevents us from taking in
> > > > >>>>>> any other offload fields in the RX fast-path into the mbuf.
> > > > >>>>>>
> > > > >>>>>> That being said, I can see why we might want to look to move it - but from the
> > > > >>>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> > > > >>>>>> it on the second cache line for most cases, through judicious use of prefetching,
> > > > >>>>>> or otherwise.
> > > > >>>>>>
> > > > >>>>>> It took a lot of work and investigation to get free space in the mbuf - especially
> > > > >>>>>> in cache line 0, and I'd like to avoid just filling the cache line up again as
> > > > >>>>>> long as we possibly can!
> > > > >>>>>
> > > > >>>>> Yep, agree with Bruce here.
> > > > >>>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> > > > >>>>> we just don't have 8 free bytes at the first cache line any more.
> > > > >>>>
> > > > >>>> I don't understand why m->next would not be a better candidate than
> > > > >>>> rx offload fields to be in the first cache line. For instance, m->next
> > > > >>>> is mandatory and must be initialized when allocating a mbuf (to be
> > > > >>>> compared with m->seqn for instance, which is also in the first cache
> > > > >>>> line). So if we want to do some room in the first cache line, I
> > > > >>>> think we can.
> > > > >>>>
> > > > >>>> To me, the only reason for not doing it now is because we don't
> > > > >>>> have a full performance test report (several use-cases, drivers, ...)
> > > > >>>> that shows it's better.
> > > > >>>>
> > > > >>> Because the "next" field is not mandatory to be set on initialization. It can
> > > > >>> instead be set only when needed, and cleared on free if it is used.
> > > > >>>
> > > > >>> The next pointers always start out as NULL when the mbuf pool is created. The
> > > > >>> only time it is set to non-NULL is when we have chained mbufs. If we never have
> > > > >>> any chained mbufs, we never need to touch the next field, or even read it - since
> > > > >>> we have the num-segments count in the first cache line. If we do have a multi-segment
> > > > >>> mbuf, it's likely to be a big packet, so we have more processing time available
> > > > >>> and we can then take the hit of setting the next pointer. Whenever we go to
> > > > >>> free that mbuf for that packet, the code to do the freeing obviously needs to
> > > > >>> read the next pointer so as to free all the buffers in the chain, and so it can
> > > > >>> also reset the next pointer to NULL when doing so.
> > > > >>>
> > > > >>> In this way, we can ensure that the next pointer on cache line 1 is not a problem
> > > > >>> in our fast path.
> > > > >>
> > > > >> This is a good idea, but looking at the drivers, it seems that today
> > > > >> they all set m->next to NULL in the rx function. What you are suggesting
> > > > >> is to remove all of them, and document somewhere that all mbufs in a
> > > > >> pool are supposed to have their m->next set to NULL, correct?
> > > > >>
> > > > >> I think what you are describing could also apply to reference counter
> > > > >> (set to 1 by default), right?
> > > > >
> > > > > We probably can reset next to NULL at __rte_pktmbuf_prefree_seg(),
> > > > > at the same time we do reset refcnt to 0.
> > > > > Is that what you suggesting?
> > > >
> > > > Yes, I can give it a try.
> > > >
> > > >
> > > Why would we need to change that function? The main free_seg function (which
> > > is called from rte_pktmbuf_free() function) already sets the next pointer to
> > > NULL? Is there some edge case in the code now where we are missing setting
> > > the next pointer to NULL on free?
> >
> > ixgbe_tx_free_bufs() at drivers/net/ixgbe/ixgbe_rxtx_vec.c
> >
> 
> Not a gap - the vector TX functions cannot deal with scattered packets generally,
> not just freeing them :-)

True :)

> 
> > >
> > > Also, any code that is not using the regular free functions i.e. mbuf_free or
> > > free_seg - anything not starting with "-" :-) - is responsible itself for
> > > ensuring that it frees things correctly. If it doesn't set next to NULL when it
> > > needs to - it needs to be fixed there, rather than changing the prefree_seg
> > > function.
> >
> > Why do you think it would be a problem?
> > It seems like a good idea to hide it inside __rte_pktmbuf_prefree_seg(),
> > so users don't have to set it to NULL manually each time.
> >
> 
> I'm just nervous of slowing things down. Adding an extra instruction here applies
> to every single packet - it's not just per burst. That's why I'd like to be
> sure there is a definite problem before adding in more instructions here.


As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
For vector TX - yes, need to verify that it would not introduce a slowdown.
Konstantin

> 
> /Bruce
> 
> > Konstantin
> >
> > >
> > > /Bruce
> > >
> > > >
> > > > >
> > > > > Konstantin
> > > > >
> > > > >>
> > > > >>
> > > > >> Olivier
> > > > >>
> > > > >>
> > > > >>>
> > > > >>> /Bruce
> > > > >>>
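
A sketch of the change being weighed here - folding the next reset into the pre-free
step, right next to the refcnt handling. The function name and body are illustrative
only (the real __rte_pktmbuf_prefree_seg() also handles indirect mbufs); the point is
the single extra store per packet that Bruce is nervous about.

#include <rte_mbuf.h>

/* Illustrative pre-free step: decide whether the segment can go back to
 * the pool and, if so, restore the "next == NULL" invariant there. */
static inline struct rte_mbuf *
prefree_seg_sketch(struct rte_mbuf *m)
{
        if (rte_mbuf_refcnt_read(m) == 1 ||
                        rte_mbuf_refcnt_update(m, -1) == 0) {
                rte_mbuf_refcnt_set(m, 0);
                m->next = NULL;    /* the one extra store per packet */
                return m;          /* caller recycles it into the pool */
        }
        return NULL;               /* still referenced elsewhere */
}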

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 15:59                       ` Ananyev, Konstantin
@ 2015-06-15 16:02                         ` Bruce Richardson
  2015-06-15 16:10                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 16:02 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 04:59:55PM +0100, Ananyev, Konstantin wrote:
> 
> 
> 
> As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
> All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
> For vector TX - yes, need to verify that it would not introduce a slowdown.
> Konstantin
>

But if the function is only directly called from one place, and that doesn't
have a problem, why would we bother making any change at all?

/Bruce

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 16:02                         ` Bruce Richardson
@ 2015-06-15 16:10                           ` Ananyev, Konstantin
  2015-06-15 16:23                             ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-15 16:10 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev, Damjan Marion (damarion)



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, June 15, 2015 5:02 PM
> To: Ananyev, Konstantin
> Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> On Mon, Jun 15, 2015 at 04:59:55PM +0100, Ananyev, Konstantin wrote:
> >
> >
> >
> > As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
> > All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
> > For vector TX - yes, need to verify that it would not introduce a slowdown.
> > Konstantin
> >
> 
> But if the function is only directly called from one place, and that doesn't
> have a problem, why would we bother making any change at all?


For future usages?
But sure, if you believe that we can safely remove 'm->next = NULL' at RX path,
without any changes in the __rte_pktmbuf_prefree_seg() -
that seems fine to me.
Konstantin

> 
> /Bruce

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 16:10                           ` Ananyev, Konstantin
@ 2015-06-15 16:23                             ` Bruce Richardson
  2015-06-15 18:34                               ` Ananyev, Konstantin
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2015-06-15 16:23 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 05:10:44PM +0100, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Monday, June 15, 2015 5:02 PM
> > To: Ananyev, Konstantin
> > Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > 
> > On Mon, Jun 15, 2015 at 04:59:55PM +0100, Ananyev, Konstantin wrote:
> > >
> > >
> > >
> > > As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
> > > All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
> > > For vector TX - yes, need to verify that it would not introduce a slowdown.
> > > Konstantin
> > >
> > 
> > But if the function is only directly called from one place, and that doesn't
> > have a problem, why would we bother making any change at all?
> 
> 
> For future usages?
> But sure, if you believe that we can safely remove 'm->next = NULL' at RX path,
> without any changes in the __rte_pktmbuf_prefree_seg() -
> that seems fine to me.
> Konstantin
> 

If we find it's not safe, we can add in the change to __rte_pktmbuf_prefree_seg
as you suggest.

One other question: based on this, do you think it's safe to also remove the
assignment to NULL from the pktmbuf_alloc function? I suspect it should be safe, and
that should help any traffic-generator type applications that use that function
extensively.

/Bruce
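
One way to gain confidence before dropping that assignment - a hypothetical debug
wrapper (not an existing DPDK API; the function name is made up) that pulls a raw
object from the pool and asserts the invariant the whole scheme relies on.

#include <assert.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical check: every mbuf coming straight from the pool should
 * already have next == NULL if all free/recycle paths behave. */
static inline struct rte_mbuf *
raw_alloc_checked(struct rte_mempool *mp)
{
        void *obj;
        struct rte_mbuf *m;

        if (rte_mempool_get(mp, &obj) < 0)
                return NULL;
        m = obj;
        /* If this fires, some free path forgot to restore the invariant
         * and the explicit reset cannot be removed yet. */
        assert(m->next == NULL);
        return m;
}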

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 16:23                             ` Bruce Richardson
@ 2015-06-15 18:34                               ` Ananyev, Konstantin
  2015-06-15 20:47                                 ` Stephen Hemminger
  0 siblings, 1 reply; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-15 18:34 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev, Damjan Marion (damarion)



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, June 15, 2015 5:23 PM
> To: Ananyev, Konstantin
> Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> On Mon, Jun 15, 2015 at 05:10:44PM +0100, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Monday, June 15, 2015 5:02 PM
> > > To: Ananyev, Konstantin
> > > Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> > > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > >
> > > On Mon, Jun 15, 2015 at 04:59:55PM +0100, Ananyev, Konstantin wrote:
> > > >
> > > >
> > > >
> > > > As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
> > > > All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
> > > > For vector TX - yes, need to verify that it would not introduce a slowdown.
> > > > Konstantin
> > > >
> > >
> > > But if the function is only directly called from one place, and that doesn't
> > > have a problem, why would we bother making any change at all?
> >
> >
> > For future usages?
> > But sure, if you believe that we can safely remove 'm->next = NULL' at RX path,
> > without any changes in the __rte_pktmbuf_prefree_seg() -
> > that seems fine to me.
> > Konstantin
> >
> 
> If we find it's not safe, we can add in the change to __rte_pktmbuf_prefree_seg
> as you suggest.
> 
> One other question: based on this, do you think it's safe to also remove the
> assignment to NULL from the pktmbuf_alloc function? I suspect it should be safe, and
> that should help any traffic-generator type applications that use that function
> extensively.

So it will be set to NULL, either by:
- mbuf constructor.
- TX full-path free descriptors code.
- upper layer code that uses 'next' pointer explicitly.
?

I can't come up with a breakage scenario off-hand.
But that means we should avoid resetting tx_offload in pktmbuf_alloc() too, right?
Otherwise there probably wouldn't be any real gain.
Konstantin

> 
> /Bruce
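
To make the intended gain concrete, a hypothetical "light" allocation that leans on
those invariants and touches only what a typical RX refill needs. The function name
is made up, and it assumes the free paths listed above really do leave next (and
tx_offload, per the remark above) in a clean state.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical light alloc: no m->next = NULL and no tx_offload reset
 * here; both are assumed to be restored by whoever dirtied them. */
static inline struct rte_mbuf *
pktmbuf_alloc_light(struct rte_mempool *mp)
{
        void *obj;
        struct rte_mbuf *m;

        if (rte_mempool_get(mp, &obj) < 0)
                return NULL;
        m = obj;
        rte_mbuf_refcnt_set(m, 1);
        m->data_off = RTE_PKTMBUF_HEADROOM;
        m->nb_segs = 1;
        return m;
}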

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 18:34                               ` Ananyev, Konstantin
@ 2015-06-15 20:47                                 ` Stephen Hemminger
  2015-06-16  8:20                                   ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen Hemminger @ 2015-06-15 20:47 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Damjan Marion (damarion)

On Mon, 15 Jun 2015 18:34:13 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> 
> 
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Monday, June 15, 2015 5:23 PM
> > To: Ananyev, Konstantin
> > Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > 
> > On Mon, Jun 15, 2015 at 05:10:44PM +0100, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Richardson, Bruce
> > > > Sent: Monday, June 15, 2015 5:02 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> > > > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > >
> > > > On Mon, Jun 15, 2015 at 04:59:55PM +0100, Ananyev, Konstantin wrote:
> > > > >
> > > > >
> > > > >
> > > > > As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
> > > > > All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
> > > > > For vector TX - yes, need to verify that it would not introduce a slowdown.
> > > > > Konstantin
> > > > >
> > > >
> > > > But if the function is only directly called from one place, and that doesn't
> > > > have a problem, why would we bother making any change at all?
> > >
> > >
> > > For future usages?
> > > But sure, if you believe that we can safely remove 'm->next = NULL' at RX path,
> > > without any changes in the __rte_pktmbuf_prefree_seg() -
> > > that seems fine to me.
> > > Konstantin
> > >
> > 
> > If we find it's not safe, we can add in the change to __rte_pktmbuf_prefree_seg
> > as you suggest.
> > 
> > One other question: based on this, do you think it's safe to also remove the
> > assignment to NULL from the pktmbuf_alloc function? I suspect it should be safe, and
> > that should help any traffic-generator type applications that use that function
> > extensively.
> 
> So it will be set to NULL, either by:
> - mbuf constructor.
> - TX full-path free descriptors code.
> - upper layer code that uses 'next' pointer explicitly.
> ?
> 
> I can't come up with a breakage scenario off-hand.
> But that means we should avoid resetting tx_offload in pktmbuf_alloc() too, right?
> Otherwise there probably wouldn't be any real gain.
> Konstantin
> 
> > 
> > /Bruce
> 
> 

The issue is when an mbuf is used once by something that sets next to non-NULL;
the packet is sent and then freed.
Then the freed packet is picked up by one of the drivers using "fast path" Rx code.

IMHO the "fast path" Rx code has to go away. All the special allocation code
should call rte_pktmbuf_reset and then only set the fields that are special
for the received packet.  The compiler should be able to remove the redundant
stores, and it would prevent bugs I am seeing where some fields are not set
in some drivers.
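
A driver-agnostic sketch of the refill style suggested here: take the generic
rte_pktmbuf_reset() and then set only what is RX-specific, leaving it to the compiler
to drop stores it can prove redundant. The function name and the bulk-get error
handling are illustrative only.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Driver-agnostic refill: rte_pktmbuf_reset() re-establishes next,
 * nb_segs, data_off and friends; only RX-specific fields follow it. */
static int
rx_refill_with_reset(struct rte_mempool *mp, struct rte_mbuf **mbufs,
                     unsigned int n, uint8_t port_id)
{
        unsigned int i;

        if (rte_mempool_get_bulk(mp, (void **)mbufs, n) < 0)
                return -1;
        for (i = 0; i < n; i++) {
                struct rte_mbuf *m = mbufs[i];

                rte_mbuf_refcnt_set(m, 1);
                rte_pktmbuf_reset(m);
                m->port = port_id;        /* the only RX-specific field here */
        }
        return 0;
}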

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 20:47                                 ` Stephen Hemminger
@ 2015-06-16  8:20                                   ` Bruce Richardson
  0 siblings, 0 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-06-16  8:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Damjan Marion (damarion)

On Mon, Jun 15, 2015 at 01:47:26PM -0700, Stephen Hemminger wrote:
> On Mon, 15 Jun 2015 18:34:13 +0000
> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> 
> > 
> > 
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Monday, June 15, 2015 5:23 PM
> > > To: Ananyev, Konstantin
> > > Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> > > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > 
> > > On Mon, Jun 15, 2015 at 05:10:44PM +0100, Ananyev, Konstantin wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Richardson, Bruce
> > > > > Sent: Monday, June 15, 2015 5:02 PM
> > > > > To: Ananyev, Konstantin
> > > > > Cc: Olivier MATZ; dev@dpdk.org; Damjan Marion (damarion)
> > > > > Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> > > > >
> > > > > On Mon, Jun 15, 2015 at 04:59:55PM +0100, Ananyev, Konstantin wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > As I can see, vector TX is the only one that calls __rte_pktmbuf_prefree_seg() directly.
> > > > > > All others use rte_pktmbuf_free_seg(), that does ' m->next = NULL' anyway.
> > > > > > For vector TX - yes, need to verify that it would not introduce a slowdown.
> > > > > > Konstantin
> > > > > >
> > > > >
> > > > > But if the function is only directly called from one place, and that doesn't
> > > > > have a problem, why would we bother making any change at all?
> > > >
> > > >
> > > > For future usages?
> > > > But sure, if you believe that we can safely remove 'm->next = NULL' at RX path,
> > > > without any changes in the __rte_pktmbuf_prefree_seg() -
> > > > that seems fine to me.
> > > > Konstantin
> > > >
> > > 
> > > If we find it's not safe, we can add in the change to __rte_pktmbuf_prefree_seg
> > > as you suggest.
> > > 
> > > One other question: based on this, do you think it's safe to also remove the
> > > assignment to NULL from the pktmbuf_alloc function? I suspect it should be safe, and
> > > that should help any traffic-generator type applications that use that function
> > > extensively.
> > 
> > So it will be set to NULL, either by:
> > - mbuf constructor.
> > - TX full-path free descriptors code.
> > - upper layer code that uses 'next' pointer explicitly.
> > ?
> > 
> > I can't come up with a breakage scenario off-hand.
> > But that means we should avoid resetting tx_offload in pktmbuf_alloc() too, right?
> > Otherwise there probably wouldn't be any real gain.
> > Konstantin
> > 
> > > 
> > > /Bruce
> > 
> > 
> 
> The issue is when an mbuf is used once by something that sets next to non-NULL;
> the packet is sent and then freed.
> Then the freed packet is picked up by one of the drivers using "fast path" Rx code.
> 
> IMHO the "fast path" Rx code has to go away. All the special allocation code
> should call rte_pktmbuf_reset and then only set the fields that are special
> for the received packet.  The compiler should be able to remove the redundant
> stores, and it would prevent bugs I am seeing where some fields are not set
> in some drivers.

Is the better option here not to just fix the drivers rather than making the
fast-path RX code go away? Can you give us a report out on what fields are not
getting correctly set, and when using what drivers, and we can look and see 
where our gaps in coverage are?

/Bruce

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-15 14:12         ` Bruce Richardson
  2015-06-15 14:30           ` Olivier MATZ
@ 2015-06-17 13:55           ` Damjan Marion (damarion)
  2015-06-17 14:04             ` Thomas Monjalon
  2015-06-17 14:06             ` Bruce Richardson
  1 sibling, 2 replies; 26+ messages in thread
From: Damjan Marion (damarion) @ 2015-06-17 13:55 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev


> On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson@intel.com> wrote:
> 
> The next pointers always start out as NULL when the mbuf pool is created. The
> only time it is set to non-NULL is when we have chained mbufs. If we never have
> any chained mbufs, we never need to touch the next field, or even read it - since
> we have the num-segments count in the first cache line. If we do have a multi-segment
> mbuf, it's likely to be a big packet, so we have more processing time available
> and we can then take the hit of setting the next pointer.

There are applications which are not using rx offload, but they deal with chained mbufs.
Why are they less important than the ones using rx offload? This is something people
should be able to configure at build time.
That should not be too hard to achieve with a set of macros. I can come up with the patch...

Thanks,

Damjan
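
For illustration, roughly the kind of build-time switch being proposed. Every name
below (the EX_MBUF_NEXT_IN_FIRST_CL config symbol, struct ex_mbuf and its stand-in
fields) is invented, and the real rte_mbuf obviously carries many more fields; the
sketch only shows the #ifdef swap of next against spare RX-offload room between the
two cache lines.

#include <stdint.h>

#define EX_CACHE_LINE 64

struct ex_mbuf {
        /* ---- cache line 0 ---- */
        void     *buf_addr;              /*  8B */
        uint64_t  hot_rx_meta;           /*  8B: stands in for the real hot fields */
#ifdef EX_MBUF_NEXT_IN_FIRST_CL
        struct ex_mbuf *next;            /*  8B: hot when chaining without offloads */
#else
        uint64_t  rx_offload_room;       /*  8B: kept free for future RX offloads */
#endif
        uint8_t   pad0[EX_CACHE_LINE - 24];

        /* ---- cache line 1 ---- */
#ifdef EX_MBUF_NEXT_IN_FIRST_CL
        uint64_t  rx_offload_room;
#else
        struct ex_mbuf *next;
#endif
        uint8_t   pad1[EX_CACHE_LINE - 8];
} __attribute__((aligned(EX_CACHE_LINE)));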

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-17 13:55           ` Damjan Marion (damarion)
@ 2015-06-17 14:04             ` Thomas Monjalon
  2015-06-17 14:06             ` Bruce Richardson
  1 sibling, 0 replies; 26+ messages in thread
From: Thomas Monjalon @ 2015-06-17 14:04 UTC (permalink / raw)
  To: Damjan Marion (damarion); +Cc: dev

2015-06-17 13:55, Damjan Marion:
> 
> > On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson@intel.com> wrote:
> > 
> > The next pointers always start out as NULL when the mbuf pool is created. The
> > only time it is set to non-NULL is when we have chained mbufs. If we never have
> > any chained mbufs, we never need to touch the next field, or even read it - since
> > we have the num-segments count in the first cache line. If we do have a multi-segment
> > mbuf, it's likely to be a big packet, so we have more processing time available
> > and we can then take the hit of setting the next pointer.
> 
> There are applications which are not using rx offload, but they deal with chained mbufs.
> Why they are less important than ones using rx offload? This is something people 
> should be able to configure on build time.
> That should not be too hard to achieve with set of macros. I can come up with the patch...

Having a build-time configuration of mbuf totally breaks the idea of having
some shared libraries.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-17 13:55           ` Damjan Marion (damarion)
  2015-06-17 14:04             ` Thomas Monjalon
@ 2015-06-17 14:06             ` Bruce Richardson
  2015-06-17 14:23               ` Damjan Marion (damarion)
       [not found]               ` <0DE313B5-C9F0-4879-9D92-838ED088202C@cisco.com>
  1 sibling, 2 replies; 26+ messages in thread
From: Bruce Richardson @ 2015-06-17 14:06 UTC (permalink / raw)
  To: Damjan Marion (damarion); +Cc: dev

On Wed, Jun 17, 2015 at 01:55:57PM +0000, Damjan Marion (damarion) wrote:
> 
> > On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson@intel.com> wrote:
> > 
> > The next pointers always start out as NULL when the mbuf pool is created. The
> > only time it is set to non-NULL is when we have chained mbufs. If we never have
> > any chained mbufs, we never need to touch the next field, or even read it - since
> > we have the num-segments count in the first cache line. If we do have a multi-segment
> > mbuf, it's likely to be a big packet, so we have more processing time available
> > and we can then take the hit of setting the next pointer.
> 
> There are applications which are not using rx offload, but they deal with chained mbufs.
> Why they are less important than ones using rx offload? This is something people 
> should be able to configure on build time.

It's not that they are less important, it's that the packet processing cycle count
budget is going to be greater. A packet which is 64 bytes, or 128 bytes in size
can make use of a number of RX offloads to reduce its processing time. However,
a 64/128 packet is not going to be split across multiple buffers [unless we
are dealing with a very unusual setup!].

To handle 64 byte packets at 40G line rate, one has 50 cycles per core per packet
when running at 3GHz. [3000000000 cycles / 59.5 mpps].
If we assume that we are dealing with fairly small buffers
here, and that any packet greater than 1k is chained, we still have 626
cycles per 3GHz core per packet to work with for that 1k packet. Given that
"normal" DPDK buffers are 2k in size, we have over a thousand cycles per packet
for any packet that is split. 

In summary, packets spread across multiple buffers are large packets, and so have
larger packet cycle count budgets and so can much better absorb the cost of
touching a second cache line in the mbuf than a 64-byte packet can. Therefore,
we optimize for the 64B packet case.

Hope this clarifies things a bit.

Regards,
/Bruce
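
The arithmetic above, written out. It assumes 20 bytes of per-frame preamble/IFG
overhead on the wire and a 3GHz core, matching the numbers in the mail.

#include <stdio.h>

int main(void)
{
        const double link_bps = 40e9;                /* 40G line rate */
        const double cpu_hz   = 3e9;                 /* 3GHz core */
        const double wire_64  = (64.0 + 20.0) * 8;   /* frame + preamble/IFG bits */
        const double wire_1k  = (1024.0 + 20.0) * 8;

        double pps_64 = link_bps / wire_64;          /* ~59.5 Mpps */
        double pps_1k = link_bps / wire_1k;          /* ~4.8 Mpps */

        printf("64B:  %.1f Mpps -> %.0f cycles/packet\n",
               pps_64 / 1e6, cpu_hz / pps_64);       /* ~50 cycles */
        printf("1kB:  %.1f Mpps -> %.0f cycles/packet\n",
               pps_1k / 1e6, cpu_hz / pps_1k);       /* ~626 cycles */
        return 0;
}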

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-17 14:06             ` Bruce Richardson
@ 2015-06-17 14:23               ` Damjan Marion (damarion)
  2015-06-17 16:32                 ` Thomas Monjalon
       [not found]               ` <0DE313B5-C9F0-4879-9D92-838ED088202C@cisco.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Damjan Marion (damarion) @ 2015-06-17 14:23 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev


> On 17 Jun 2015, at 16:06, Bruce Richardson <bruce.richardson@intel.com> wrote:
> 
> On Wed, Jun 17, 2015 at 01:55:57PM +0000, Damjan Marion (damarion) wrote:
>> 
>>> On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson@intel.com> wrote:
>>> 
>>> The next pointers always start out as NULL when the mbuf pool is created. The
>>> only time it is set to non-NULL is when we have chained mbufs. If we never have
>>> any chained mbufs, we never need to touch the next field, or even read it - since
>>> we have the num-segments count in the first cache line. If we do have a multi-segment
>>> mbuf, it's likely to be a big packet, so we have more processing time available
>>> and we can then take the hit of setting the next pointer.
>> 
>> There are applications which are not using rx offload, but they deal with chained mbufs.
>> Why they are less important than ones using rx offload? This is something people 
>> should be able to configure on build time.
> 
> It's not that they are less important, it's that the packet processing cycle count
> budget is going to be greater. A packet which is 64 bytes, or 128 bytes in size
> can make use of a number of RX offloads to reduce it's processing time. However,
> a 64/128 packet is not going to be split across multiple buffers [unless we
> are dealing with a very unusual setup!].
> 
> To handle 64 byte packets at 40G line rate, one has 50 cycles per core per packet
> when running at 3GHz. [3000000000 cycles / 59.5 mpps].
> If we assume that we are dealing with fairly small buffers
> here, and that anything greater than 1k packets are chained, we still have 626
> cycles per 3GHz core per packet to work with for that 1k packet. Given that
> "normal" DPDK buffers are 2k in size, we have over a thousand cycles per packet
> for any packet that is split. 
> 
> In summary, packets spread across multiple buffers are large packets, and so have
> larger packet cycle count budgets and so can much better absorb the cost of
> touching a second cache line in the mbuf than a 64-byte packet can. Therefore,
> we optimize for the 64B packet case.

This makes sense if there is no other work to do on the same core.
Otherwise it is better to spend those cycles on actual work instead of waiting for
2nd cache line...

Thanks,

Damjan
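
One generic way to spend those cycles on work rather than on the stall: issue the
prefetch for the next mbuf's second cache line before processing the current one.
handle_burst() and process_pkt() are invented names standing in for whatever the
application does per packet.

#include <rte_mbuf.h>
#include <rte_memory.h>
#include <rte_prefetch.h>

/* Prefetch the 2nd cache line of mbuf i+1 while packet i is processed,
 * so the useful work overlaps with the outstanding miss. */
static void
handle_burst(struct rte_mbuf **pkts, uint16_t nb,
             void (*process_pkt)(struct rte_mbuf *))
{
        uint16_t i;

        for (i = 0; i < nb; i++) {
                if (i + 1 < nb)
                        rte_prefetch0((char *)pkts[i + 1] +
                                        RTE_CACHE_LINE_SIZE);
                process_pkt(pkts[i]);
        }
}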

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
  2015-06-17 14:23               ` Damjan Marion (damarion)
@ 2015-06-17 16:32                 ` Thomas Monjalon
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Monjalon @ 2015-06-17 16:32 UTC (permalink / raw)
  To: Damjan Marion (damarion); +Cc: dev

2015-06-17 14:23, Damjan Marion:
> 
> > On 17 Jun 2015, at 16:06, Bruce Richardson <bruce.richardson@intel.com> wrote:
> > 
> > On Wed, Jun 17, 2015 at 01:55:57PM +0000, Damjan Marion (damarion) wrote:
> >> 
> >>> On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson@intel.com> wrote:
> >>> 
> >>> The next pointers always start out as NULL when the mbuf pool is created. The
> >>> only time it is set to non-NULL is when we have chained mbufs. If we never have
> >>> any chained mbufs, we never need to touch the next field, or even read it - since
> >>> we have the num-segments count in the first cache line. If we do have a multi-segment
> >>> mbuf, it's likely to be a big packet, so we have more processing time available
> >>> and we can then take the hit of setting the next pointer.
> >> 
> >> There are applications which are not using rx offload, but they deal with chained mbufs.
> >> Why they are less important than ones using rx offload? This is something people 
> >> should be able to configure on build time.
> > 
> > It's not that they are less important, it's that the packet processing cycle count
> > budget is going to be greater. A packet which is 64 bytes, or 128 bytes in size
> > can make use of a number of RX offloads to reduce it's processing time. However,
> > a 64/128 packet is not going to be split across multiple buffers [unless we
> > are dealing with a very unusual setup!].
> > 
> > To handle 64 byte packets at 40G line rate, one has 50 cycles per core per packet
> > when running at 3GHz. [3000000000 cycles / 59.5 mpps].
> > If we assume that we are dealing with fairly small buffers
> > here, and that anything greater than 1k packets are chained, we still have 626
> > cycles per 3GHz core per packet to work with for that 1k packet. Given that
> > "normal" DPDK buffers are 2k in size, we have over a thousand cycles per packet
> > for any packet that is split. 
> > 
> > In summary, packets spread across multiple buffers are large packets, and so have
> > larger packet cycle count budgets and so can much better absorb the cost of
> > touching a second cache line in the mbuf than a 64-byte packet can. Therefore,
> > we optimize for the 64B packet case.
> 
> This makes sense if there is no other work to do on the same core.
> Otherwise it is better to spend those cycles on actual work instead of waiting for
> 2nd cache line...

You're probably right.
I wonder whether this flexibility can be implemented only in static lib builds?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
       [not found]                         ` <2601191342CEEE43887BDE71AB97725836A1237C@irsmsx105.ger.corp.intel.com>
@ 2015-06-17 18:50                           ` Ananyev, Konstantin
  0 siblings, 0 replies; 26+ messages in thread
From: Ananyev, Konstantin @ 2015-06-17 18:50 UTC (permalink / raw)
  To: dev

Hi Dave,

> From: Dave Barach (dbarach) [mailto:dbarach@cisco.com]
> Sent: Wednesday, June 17, 2015 4:46 PM
> To: Venkatesan, Venky; Richardson, Bruce; olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Dear Venky,
> 
> The first thing I noticed - on a specific piece of hardware, yadda yadda yadda - was that the i40e driver speed-path spent an ungodly
> amount of time stalled in i40e_rx_alloc_bufs(.) in rte_mbuf_refcnt_set. Mumble, missing prefetch. I added a stride-of-1 prefetch,
> which made the first stall go away. See below.
> 
> Next thing I noticed: a stall setting mb->next = 0, in the non-scattered RX case. So I added the (commented-out) rte_prefetch0
> (&pfmb->next). At that point, I decided to move the buffer metadata around to eliminate the second prefetch.
> 
> Taken together, these changes made a 10% PPS difference, again in a specific use case.

That seems like a valid point, but why is moving next to the first cache line considered the only possible option?
As Bruce suggested before, we can try to get rid of touching next in non-scattered RX routines.
That, I think, should provide a similar performance improvement for the i40e/ixgbe fast-path scalar RX code.

Or are you talking about the hit in your app-layer code, when you start chaining mbufs together?

> 
> Damjan was going to produce a cleaned up version of the rte_mbuf.h diffs, of the form:
> 
> struct rte_mbuf {
> #ifdef CONFIG_OFFLOAD_IN_FIRST_CACHE_LINE
>   Offload-in-first-cache-line;
>   Next-in-second-cache-line;
> #else
>   Offload-in-second-cache-line;
> >   Next-in-first-cache-line;
> #endif
> };
> 
> ...along with a parallel change in the kernel module version.

First, with the changes proposed below, the ixgbe vector RX routine would be broken.
Of course, it could be fixed by putting even more conditional compilation around it -
enable vector RX only when OFFLOAD_IN_FIRST_CACHE_LINE is enabled, etc.
Second, how long would it take before someone else would like to introduce another mbuf field swap?
All for perfectly good reasons, of course.
Let's say, swap 'hash' and 'seqn' (to keep vector RX working),
or 'next' and 'userdata', or put tx_offload into the first cache line (traffic generators).
I think if we go that way (allowing mbuf field swapping at build time) we'll end up with
totally unmaintainable code.
Not to mention problems with ABI compatibility (shared libraries, multiple processes, KNI).
So, I think we'd better stick with one mbuf format.
If it's absolutely necessary to move next into the first cache line, I think it has to be done for all configs.
Though, from what you described with i40e_rx_alloc_bufs() -
that looks like a flaw in that particular implementation, which might be fixed without changing the mbuf format.

> 
> I'm out of LSU bandwidth / prefetch slots in a number of apps - mostly L2 apps - which will never use h/w offload results. I can see
> your point about not turning dpdk buffer metadata into a garbage can. On the other hand, the problems we face often involve MTU
> => scattered RX, with a mix of small and large packets. As in: enough PPS to care about extra cache-line fills.
> 
> Hope this explains it a bit. We can obviously make these changes in our own repos, but I can't imagine that we're the only folks who
> might not use h/w offload results.
> 
> Thanks. Dave
> 
> 
> diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
> index 9c7be6f..1200361 100644
> --- a/lib/librte_pmd_i40e/i40e_rxtx.c
> +++ b/lib/librte_pmd_i40e/i40e_rxtx.c
> @@ -779,9 +779,15 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> 
>       rxdp = &rxq->rx_ring[alloc_idx];
>       for (i = 0; i < rxq->rx_free_thresh; i++) {
> +                if (i < (rxq->rx_free_thresh - 1)) {
> +                        struct rte_mbuf *pfmb;
> +                        pfmb = rxep[i+1].mbuf;
> +                        rte_prefetch0 (pfmb);
> +                        // rte_prefetch0 (&pfmb->next);

I wonder, does your compiler unroll that loop?
If not, would manually unrolling it (by 4 or so) help?
Konstantin  


> +                }
>             mb = rxep[i].mbuf;
>             rte_mbuf_refcnt_set(mb, 1);
> -           mb->next = NULL;
> +           mb->next = NULL; /* $$$ in second cacheline */
>             mb->data_off = RTE_PKTMBUF_HEADROOM;
>             mb->nb_segs = 1;
>             mb->port = rxq->port_id;
> 
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 55749bc..efd7f4e 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -288,6 +288,12 @@ struct rte_mbuf {
>       uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
>       uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
>       uint16_t reserved;
> +     uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> +
> +     struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +
> +     /* second cache line - fields only used in slow path or on TX */
> +     MARKER cacheline1 __rte_cache_aligned;
>       union {
>             uint32_t rss;     /**< RSS hash result if RSS enabled */
>             struct {
> @@ -307,18 +313,12 @@ struct rte_mbuf {
>             uint32_t usr;       /**< User defined tags. See rte_distributor_process() */
>       } hash;                   /**< hash information */
> 
> -     uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> -
> -     /* second cache line - fields only used in slow path or on TX */
> -     MARKER cacheline1 __rte_cache_aligned;
> -
>       union {
>             void *userdata;   /**< Can be used for external metadata */
>             uint64_t udata64; /**< Allow 8-byte userdata on 32-bit */
>       };
> 
>       struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> -     struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> 
>       /* fields to support TX offloads */
>       union {
> 
> 
> 
> 
> Thanks. Dave
> 
> From: Venkatesan, Venky [mailto:venky.venkatesan@intel.com]
> Sent: Wednesday, June 17, 2015 11:03 AM
> To: Richardson, Bruce; Dave Barach (dbarach); olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Dave,
> 
> Is there a patch that I can look at? From your description what you seem to indicate is that the mbuf structure becomes variant
> between applications - that to me is likely a non-starter. If rte_mbuf is variant between implementations, then  we really have no
> commonality that all VNFs can rely on.
> 
> I may be getting a bit ahead of things here but letting different apps implement different mbuf layouts means, to achieve any level of
> commonality for upper level apps we need to go to accessor functions (not inline macros) to get various components - pretty much
> like ODP does; all that does is add function call overhead per access and that is completely non-performant. We toyed with that about
> 5 years ago and tossed it out.
> 
> Bruce,
> 
> To some extent I need to get an understanding of why this performance drop is actually happening. Since SNB we have had a paired
> cache line prefetcher that brings in the cache line pair. I think the reason that we have the perf issue is that half the mbufs actually
> begin on an odd cache line boundary - i.e. those are the ones that will suffer a cache miss on rte_mbuf.next. Could you verify that this
> is the case, and see what happens if we address all mbufs beginning on an even cache line? That runs into another potential issue with
> 4K aliasing, but I may have a way to avoid that.
> 
> Regards,
> -Venky
> 
> 
> From: Richardson, Bruce
> Sent: Wednesday, June 17, 2015 7:50 AM
> To: Dave Barach (dbarach); olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion); Venkatesan, Venky
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Hi,
> 
> it's something we can look at - especially once we see the proposed patch.
> 
> However, my question is, if an app is cycle constrained such that the extra cache-line reference for jumbo frames cause a problem, is
> that app not likely to really suffer performance issues when run with smaller e.g. 256 byte packet sizes?  And in the inverse case, if an
> app can deal with small packets, is it not likely able to take the extra hit for the jumbo frames? This was our thinking when doing the
> cache-line split, but perhaps the logic doesn't hold up in the cases you refer to.
> 
> Regards,
> /Bruce
> 
> From: Dave Barach (dbarach) [mailto:dbarach@cisco.com]
> Sent: Wednesday, June 17, 2015 3:37 PM
> To: Richardson, Bruce; olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Dear Bruce,
> 
> We've implemented a number of use cases which can't or at least don't currently use hardware offload data. Examples: L2 bridging,
> integrated routing and bridging, various flavors of tunneling, ipv6 MAP, ipv6 segment routing, and so on.
> 
> You're not considering the cost of additional pressure on the load-store unit, memory system, caused when mb->next must be
> prefetched, set, and cleared. It doesn't matter that the packets are large, the cost still exists. As we transition to 40 and 100g NICs,
> large-packet PPS-per-core becomes more of an issue.
> 
> Damjan has offered to make the layout of the buffer metadata configurable, so that folks can "have it their way." Given that it's a ~10
> line change - with minimal comedic potential - it seems like a reasonable way to go.
> 
> Thoughts?
> 
> Thanks. Dave Barach
> Cisco Fellow
> 
> From: Damjan Marion (damarion)
> Sent: Wednesday, June 17, 2015 10:17 AM
> To: Dave Barach (dbarach)
> Subject: Fwd: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> 
> 
> Begin forwarded message:
> 
> From: Bruce Richardson <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> Date: 17 Jun 2015 16:06:48 CEST
> To: "Damjan Marion (damarion)" <damarion@cisco.com>
> Cc: Olivier MATZ <olivier.matz@6wind.com>, "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "dev@dpdk.org"
> <dev@dpdk.org>
> 
> On Wed, Jun 17, 2015 at 01:55:57PM +0000, Damjan Marion (damarion) wrote:
> 
> On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson@intel.com> wrote:
> 
> The next pointers always start out as NULL when the mbuf pool is created. The
> only time it is set to non-NULL is when we have chained mbufs. If we never have
> any chained mbufs, we never need to touch the next field, or even read it - since
> we have the num-segments count in the first cache line. If we do have a multi-segment
> mbuf, it's likely to be a big packet, so we have more processing time available
> and we can then take the hit of setting the next pointer.
> 
> There are applications which are not using rx offload, but they deal with chained mbufs.
> Why they are less important than ones using rx offload? This is something people
> should be able to configure on build time.
> 
> It's not that they are less important, it's that the packet processing cycle count
> budget is going to be greater. A packet which is 64 bytes, or 128 bytes in size
> can make use of a number of RX offloads to reduce it's processing time. However,
> a 64/128 packet is not going to be split across multiple buffers [unless we
> are dealing with a very unusual setup!].
> 
> To handle 64 byte packets at 40G line rate, one has 50 cycles per core per packet
> when running at 3GHz. [3000000000 cycles / 59.5 mpps].
> If we assume that we are dealing with fairly small buffers
> here, and that anything greater than 1k packets are chained, we still have 626
> cycles per 3GHz core per packet to work with for that 1k packet. Given that
> "normal" DPDK buffers are 2k in size, we have over a thousand cycles per packet
> for any packet that is split.
> 
> In summary, packets spread across multiple buffers are large packets, and so have
> larger packet cycle count budgets and so can much better absorb the cost of
> touching a second cache line in the mbuf than a 64-byte packet can. Therefore,
> we optimize for the 64B packet case.
> 
> Hope this clarifies things a bit.
> 
> Regards,
> /Bruce
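
On the unrolling question above, a hand-unrolled (by 4) variant of that refill loop
with the stride-ahead prefetch, written against a plain mbuf array rather than the
driver's sw_ring/descriptor types - a sketch only, with invented names
(refill_unroll4, REFILL_ONE).

#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define REFILL_ONE(mb, port) do {                        \
        rte_mbuf_refcnt_set((mb), 1);                    \
        (mb)->next = NULL;                               \
        (mb)->data_off = RTE_PKTMBUF_HEADROOM;           \
        (mb)->nb_segs = 1;                               \
        (mb)->port = (port);                             \
} while (0)

/* Refill in blocks of four, prefetching one block ahead. */
static void
refill_unroll4(struct rte_mbuf **mbufs, unsigned int n, uint8_t port_id)
{
        unsigned int i = 0;

        for (; i + 4 <= n; i += 4) {
                if (i + 4 < n)
                        rte_prefetch0(mbufs[i + 4]);     /* stride-ahead prefetch */
                REFILL_ONE(mbufs[i + 0], port_id);
                REFILL_ONE(mbufs[i + 1], port_id);
                REFILL_ONE(mbufs[i + 2], port_id);
                REFILL_ONE(mbufs[i + 3], port_id);
        }
        for (; i < n; i++)                               /* remainder */
                REFILL_ONE(mbufs[i], port_id);
}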

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-06-17 18:50 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-10 21:47 [dpdk-dev] rte_mbuf.next in 2nd cacheline Damjan Marion (damarion)
2015-06-15 13:20 ` Olivier MATZ
2015-06-15 13:44   ` Bruce Richardson
2015-06-15 13:54     ` Ananyev, Konstantin
2015-06-15 14:05       ` Olivier MATZ
2015-06-15 14:12         ` Bruce Richardson
2015-06-15 14:30           ` Olivier MATZ
2015-06-15 14:46             ` Bruce Richardson
2015-06-15 14:52             ` Ananyev, Konstantin
2015-06-15 15:19               ` Olivier MATZ
2015-06-15 15:23                 ` Bruce Richardson
2015-06-15 15:28                   ` Ananyev, Konstantin
2015-06-15 15:39                     ` Bruce Richardson
2015-06-15 15:59                       ` Ananyev, Konstantin
2015-06-15 16:02                         ` Bruce Richardson
2015-06-15 16:10                           ` Ananyev, Konstantin
2015-06-15 16:23                             ` Bruce Richardson
2015-06-15 18:34                               ` Ananyev, Konstantin
2015-06-15 20:47                                 ` Stephen Hemminger
2015-06-16  8:20                                   ` Bruce Richardson
2015-06-17 13:55           ` Damjan Marion (damarion)
2015-06-17 14:04             ` Thomas Monjalon
2015-06-17 14:06             ` Bruce Richardson
2015-06-17 14:23               ` Damjan Marion (damarion)
2015-06-17 16:32                 ` Thomas Monjalon
     [not found]               ` <0DE313B5-C9F0-4879-9D92-838ED088202C@cisco.com>
     [not found]                 ` <27EA8870B328F74E88180827A0F816396BD43720@xmb-aln-x10.cisco.com>
     [not found]                   ` <59AF69C657FD0841A61C55336867B5B0345592CD@IRSMSX103.ger.corp.intel.com>
     [not found]                     ` <1FD9B82B8BF2CF418D9A1000154491D97450B186@ORSMSX102.amr.corp.intel.com>
     [not found]                       ` <27EA8870B328F74E88180827A0F816396BD43891@xmb-aln-x10.cisco.com>
     [not found]                         ` <2601191342CEEE43887BDE71AB97725836A1237C@irsmsx105.ger.corp.intel.com>
2015-06-17 18:50                           ` Ananyev, Konstantin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).