[RFC] eal: allow worker lcore stacks to be allocated from hugepage memory

DPDK patches and discussions
 help / color / mirror / Atom feed

* [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
@ 2022-04-26 12:19 Don Wallwork
  2022-04-26 14:58 ` Stephen Hemminger
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Don Wallwork @ 2022-04-26 12:19 UTC (permalink / raw)
  To: dev; +Cc: Don Wallwork

Add support for using hugepages for worker lcore stack memory.  The
intent is to improve performance by reducing stack memory related TLB
misses and also by using memory local to the NUMA node of each lcore.

Platforms desiring to make use of this capability must enable the
associated option flag and stack size settings in platform config
files.
---
 lib/eal/linux/eal.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 1ef263434a..4e1e5b6915 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1143,9 +1143,48 @@ rte_eal_init(int argc, char **argv)
 
 		lcore_config[i].state = WAIT;
 
+#ifdef RTE_EAL_NUMA_AWARE_LCORE_STACK
+		/* Allocate NUMA aware stack memory and set pthread attributes */
+		pthread_attr_t attr;
+		void *stack_ptr =
+			rte_zmalloc_socket("lcore_stack",
+					   RTE_EAL_NUMA_AWARE_LCORE_STACK_SIZE,
+					   RTE_EAL_NUMA_AWARE_LCORE_STACK_SIZE,
+					   rte_lcore_to_socket_id(i));
+
+		if (stack_ptr == NULL) {
+			rte_eal_init_alert("Cannot allocate stack memory");
+			rte_errno = ENOMEM;
+			return -1;
+		}
+
+		if (pthread_attr_init(&attr) != 0) {
+			rte_eal_init_alert("Cannot init pthread attributes");
+			rte_errno = EINVAL;
+			return -1;
+		}
+		if (pthread_attr_setstack(&attr,
+					  stack_ptr,
+					  RTE_EAL_NUMA_AWARE_LCORE_STACK_SIZE) != 0) {
+			rte_eal_init_alert("Cannot set pthread stack attributes");
+			rte_errno = ENOTSUP;
+			return -1;
+		}
+
+		/* create a thread for each lcore */
+		ret = pthread_create(&lcore_config[i].thread_id, &attr,
+				     eal_thread_loop, (void *)(uintptr_t)i);
+
+		if (pthread_attr_destroy(&attr) != 0) {
+			rte_eal_init_alert("Cannot destroy pthread attributes");
+			rte_errno = EFAULT;
+			return -1;
+		}
+#else
 		/* create a thread for each lcore */
 		ret = pthread_create(&lcore_config[i].thread_id, NULL,
 				     eal_thread_loop, (void *)(uintptr_t)i);
+#endif
 		if (ret != 0)
 			rte_panic("Cannot create thread\n");
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 12:19 [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory Don Wallwork
@ 2022-04-26 14:58 ` Stephen Hemminger
  2022-04-26 21:01   ` Don Wallwork
  2022-04-27  0:42 ` Honnappa Nagarahalli
  2022-04-29 20:00 ` [RFC v2] " Don Wallwork
  2 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2022-04-26 14:58 UTC (permalink / raw)
  To: Don Wallwork; +Cc: dev

On Tue, 26 Apr 2022 08:19:59 -0400
Don Wallwork <donw@xsightlabs.com> wrote:

> Add support for using hugepages for worker lcore stack memory.  The
> intent is to improve performance by reducing stack memory related TLB
> misses and also by using memory local to the NUMA node of each lcore.
> 
> Platforms desiring to make use of this capability must enable the
> associated option flag and stack size settings in platform config
> files.
> ---
>  lib/eal/linux/eal.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 

Good idea but having a fixed size stack makes writing complex application
more difficult. Plus you lose the safety of guard pages.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 14:58 ` Stephen Hemminger
@ 2022-04-26 21:01   ` Don Wallwork
  2022-04-26 21:21     ` Stephen Hemminger
  0 siblings, 1 reply; 15+ messages in thread
From: Don Wallwork @ 2022-04-26 21:01 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev



On 4/26/2022 10:58 AM, Stephen Hemminger wrote:
> On Tue, 26 Apr 2022 08:19:59 -0400
> Don Wallwork <donw@xsightlabs.com> wrote:
>
>> Add support for using hugepages for worker lcore stack memory.  The
>> intent is to improve performance by reducing stack memory related TLB
>> misses and also by using memory local to the NUMA node of each lcore.
>>
>> Platforms desiring to make use of this capability must enable the
>> associated option flag and stack size settings in platform config
>> files.
>> ---
>>   lib/eal/linux/eal.c | 39 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 39 insertions(+)
>>
> Good idea but having a fixed size stack makes writing complex application
> more difficult. Plus you lose the safety of guard pages.

Thanks for the quick reply.

The expectation is that use of this optional feature would be limited to 
cases where
the performance gains justify the implications of these tradeoffs. For 
example, a specific
data plane application may be okay with limited stack size and could be 
tested to ensure
stack usage remains within limits.

Also, since this applies only to worker threads, the main thread would 
not be impacted
by this change.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 21:01   ` Don Wallwork
@ 2022-04-26 21:21     ` Stephen Hemminger
  2022-04-26 21:25       ` Don Wallwork
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2022-04-26 21:21 UTC (permalink / raw)
  To: Don Wallwork; +Cc: dev

On Tue, 26 Apr 2022 17:01:18 -0400
Don Wallwork <donw@xsightlabs.com> wrote:

> On 4/26/2022 10:58 AM, Stephen Hemminger wrote:
> > On Tue, 26 Apr 2022 08:19:59 -0400
> > Don Wallwork <donw@xsightlabs.com> wrote:
> >  
> >> Add support for using hugepages for worker lcore stack memory.  The
> >> intent is to improve performance by reducing stack memory related TLB
> >> misses and also by using memory local to the NUMA node of each lcore.
> >>
> >> Platforms desiring to make use of this capability must enable the
> >> associated option flag and stack size settings in platform config
> >> files.
> >> ---
> >>   lib/eal/linux/eal.c | 39 +++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 39 insertions(+)
> >>  
> > Good idea but having a fixed size stack makes writing complex application
> > more difficult. Plus you lose the safety of guard pages.  
> 
> Thanks for the quick reply.
> 
> The expectation is that use of this optional feature would be limited to 
> cases where
> the performance gains justify the implications of these tradeoffs. For 
> example, a specific
> data plane application may be okay with limited stack size and could be 
> tested to ensure
> stack usage remains within limits.
> 
> Also, since this applies only to worker threads, the main thread would 
> not be impacted
> by this change.
> 
> 

I would prefer it as a runtime, not compile time option.
That way distributions could ship DPDK and application could opt in if it wanted.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 21:21     ` Stephen Hemminger
@ 2022-04-26 21:25       ` Don Wallwork
  2022-04-27  8:17         ` Morten Brørup
  0 siblings, 1 reply; 15+ messages in thread
From: Don Wallwork @ 2022-04-26 21:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev



On 4/26/2022 5:21 PM, Stephen Hemminger wrote:
> On Tue, 26 Apr 2022 17:01:18 -0400
> Don Wallwork <donw@xsightlabs.com> wrote:
>
>> On 4/26/2022 10:58 AM, Stephen Hemminger wrote:
>>> On Tue, 26 Apr 2022 08:19:59 -0400
>>> Don Wallwork <donw@xsightlabs.com> wrote:
>>>   
>>>> Add support for using hugepages for worker lcore stack memory.  The
>>>> intent is to improve performance by reducing stack memory related TLB
>>>> misses and also by using memory local to the NUMA node of each lcore.
>>>>
>>>> Platforms desiring to make use of this capability must enable the
>>>> associated option flag and stack size settings in platform config
>>>> files.
>>>> ---
>>>>    lib/eal/linux/eal.c | 39 +++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 39 insertions(+)
>>>>   
>>> Good idea but having a fixed size stack makes writing complex application
>>> more difficult. Plus you lose the safety of guard pages.
>> Thanks for the quick reply.
>>
>> The expectation is that use of this optional feature would be limited to
>> cases where
>> the performance gains justify the implications of these tradeoffs. For
>> example, a specific
>> data plane application may be okay with limited stack size and could be
>> tested to ensure
>> stack usage remains within limits.
>>
>> Also, since this applies only to worker threads, the main thread would
>> not be impacted
>> by this change.
>>
>>
> I would prefer it as a runtime, not compile time option.
> That way distributions could ship DPDK and application could opt in if it wanted.
Good point..  I'll work on a v2 and will post that when it's ready.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 21:25       ` Don Wallwork
@ 2022-04-27  8:17         ` Morten Brørup
  2022-04-29 18:52           ` Don Wallwork
  0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-04-27  8:17 UTC (permalink / raw)
  To: Don Wallwork, Stephen Hemminger, Anatoly Burakov, Dmitry Kozlyuk,
	Bruce Richardson
  Cc: dev

+CC: EAL and Memory maintainers.

> From: Don Wallwork [mailto:donw@xsightlabs.com]
> Sent: Tuesday, 26 April 2022 23.26
> 
> On 4/26/2022 5:21 PM, Stephen Hemminger wrote:
> > On Tue, 26 Apr 2022 17:01:18 -0400
> > Don Wallwork <donw@xsightlabs.com> wrote:
> >
> >> On 4/26/2022 10:58 AM, Stephen Hemminger wrote:
> >>> On Tue, 26 Apr 2022 08:19:59 -0400
> >>> Don Wallwork <donw@xsightlabs.com> wrote:
> >>>
> >>>> Add support for using hugepages for worker lcore stack memory.
> The
> >>>> intent is to improve performance by reducing stack memory related
> TLB
> >>>> misses and also by using memory local to the NUMA node of each
> lcore.

This certainly seems like a good idea!

However, I wonder: Does the O/S assign memory local to the NUMA node to an lcore-pinned thread's stack when instantiating the tread? And does the DPDK EAL ensure that the preconditions for the O/S to do that are present?

(Not relevant for this patch, but the same locality questions come to mind regarding Thread Local Storage.)

> >>>>
> >>>> Platforms desiring to make use of this capability must enable the
> >>>> associated option flag and stack size settings in platform config
> >>>> files.
> >>>> ---
> >>>>    lib/eal/linux/eal.c | 39
> +++++++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 39 insertions(+)
> >>>>
> >>> Good idea but having a fixed size stack makes writing complex
> application
> >>> more difficult. Plus you lose the safety of guard pages.

Would it be possible to add a guard page or guard region by using the O/S memory allocator instead of rte_zmalloc_socket()? Since the stack is considered private to the process, i.e. not accessible from other processes, this patch does not need to provide remote access to stack memory from secondary processes - and thus it is not a requirement for this features to use DPDK managed memory.

> >> Thanks for the quick reply.
> >>
> >> The expectation is that use of this optional feature would be
> limited to
> >> cases where
> >> the performance gains justify the implications of these tradeoffs.
> For
> >> example, a specific
> >> data plane application may be okay with limited stack size and could
> be
> >> tested to ensure
> >> stack usage remains within limits.

How to identify the required stack size and verify it... If aiming for small stacks, some instrumentation would be nice, like rte_mempool_audit() and rte_mempool_list_dump().

Alternatively, just assume that the stack is "always big enough", and don't worry about it - like the default O/S stack size. And as Stephen already mentioned: Regardless of stack size, overflowing the stack will cause memory corruption instead of a segmentation fault.

Keep in mind that the required stack size not only depends on the application, but also on DPDK and other libraries being used by the application.

> >>
> >> Also, since this applies only to worker threads, the main thread
> would
> >> not be impacted
> >> by this change.
> >>
> >>
> > I would prefer it as a runtime, not compile time option.
> > That way distributions could ship DPDK and application could opt in
> if it wanted.
> Good point..  I'll work on a v2 and will post that when it's ready.

May I suggest using the stack size configured in the O/S, from pthread_attr_getstacksize() or similar, instead of choosing the stack size manually? If you want it to be configurable, use the default size unless explicitly specified otherwise.

Do the worker threads need a different stack size than the main thread? In my opinion: "Nice to have", not "must have".

Do the worker threads need different stack sizes individually? In my opinion: Perhaps "nice to have", certainly not "must have".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-27  8:17         ` Morten Brørup
@ 2022-04-29 18:52           ` Don Wallwork
  2022-04-29 19:03             ` Stephen Hemminger
  2022-04-30  7:55             ` Morten Brørup
  0 siblings, 2 replies; 15+ messages in thread
From: Don Wallwork @ 2022-04-29 18:52 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger, Anatoly Burakov,
	Dmitry Kozlyuk, Bruce Richardson
  Cc: dev

On 4/27/2022 4:17 AM, Morten Brørup wrote:
> +CC: EAL and Memory maintainers.
>
>> From: Don Wallwork [mailto:donw@xsightlabs.com]
>> Sent: Tuesday, 26 April 2022 23.26
>>
>> On 4/26/2022 5:21 PM, Stephen Hemminger wrote:
>>> On Tue, 26 Apr 2022 17:01:18 -0400
>>> Don Wallwork <donw@xsightlabs.com> wrote:
>>>
>>>> On 4/26/2022 10:58 AM, Stephen Hemminger wrote:
>>>>> On Tue, 26 Apr 2022 08:19:59 -0400
>>>>> Don Wallwork <donw@xsightlabs.com> wrote:
>>>>>
>>>>>> Add support for using hugepages for worker lcore stack memory. The intent is to improve performance by reducing stack memory related TLB misses and also by using memory local to the NUMA node of each lcore.
> This certainly seems like a good idea!
>
> However, I wonder: Does the O/S assign memory local to the NUMA node to an lcore-pinned thread's stack when instantiating the tread? And does the DPDK EAL ensure that the preconditions for the O/S to do that are present?
>
> (Not relevant for this patch, but the same locality questions come to mind regarding Thread Local Storage.)
Currently, DPDK does not set pthread affinity until after the pthread is 
created and the stack has been allocated.  If the affinity attribute 
were set before the pthread_create call, it seems possible that 
pthread_create could be NUMA aware when allocating the stack.  However, 
it looks like at least the glibc v2.35 implementation of pthread_create 
does not consider this at stack allocation time.
>
>>>>>> Platforms desiring to make use of this capability must enable the
>>>>>> associated option flag and stack size settings in platform config
>>>>>> files.
>>>>>> ---
>>>>>>     lib/eal/linux/eal.c | 39
>> +++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 39 insertions(+)
>>>>>>
>>>>> Good idea but having a fixed size stack makes writing complex application more difficult. Plus you lose the safety of guard pages.
> Would it be possible to add a guard page or guard region by using the O/S memory allocator instead of rte_zmalloc_socket()? Since the stack is considered private to the process, i.e. not accessible from other processes, this patch does not need to provide remote access to stack memory from secondary processes - and thus it is not a requirement for this features to use DPDK managed memory.
In order for each stack to have guard page protection, this would likely 
require reserving an entire hugepage per stack.  Although guard pages do 
not require physical memory allocation, it would not be possible for 
multiple stacks to share a hugepage and also have per stack guard page 
protection.
>
>>>> Thanks for the quick reply.
>>>>
>>>> The expectation is that use of this optional feature would be limited to cases where the performance gains justify the implications of these tradeoffs. For example, a specific data plane application may be okay with limited stack size and could be tested to ensure stack usage remains within limits.
> How to identify the required stack size and verify it... If aiming for small stacks, some instrumentation would be nice, like rte_mempool_audit() and rte_mempool_list_dump().
Theoretically, a region of memory following the stack could be populated 
with a poison pattern that could be audited.   Not as robust as hw 
mprotect/MMU, but it could provide some protection.
>
> Alternatively, just assume that the stack is "always big enough", and don't worry about it - like the default O/S stack size. And as Stephen already mentioned: Regardless of stack size, overflowing the stack will cause memory corruption instead of a segmentation fault.
>
> Keep in mind that the required stack size not only depends on the application, but also on DPDK and other libraries being used by the application.
>
>>>> Also, since this applies only to worker threads, the main thread would not be impacted by this change.
>>> I would prefer it as a runtime, not compile time option.
>>> That way distributions could ship DPDK and application could opt in if it wanted.
>> Good point..  I'll work on a v2 and will post that when it's ready.
> May I suggest using the stack size configured in the O/S, from pthread_attr_getstacksize() or similar, instead of choosing the stack size manually? If you want it to be configurable, use the default size unless explicitly specified otherwise.
Yes, that can be handled in EAL args.  I'll include that in the next 
version.
>
> Do the worker threads need a different stack size than the main thread? In my opinion: "Nice to have", not "must have".
The main thread stack behaves differently anyway; it can grow 
dynamically, but regarless of this patch, pthread stack sizes are always 
fixed.   This change only relates to worker threads.
>
> Do the worker threads need different stack sizes individually? In my opinion: Perhaps "nice to have", certainly not "must have".
>
Currently, worker thread stack sizes are uniformly sized and not 
dynamically resized. This patch does not change that aspect.  Given 
that, it seems unnecessary to add that complexity here.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-29 18:52           ` Don Wallwork
@ 2022-04-29 19:03             ` Stephen Hemminger
  2022-05-02 13:15               ` Don Wallwork
  2022-04-30  7:55             ` Morten Brørup
  1 sibling, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2022-04-29 19:03 UTC (permalink / raw)
  To: Don Wallwork
  Cc: Morten Brørup, Anatoly Burakov, Dmitry Kozlyuk,
	Bruce Richardson, dev

On Fri, 29 Apr 2022 14:52:03 -0400
Don Wallwork <donw@xsightlabs.com> wrote:

> >>>>
> >>>> The expectation is that use of this optional feature would be limited to cases where the performance gains justify the implications of these tradeoffs. For example, a specific data plane application may be okay with limited stack size and could be tested to ensure stack usage remains within limits.  
> > How to identify the required stack size and verify it... If aiming for small stacks, some instrumentation would be nice, like rte_mempool_audit() and rte_mempool_list_dump().  
> Theoretically, a region of memory following the stack could be populated 
> with a poison pattern that could be audited.   Not as robust as hw 
> mprotect/MMU, but it could provide some protection.
> >

Usually just doing mmap(.., PROT_NONE) will create a page that will cause SEGV on access
which is what  you want.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-29 19:03             ` Stephen Hemminger
@ 2022-05-02 13:15               ` Don Wallwork
  0 siblings, 0 replies; 15+ messages in thread
From: Don Wallwork @ 2022-05-02 13:15 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Morten Brørup, Anatoly Burakov, Dmitry Kozlyuk,
	Bruce Richardson, dev



On 4/29/2022 3:03 PM, Stephen Hemminger wrote:
> On Fri, 29 Apr 2022 14:52:03 -0400
> Don Wallwork <donw@xsightlabs.com> wrote:
>
>>>>>> The expectation is that use of this optional feature would be limited to cases where the performance gains justify the implications of these tradeoffs. For example, a specific data plane application may be okay with limited stack size and could be tested to ensure stack usage remains within limits.
>>> How to identify the required stack size and verify it... If aiming for small stacks, some instrumentation would be nice, like rte_mempool_audit() and rte_mempool_list_dump().
>> Theoretically, a region of memory following the stack could be populated
>> with a poison pattern that could be audited.   Not as robust as hw
>> mprotect/MMU, but it could provide some protection.
> Usually just doing mmap(.., PROT_NONE) will create a page that will cause SEGV on access
> which is what  you want.
As mentioned elsewhere, the problem with this is we don't want to 
allocate an entire hugepage per stack just to get a guard page.

There is a simple way to verify this.  If the application is run without 
the hugepage stacks option and it does not seg fault, then we know it is 
safe to run the application with the hugepage stacks given the same 
thread stack size.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-29 18:52           ` Don Wallwork
  2022-04-29 19:03             ` Stephen Hemminger
@ 2022-04-30  7:55             ` Morten Brørup
  1 sibling, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2022-04-30  7:55 UTC (permalink / raw)
  To: Don Wallwork, Stephen Hemminger, Anatoly Burakov, Dmitry Kozlyuk,
	Bruce Richardson
  Cc: dev

> From: Don Wallwork [mailto:donw@xsightlabs.com]
> Sent: Friday, 29 April 2022 20.52
> 
> On 4/27/2022 4:17 AM, Morten Brørup wrote:
> > +CC: EAL and Memory maintainers.
> >
> >> From: Don Wallwork [mailto:donw@xsightlabs.com]
> >> Sent: Tuesday, 26 April 2022 23.26
> >>
> >> On 4/26/2022 5:21 PM, Stephen Hemminger wrote:
> >>> On Tue, 26 Apr 2022 17:01:18 -0400
> >>> Don Wallwork <donw@xsightlabs.com> wrote:
> >>>
> >>>> On 4/26/2022 10:58 AM, Stephen Hemminger wrote:
> >>>>> On Tue, 26 Apr 2022 08:19:59 -0400
> >>>>> Don Wallwork <donw@xsightlabs.com> wrote:
> >>>>>
> >>>>>> Add support for using hugepages for worker lcore stack memory.
> The intent is to improve performance by reducing stack memory related
> TLB misses and also by using memory local to the NUMA node of each
> lcore.
> > This certainly seems like a good idea!
> >
> > However, I wonder: Does the O/S assign memory local to the NUMA node
> to an lcore-pinned thread's stack when instantiating the tread? And
> does the DPDK EAL ensure that the preconditions for the O/S to do that
> are present?
> >
> > (Not relevant for this patch, but the same locality questions come to
> mind regarding Thread Local Storage.)
> Currently, DPDK does not set pthread affinity until after the pthread
> is
> created and the stack has been allocated.  If the affinity attribute
> were set before the pthread_create call, it seems possible that
> pthread_create could be NUMA aware when allocating the stack.  However,
> it looks like at least the glibc v2.35 implementation of pthread_create
> does not consider this at stack allocation time.

Thank you for the looking into this! Very interesting.

So, your patch improves the memory locality (and TLB) for the stack, which is great.

The same for Thread Local Storage needs to be addressed by glibc (and the C libraries of other OS'es), which is clearly out of scope here. I searched for RTE_DEFINE_PER_LCORE, and it is only rarely used in DPDK core libraries, so this is not a problem inside DPDK.

> > Would it be possible to add a guard page or guard region by using the
> O/S memory allocator instead of rte_zmalloc_socket()? Since the stack
> is considered private to the process, i.e. not accessible from other
> processes, this patch does not need to provide remote access to stack
> memory from secondary processes - and thus it is not a requirement for
> this features to use DPDK managed memory.
> In order for each stack to have guard page protection, this would
> likely
> require reserving an entire hugepage per stack.  Although guard pages
> do
> not require physical memory allocation, it would not be possible for
> multiple stacks to share a hugepage and also have per stack guard page
> protection.

Makes sense; allocating an entire huge page for stack per worker thread could be considered too much.

> > Do the worker threads need a different stack size than the main
> thread? In my opinion: "Nice to have", not "must have".
> The main thread stack behaves differently anyway; it can grow
> dynamically, but regarless of this patch, pthread stack sizes are
> always
> fixed.   This change only relates to worker threads.

Agree.

> >
> > Do the worker threads need different stack sizes individually? In my
> opinion: Perhaps "nice to have", certainly not "must have".
> >
> Currently, worker thread stack sizes are uniformly sized and not
> dynamically resized. This patch does not change that aspect.  Given
> that, it seems unnecessary to add that complexity here.

Agree.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 12:19 [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory Don Wallwork
  2022-04-26 14:58 ` Stephen Hemminger
@ 2022-04-27  0:42 ` Honnappa Nagarahalli
  2022-04-27 17:50   ` Don Wallwork
  2022-04-29 20:00 ` [RFC v2] " Don Wallwork
  2 siblings, 1 reply; 15+ messages in thread
From: Honnappa Nagarahalli @ 2022-04-27  0:42 UTC (permalink / raw)
  To: Don Wallwork, dev; +Cc: nd, nd

<snip>

> 
> Add support for using hugepages for worker lcore stack memory.  The intent is
> to improve performance by reducing stack memory related TLB misses and also
> by using memory local to the NUMA node of each lcore.
This is a good idea. Have you measured any performance differences with this patch? What kind of benefits do you see?

> 
> Platforms desiring to make use of this capability must enable the associated
> option flag and stack size settings in platform config files.
> ---
>  lib/eal/linux/eal.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c index
> 1ef263434a..4e1e5b6915 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -1143,9 +1143,48 @@ rte_eal_init(int argc, char **argv)
> 
>  		lcore_config[i].state = WAIT;
> 
> +#ifdef RTE_EAL_NUMA_AWARE_LCORE_STACK
> +		/* Allocate NUMA aware stack memory and set pthread
> attributes */
> +		pthread_attr_t attr;
> +		void *stack_ptr =
> +			rte_zmalloc_socket("lcore_stack",
> +
> RTE_EAL_NUMA_AWARE_LCORE_STACK_SIZE,
> +
> RTE_EAL_NUMA_AWARE_LCORE_STACK_SIZE,
> +					   rte_lcore_to_socket_id(i));
> +
> +		if (stack_ptr == NULL) {
> +			rte_eal_init_alert("Cannot allocate stack memory");
May be worth adding more details to the error message, like lcore id.

> +			rte_errno = ENOMEM;
> +			return -1;
> +		}
> +
> +		if (pthread_attr_init(&attr) != 0) {
> +			rte_eal_init_alert("Cannot init pthread attributes");
> +			rte_errno = EINVAL;
EFAULT would be better.

> +			return -1;
> +		}
> +		if (pthread_attr_setstack(&attr,
> +					  stack_ptr,
> +
> RTE_EAL_NUMA_AWARE_LCORE_STACK_SIZE) != 0) {
> +			rte_eal_init_alert("Cannot set pthread stack
> attributes");
> +			rte_errno = ENOTSUP;
EFAULT would be better.

> +			return -1;
> +		}
> +
> +		/* create a thread for each lcore */
> +		ret = pthread_create(&lcore_config[i].thread_id, &attr,
> +				     eal_thread_loop, (void *)(uintptr_t)i);
> +
> +		if (pthread_attr_destroy(&attr) != 0) {
> +			rte_eal_init_alert("Cannot destroy pthread
> attributes");
> +			rte_errno = EFAULT;
> +			return -1;
> +		}
> +#else
>  		/* create a thread for each lcore */
>  		ret = pthread_create(&lcore_config[i].thread_id, NULL,
>  				     eal_thread_loop, (void *)(uintptr_t)i);
> +#endif
>  		if (ret != 0)
>  			rte_panic("Cannot create thread\n");
> 
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-27  0:42 ` Honnappa Nagarahalli
@ 2022-04-27 17:50   ` Don Wallwork
  2022-04-27 19:09     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 15+ messages in thread
From: Don Wallwork @ 2022-04-27 17:50 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev; +Cc: nd

On 4/26/2022 8:42 PM, Honnappa Nagarahalli wrote:
> <snip>
>
>> Add support for using hugepages for worker lcore stack memory.  The intent is
>> to improve performance by reducing stack memory related TLB misses and also
>> by using memory local to the NUMA node of each lcore.
> This is a good idea. Have you measured any performance differences with this patch? What kind of benefits do you see?

The performance benefit is very application and target dependent. For 
example, a trivial test with a small memory/stack footprint would not 
see much benefit from this change.  An application with deep call stacks 
and/or significant stack and table data memory usage would benefit.  
We're not yet prepared to release specific performance delta data, but 
we are currently developing applications that benefit from this feature.

All other comments will be incorporated in the next version.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-27 17:50   ` Don Wallwork
@ 2022-04-27 19:09     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 15+ messages in thread
From: Honnappa Nagarahalli @ 2022-04-27 19:09 UTC (permalink / raw)
  To: Don Wallwork, dev; +Cc: nd, nd

<snip>
> >
> >> Add support for using hugepages for worker lcore stack memory.  The
> >> intent is to improve performance by reducing stack memory related TLB
> >> misses and also by using memory local to the NUMA node of each lcore.
> > This is a good idea. Have you measured any performance differences with this
> patch? What kind of benefits do you see?
> 
> The performance benefit is very application and target dependent. For
> example, a trivial test with a small memory/stack footprint would not see much
> benefit from this change.  An application with deep call stacks and/or
> significant stack and table data memory usage would benefit. We're not yet
> prepared to release specific performance delta data, but we are currently
> developing applications that benefit from this feature.
Ok, thank you. So, you do not see any improvement in L3fwd example app, correct?

> 
> All other comments will be incorporated in the next version.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC v2] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-26 12:19 [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory Don Wallwork
  2022-04-26 14:58 ` Stephen Hemminger
  2022-04-27  0:42 ` Honnappa Nagarahalli
@ 2022-04-29 20:00 ` Don Wallwork
  2022-04-30  7:20   ` Morten Brørup
  2 siblings, 1 reply; 15+ messages in thread
From: Don Wallwork @ 2022-04-29 20:00 UTC (permalink / raw)
  To: dev
  Cc: donw, stephen, mb, anatoly.burakov, dmitry.kozliuk,
	bruce.richardson, Honnappa.Nagarahalli, nd

Add support for using hugepages for worker lcore stack memory.  The
intent is to improve performance by reducing stack memory related TLB
misses and also by using memory local to the NUMA node of each lcore.

EAL option '--huge-worker-stack [stack-size-kbytes]' is added to allow
the feature to be enabled at runtime.  If the size is not specified,
the system pthread stack size will be used.

Signed-off-by: Don Wallwork <donw@xsightlabs.com>
---
 lib/eal/common/eal_common_options.c | 26 ++++++++++++
 lib/eal/common/eal_internal_cfg.h   |  4 ++
 lib/eal/common/eal_options.h        |  2 +
 lib/eal/linux/eal.c                 | 65 ++++++++++++++++++++++++++++-
 4 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index f247a42455..7473ba6969 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -103,6 +103,7 @@ eal_long_options[] = {
 	{OPT_TELEMETRY,         0, NULL, OPT_TELEMETRY_NUM        },
 	{OPT_NO_TELEMETRY,      0, NULL, OPT_NO_TELEMETRY_NUM     },
 	{OPT_FORCE_MAX_SIMD_BITWIDTH, 1, NULL, OPT_FORCE_MAX_SIMD_BITWIDTH_NUM},
+	{OPT_HUGE_WORKER_STACK, 2, NULL, OPT_HUGE_WORKER_STACK_NUM     },
 
 	{0,                     0, NULL, 0                        }
 };
@@ -1618,6 +1619,22 @@ eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
 	return -1;
 }
 
+static int
+eal_parse_huge_worker_stack(const char *arg, size_t *huge_worker_stack_size)
+{
+	size_t worker_stack_size;
+	if (arg == NULL) {
+		*huge_worker_stack_size = USE_OS_STACK_SIZE;
+		return 0;
+	}
+	worker_stack_size = atoi(arg);
+	if (worker_stack_size == 0)
+		return -1;
+
+	*huge_worker_stack_size = worker_stack_size * 1024;
+	return 0;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 			struct internal_config *conf)
@@ -1921,6 +1938,15 @@ eal_parse_common_option(int opt, const char *optarg,
 		}
 		break;
 
+	case OPT_HUGE_WORKER_STACK_NUM:
+		if (eal_parse_huge_worker_stack(optarg,
+						&conf->huge_worker_stack_size) < 0) {
+			RTE_LOG(ERR, EAL, "invalid parameter for --"
+				OPT_HUGE_WORKER_STACK"\n");
+			return -1;
+		}
+		break;
+
 	/* don't know what to do, leave this to caller */
 	default:
 		return 1;
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b71faadd18..8ac91ab3a2 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -48,6 +48,9 @@ struct hugepage_file_discipline {
 	bool unlink_existing;
 };
 
+/** Worker hugepage stack size should default to OS value. */
+#define USE_OS_STACK_SIZE ((size_t)~0)
+
 /**
  * internal configuration
  */
@@ -102,6 +105,7 @@ struct internal_config {
 	unsigned int no_telemetry; /**< true to disable Telemetry */
 	struct simd_bitwidth max_simd_bitwidth;
 	/**< max simd bitwidth path to use */
+	size_t huge_worker_stack_size; /**< worker thread stack size in kbytes */
 };
 
 void eal_reset_internal_config(struct internal_config *internal_cfg);
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 8e4f7202a2..3cc9cb6412 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -87,6 +87,8 @@ enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_HUGE_WORKER_STACK  "huge-worker-stack"
+	OPT_HUGE_WORKER_STACK_NUM,
 
 	OPT_LONG_MAX_NUM
 };
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 1ef263434a..e8c872ef7b 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1144,8 +1144,69 @@ rte_eal_init(int argc, char **argv)
 		lcore_config[i].state = WAIT;
 
 		/* create a thread for each lcore */
-		ret = pthread_create(&lcore_config[i].thread_id, NULL,
-				     eal_thread_loop, (void *)(uintptr_t)i);
+		if (internal_conf->huge_worker_stack_size == 0) {
+			ret = pthread_create(&lcore_config[i].thread_id, NULL,
+					     eal_thread_loop,
+					     (void *)(uintptr_t)i);
+		} else {
+			/* Allocate NUMA aware stack memory and set
+			 * pthread attributes
+			 */
+			pthread_attr_t attr;
+			size_t stack_size;
+			void *stack_ptr;
+
+			if (pthread_attr_init(&attr) != 0) {
+				rte_eal_init_alert("Cannot init pthread "
+						   "attributes");
+				rte_errno = EFAULT;
+				return -1;
+			}
+			if (internal_conf->huge_worker_stack_size ==
+			    USE_OS_STACK_SIZE) {
+				if (pthread_attr_getstacksize(&attr,
+							      &stack_size) != 0) {
+					rte_errno = EFAULT;
+					return -1;
+				}
+			} else {
+				stack_size =
+					internal_conf->huge_worker_stack_size;
+			}
+			stack_ptr =
+				rte_zmalloc_socket("lcore_stack",
+						   stack_size,
+						   stack_size,
+						   rte_lcore_to_socket_id(i));
+
+			if (stack_ptr == NULL) {
+				rte_eal_init_alert("Cannot allocate stack "
+						   "memory for worker lcore");
+				rte_errno = ENOMEM;
+				return -1;
+			}
+
+			if (pthread_attr_setstack(&attr,
+						  stack_ptr,
+						  stack_size) != 0) {
+				rte_eal_init_alert("Cannot set pthread "
+						   "stack attributes");
+				rte_errno = EFAULT;
+				return -1;
+			}
+
+			/* create a thread for each lcore */
+			ret = pthread_create(&lcore_config[i].thread_id, &attr,
+					     eal_thread_loop,
+					     (void *)(uintptr_t)i);
+
+			if (pthread_attr_destroy(&attr) != 0) {
+				rte_eal_init_alert("Cannot destroy pthread "
+						   "attributes");
+				rte_errno = EFAULT;
+				return -1;
+			}
+		}
 		if (ret != 0)
 			rte_panic("Cannot create thread\n");
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC v2] eal: allow worker lcore stacks to be allocated from hugepage memory
  2022-04-29 20:00 ` [RFC v2] " Don Wallwork
@ 2022-04-30  7:20   ` Morten Brørup
  0 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2022-04-30  7:20 UTC (permalink / raw)
  To: Don Wallwork, dev
  Cc: stephen, anatoly.burakov, dmitry.kozliuk, bruce.richardson,
	Honnappa.Nagarahalli, nd

> From: Don Wallwork [mailto:donw@xsightlabs.com]
> Sent: Friday, 29 April 2022 22.01
> 
> Add support for using hugepages for worker lcore stack memory.  The
> intent is to improve performance by reducing stack memory related TLB
> misses and also by using memory local to the NUMA node of each lcore.
> 
> EAL option '--huge-worker-stack [stack-size-kbytes]' is added to allow
> the feature to be enabled at runtime.  If the size is not specified,
> the system pthread stack size will be used.

It would be nice if DPDK EAL could parse size parameter values provided as "1M" or "128k"; but it is clearly not a requirement for this patch. Just mentioning it.

> 
> Signed-off-by: Don Wallwork <donw@xsightlabs.com>
> ---

>  /**
>   * internal configuration
>   */
> @@ -102,6 +105,7 @@ struct internal_config {
>  	unsigned int no_telemetry; /**< true to disable Telemetry */
>  	struct simd_bitwidth max_simd_bitwidth;
>  	/**< max simd bitwidth path to use */
> +	size_t huge_worker_stack_size; /**< worker thread stack size in
> kbytes */

The command line parameter value has been converted from kbytes to bytes here, so this comment is wrong.

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-05-02 13:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-26 12:19 [RFC] eal: allow worker lcore stacks to be allocated from hugepage memory Don Wallwork
2022-04-26 14:58 ` Stephen Hemminger
2022-04-26 21:01   ` Don Wallwork
2022-04-26 21:21     ` Stephen Hemminger
2022-04-26 21:25       ` Don Wallwork
2022-04-27  8:17         ` Morten Brørup
2022-04-29 18:52           ` Don Wallwork
2022-04-29 19:03             ` Stephen Hemminger
2022-05-02 13:15               ` Don Wallwork
2022-04-30  7:55             ` Morten Brørup
2022-04-27  0:42 ` Honnappa Nagarahalli
2022-04-27 17:50   ` Don Wallwork
2022-04-27 19:09     ` Honnappa Nagarahalli
2022-04-29 20:00 ` [RFC v2] " Don Wallwork
2022-04-30  7:20   ` Morten Brørup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).