* core performance
[not found] <595544330.11681349.1727123476579.ref@mail.yahoo.com>
@ 2024-09-23 20:31 ` amit sehas
2024-09-30 15:57 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-23 20:31 UTC (permalink / raw)
To: users
We are seeing different DPDK threads (launched via rte_eal_remote_launch()) demonstrate very different performance.
After placing counters all over the code, we realize that some threads are uniformly slow; in other words, there is no application-level issue that is throttling one thread over the other. We come to the conclusion that either the cores on which they are running are not at the same frequency (which seems doubtful) or the threads are not getting a chance to execute on the cores uniformly.
It seems that isolcpus has been deprecated in recent versions of Linux.
What is the recommended approach to prevent the kernel from utilizing some CPU threads for anything other than the threads that are launched on them?
Is there some API in DPDK which also helps us determine which CPU core a thread is pinned to?
I did not find any code in DPDK which actually performs pinning of a thread to a CPU core.
In our case it is more or less certain that the different threads are simply not getting the same CPU core time; as a result, some are demonstrating higher throughput than others ...
How do we fix this?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-23 20:31 ` core performance amit sehas
@ 2024-09-30 15:57 ` Stephen Hemminger
2024-09-30 17:27 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2024-09-30 15:57 UTC (permalink / raw)
To: amit sehas; +Cc: users
On Mon, 23 Sep 2024 20:31:16 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> We are seeing different DPDK threads (launched via rte_eal_remote_launch()) demonstrate very different performance.
Are the DPDK threads running on isolated cpus?
Are the DPDK threads doing any system calls (use strace to check)?
>
> After placing counters all over the code, we realize that some threads are uniformly slow; in other words, there is no application-level issue that is throttling one thread over the other. We come to the conclusion that either the cores on which they are running are not at the same frequency (which seems doubtful) or the threads are not getting a chance to execute on the cores uniformly.
>
> It seems that isolcpus has been deprecated in recent versions of Linux.
>
> What is the recommended approach to prevent the kernel from utilizing some CPU threads for anything other than the threads that are launched on them?
On modern Linux systems, CPU isolation can be achieved with cgroups.
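A rough sketch of that with cgroup v2 (assuming the unified hierarchy is mounted at /sys/fs/cgroup, a kernel new enough for cpuset partitions, root access, and illustrative CPU numbers):

    # enable the cpuset controller for children, then carve out CPUs 2-7
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    mkdir /sys/fs/cgroup/dpdk
    echo 2-7 > /sys/fs/cgroup/dpdk/cpuset.cpus
    echo isolated > /sys/fs/cgroup/dpdk/cpuset.cpus.partition
    # move the DPDK process in; <pid> is your application's PID
    echo <pid> > /sys/fs/cgroup/dpdk/cgroup.procs

On systemd-managed hosts you may need to do the equivalent through a slice with AllowedCPUs= instead, so treat the above as a sketch, not a recipe.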
>
> Is there some API in DPDK which also helps us determine which CPU core a thread is pinned to?
> I did not find any code in DPDK which actually performs pinning of a thread to a CPU core.
It is here, in lib/eal/linux/eal.c:
/* Launch threads, called at application init(). */
int
rte_eal_init(int argc, char **argv)
{
    ...
    RTE_LCORE_FOREACH_WORKER(i) {
        ...
        ret = rte_thread_set_affinity_by_id(lcore_config[i].thread_id,
                                            &lcore_config[i].cpuset);
        if (ret != 0)
            rte_panic("Cannot set affinity\n");
    }
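Each worker thus inherits the cpuset of its lcore. A minimal sketch of how a worker can report its own pinning from inside the launched function (assuming a Linux target built the usual DPDK way; error handling omitted):

    #include <sched.h>   /* CPU_ISSET/CPU_SETSIZE; needs _GNU_SOURCE,
                          * which DPDK's build already defines */
    #include <stdio.h>
    #include <rte_lcore.h>

    /* run on each worker via rte_eal_remote_launch(report_affinity, NULL, i) */
    static int
    report_affinity(void *arg)
    {
        (void)arg;
        unsigned int lcore = rte_lcore_id();
        rte_cpuset_t set;
        int cpu;

        rte_thread_get_affinity(&set);  /* cpuset of the calling thread */
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf("lcore %u may run on cpu %d\n", lcore, cpu);
        /* first CPU in this lcore's cpuset */
        printf("lcore %u -> cpu %d\n", lcore, rte_lcore_to_cpu_id((int)lcore));
        return 0;
    }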
>
> In our case it is more or less certain that the different threads are simply not getting the same CPU core time; as a result, some are demonstrating higher throughput than others ...
>
> How do we fix this?
Did you get profiling info? I would start by getting a flame graph using perf.
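For example (a sketch: assumes perf plus Brendan Gregg's FlameGraph scripts on PATH, and <pid> is the application's PID):

    perf record -F 99 -g -p <pid> -- sleep 30   # sample all threads for 30s
    perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
    perf sched record -p <pid> -- sleep 10      # optional: scheduler view
    perf sched latency                          # per-task delays and switches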
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-30 15:57 ` Stephen Hemminger
@ 2024-09-30 17:27 ` Stephen Hemminger
2024-09-30 17:31 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2024-09-30 17:27 UTC (permalink / raw)
To: amit sehas; +Cc: users
On Mon, 30 Sep 2024 08:57:26 -0700
Stephen Hemminger <stephen@networkplumber.org> wrote:
> > After placing counters all over the code, we realize that some threads are uniformly slow; in other words, there is no application-level issue that is throttling one thread over the other. We come to the conclusion that either the cores on which they are running are not at the same frequency (which seems doubtful) or the threads are not getting a chance to execute on the cores uniformly.
> >
> > It seems that isolcpus has been deprecated in recent versions of Linux.
> >
> > What is the recommended approach to prevent the kernel from utilizing some CPU threads for anything other than the threads that are launched on them?
>
> On modern Linux systems, CPU isolation can be achieved with cgroups.
Did you check out the links in the core isolation section of the docs?
https://doc.dpdk.org/guides/linux_gsg/enable_func.html
https://www.suse.com/c/cpu-isolation-practical-example-part-5/
https://www.rcannings.com/systemd-core-isolation/
There is also a much more complex and detailed script that is part of
the open-source project DanOS, here:
https://github.com/danos/vyatta-cpu-shield/blob/master/usr/bin/cpu_shield
If you really want isolated CPUs, you have to do some more complex stuff to make
sure interrupts etc. don't run on those CPUs. Also, never use CPU 0.
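For instance, a rough sketch of steering interrupts away from the isolated CPUs (the masks are hex CPU bitmaps; 0x3 keeps IRQs on CPUs 0-1, and some IRQs cannot be retargeted, hence the error suppression):

    echo 3 > /proc/irq/default_smp_affinity
    for irq in /proc/irq/[0-9]*; do
        echo 3 > "$irq/smp_affinity" 2>/dev/null
    done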
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-30 17:27 ` Stephen Hemminger
@ 2024-09-30 17:31 ` amit sehas
0 siblings, 0 replies; 19+ messages in thread
From: amit sehas @ 2024-09-30 17:31 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: users
Thanks so much for the suggestions, I will definitely look at them.
regards
On Monday, September 30, 2024 at 10:27:58 AM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Mon, 30 Sep 2024 08:57:26 -0700
Stephen Hemminger <stephen@networkplumber.org> wrote:
> > After placing counters all over the code, we realize that some threads are uniformly slow; in other words, there is no application-level issue that is throttling one thread over the other. We come to the conclusion that either the cores on which they are running are not at the same frequency (which seems doubtful) or the threads are not getting a chance to execute on the cores uniformly.
> >
> > It seems that isolcpus has been deprecated in recent versions of Linux.
> >
> > What is the recommended approach to prevent the kernel from utilizing some CPU threads for anything other than the threads that are launched on them?
>
> On modern Linux systems, CPU isolation can be achieved with cgroups.
Did you check out the links in the core isolation section of the docs?
https://doc.dpdk.org/guides/linux_gsg/enable_func.html
https://www.suse.com/c/cpu-isolation-practical-example-part-5/
https://www.rcannings.com/systemd-core-isolation/
There is also a much more complex and detailed script that is part of
the open-source project DanOS, here:
https://github.com/danos/vyatta-cpu-shield/blob/master/usr/bin/cpu_shield
If you really want isolated CPUs, you have to do some more complex stuff to make
sure interrupts etc. don't run on those CPUs. Also, never use CPU 0.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-27 3:13 ` amit sehas
@ 2024-09-27 3:23 ` amit sehas
0 siblings, 0 replies; 19+ messages in thread
From: amit sehas @ 2024-09-27 3:23 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Nishant Verma, users
For a weaker AWS instance this is what we find with cpu_layout.py:
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================
cores = [0, 1]
sockets = [0]
Socket 0
--------
Core 0 [0, 2]
Core 1 [1, 3]
So from this we deduce that logical cores 0 and 2 are on the same physical core,
and logical cores 1 and 3 are on the other physical core; learnt something valuable today ...
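A quick way to cross-check that mapping (a sketch; lscpu is from util-linux):

    lscpu -e=CPU,CORE,SOCKET   # one row per logical CPU, with its core id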
regards
On Thursday, September 26, 2024 at 08:13:48 PM PDT, amit sehas <cun23@yahoo.com> wrote:
Thanks for the suggestion, I didn't even know about cpu_layout.py ... I will definitely try it ... I made some more measurements, and so far this is the hypothesis:
1) 8 hyperthreads are not the same as 8 CPUs; the scale-up is not linear.
2) the vCPU cache allocation per logical CPU thread is also important; if 2 threads are running the same code on the same physical core but on 2 different logical cores, then we will not have cores competing with each other.
3) try to run dissimilar code on the logical cores that run on the same physical core ...
the CPU map is definitely worth figuring out ...
regards
On Thursday, September 26, 2024 at 08:03:30 PM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Thu, 26 Sep 2024 17:03:17 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> If there is a way to determine:
>
> vCPU thread utilization numbers over a period of time, such as a few hours
>
> or which processes are consuming the most CPU
>
> top always indicates that the server is consuming the most CPU.
>
> Now I am beginning to wonder whether 8 vCPU threads really are capable of running 6 high-intensity threads, or only 4 such threads? Don't know.
>
> Also tried to utilize pthread_setschedparam() explicitly on some of the threads; it made no difference to the performance. But if we do it on more than 1-2 threads, then it hangs the whole system.
>
> This is primarily a matter of CPU scheduling, and if we restrict context switching on even 2 critical threads we have a win.
>
>
Some other recommendations:
- avoid CPU 0; you can't isolate it, and it has other stuff that has to run there.
If you have a main thread that sleeps, and worker threads that poll, then
go ahead and put main on CPU 0.
- don't put two active polling cores on a shared hyper-thread.
You can use DPDK's cpu_layout.py script to show this.
For example:
$ ./usertools/cpu_layout.py
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================
cores = [0, 1, 2, 3]
sockets = [0]
Socket 0
--------
Core 0 [0, 4]
Core 1 [1, 5]
Core 2 [2, 6]
Core 3 [3, 7]
On this system, don't poll on cores 0 and 4 (system activity).
Use lcores 1, 2, 3.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-27 3:03 ` Stephen Hemminger
@ 2024-09-27 3:13 ` amit sehas
2024-09-27 3:23 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-27 3:13 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Nishant Verma, users
Thanks for the suggestion, I didn't even know about cpu_layout.py ... I will definitely try it ... I made some more measurements, and so far this is the hypothesis:
1) 8 hyperthreads are not the same as 8 CPUs; the scale-up is not linear.
2) the vCPU cache allocation per logical CPU thread is also important; if 2 threads are running the same code on the same physical core but on 2 different logical cores, then we will not have cores competing with each other.
3) try to run dissimilar code on the logical cores that run on the same physical core ...
the CPU map is definitely worth figuring out ...
regards
On Thursday, September 26, 2024 at 08:03:30 PM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Thu, 26 Sep 2024 17:03:17 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> If there is a way to determine:
>
> vCPU thread utilization numbers over a period of time, such as a few hours
>
> or which processes are consuming the most CPU
>
> top always indicates that the server is consuming the most CPU.
>
> Now I am beginning to wonder whether 8 vCPU threads really are capable of running 6 high-intensity threads, or only 4 such threads? Don't know.
>
> Also tried to utilize pthread_setschedparam() explicitly on some of the threads; it made no difference to the performance. But if we do it on more than 1-2 threads, then it hangs the whole system.
>
> This is primarily a matter of CPU scheduling, and if we restrict context switching on even 2 critical threads we have a win.
>
>
Some other recommendations:
- avoid CPU 0; you can't isolate it, and it has other stuff that has to run there.
If you have a main thread that sleeps, and worker threads that poll, then
go ahead and put main on CPU 0.
- don't put two active polling cores on a shared hyper-thread.
You can use DPDK's cpu_layout.py script to show this.
For example:
$ ./usertools/cpu_layout.py
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================
cores = [0, 1, 2, 3]
sockets = [0]
Socket 0
--------
Core 0 [0, 4]
Core 1 [1, 5]
Core 2 [2, 6]
Core 3 [3, 7]
On this system, don't poll on cores 0 and 4 (system activity).
Use lcores 1, 2, 3.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-26 17:03 ` amit sehas
@ 2024-09-27 3:03 ` Stephen Hemminger
2024-09-27 3:13 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2024-09-27 3:03 UTC (permalink / raw)
To: amit sehas; +Cc: Nishant Verma, users
On Thu, 26 Sep 2024 17:03:17 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> If there is a way to determine:
>
> vCPU thread utilization numbers over a period of time, such as a few hours
>
> or which processes are consuming the most CPU
>
> top always indicates that the server is consuming the most CPU.
>
> Now I am beginning to wonder whether 8 vCPU threads really are capable of running 6 high-intensity threads, or only 4 such threads? Don't know.
>
> Also tried to utilize pthread_setschedparam() explicitly on some of the threads; it made no difference to the performance. But if we do it on more than 1-2 threads, then it hangs the whole system.
>
> This is primarily a matter of CPU scheduling, and if we restrict context switching on even 2 critical threads we have a win.
>
>
Some other recommendations:
- avoid CPU 0; you can't isolate it, and it has other stuff that has to run there.
If you have a main thread that sleeps, and worker threads that poll, then
go ahead and put main on CPU 0.
- don't put two active polling cores on a shared hyper-thread.
You can use DPDK's cpu_layout.py script to show this.
For example:
$ ./usertools/cpu_layout.py
======================================================================
Core and Socket Information (as reported by '/sys/devices/system/cpu')
======================================================================
cores = [0, 1, 2, 3]
sockets = [0]
Socket 0
--------
Core 0 [0, 4]
Core 1 [1, 5]
Core 2 [2, 6]
Core 3 [3, 7]
On this system, don't poll on cores 0 and 4 (system activity).
Use lcores 1, 2, 3.
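A sketch of the matching EAL command line on this box (application name hypothetical):

    # sleeping main thread on lcore 1, polling workers on lcores 2 and 3
    ./your-app -l 1,2,3 --main-lcore 1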
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-26 16:56 ` amit sehas
@ 2024-09-26 17:03 ` amit sehas
2024-09-27 3:03 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-26 17:03 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Nishant Verma, users
If there is a way to determine:
vCPU thread utilization numbers over a period of time, such as a few hours
or which processes are consuming the most CPU
top always indicates that the server is consuming the most CPU.
Now I am beginning to wonder whether 8 vCPU threads really are capable of running 6 high-intensity threads, or only 4 such threads? Don't know.
Also tried to utilize pthread_setschedparam() explicitly on some of the threads; it made no difference to the performance. But if we do it on more than 1-2 threads, then it hangs the whole system.
This is primarily a matter of CPU scheduling, and if we restrict context switching on even 2 critical threads we have a win.
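One way to get exactly those numbers (a sketch; pidstat comes with the sysstat package, and <pid>/<tid> stand for the server's process and thread IDs):

    pidstat -u -t -p <pid> 5    # per-thread CPU utilization every 5s
    pidstat -w -t -p <pid> 5    # per-thread context switches every 5s
    grep ctxt /proc/<pid>/task/<tid>/status   # cumulative switch counters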
regards
On Thursday, September 26, 2024 at 09:56:04 AM PDT, amit sehas <cun23@yahoo.com> wrote:
Below is the lscpu output that was requested; it appears to suggest an 8 vCPU thread setup ... if I am reading it correctly:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
BogoMIPS: 4999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 4 MiB (4 instances)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Unknown: Dependent on hypervisor status
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Reg file data sampling: Not affected
Retbleed: Vulnerable
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Srbds: Not affected
Tsx async abort: Not affected
On Thursday, September 26, 2024 at 05:32:52 AM PDT, amit sehas <cun23@yahoo.com> wrote:
Simply reordering the launch of different threads brings back a lot of the lost performance; this is clear evidence that some CPU threads are more predisposed to context switches than others.
This is a thread scheduling issue at the CPU level, as we expected. In a previous exchange someone suggested that utilizing rte_thread_set_priority with RTE_THREAD_PRIORITY_REALTIME_CRITICAL is not a good idea.
We should be able to prioritize some threads over the other threads ... since we are utilizing rte_eal_remote_launch, one would think that such functionality should be part of the library ...
Any ideas, folks?
regards
On Tuesday, September 24, 2024 at 01:47:05 PM PDT, amit sehas <cun23@yahoo.com> wrote:
Thanks for the suggestions, so this is a database server which is doing lots of stuff; not every thread is heavily involved in DPDK packet processing. As a result, the guidelines for attaining the most DPDK performance are applicable to only a few threads.
In this particular issue we are specifically looking at CPU scheduling of threads that are primarily heavily processing database queries. These threads, from our measurements, are not being uniformly scheduled on the CPU ...
This is our primary concern; since we utilized rte_eal_remote_launch to spawn the threads, we are wondering if there are any options in this API that will allow us to more uniformly allocate the CPU to the threads that are critical ...
regards
On Tuesday, September 24, 2024 at 09:38:16 AM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Tue, 24 Sep 2024 14:40:49 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> Thanks for your response, and thanks for your input on the set_priority,
>
> The best guess we have at this point is that this is not a DPDK performance issue. This is an issue with some threads running into more context switches than the others and hence not getting the same slice of the CPU. We are certain that this is not a DPDK performance issue; the code
> is uniformly slow in one thread versus the other, and the threads are doing a very large amount of work, including accessing databases. The threads in question are not really doing packet processing as much as other work.
>
> So this is certainly not a DPDK performance issue. This is an issue of kernel threads not being scheduled properly or, in the worst case, the cores running at different frequencies (which is quite unlikely on the AWS Xeons we are running this on).
>
> If you are asking for the DPDK config files to check for a DPDK-related performance issue, then we are quite certain the issue is not with DPDK performance ...
> On Mon, Sep 23, 2024 at 10:06 PM amit sehas <cun23@yahoo.com> wrote:
> > Thanks for your response, I am not sure I understand your question ... we have our product that utilizes DPDK ... the commands are just our server commands and parameters ... and the lscpu is the hyperthreaded 8-thread Xeon instance in AWS ...
The rules of getting performance in DPDK:
- use DPDK threads (pinned) for datapath
- use isolated CPUs for those DPDK threads
- do not do any system calls
- avoid floating point
You can use tracing tools like strace or BPF to see what the thread is doing.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-26 12:32 ` amit sehas
@ 2024-09-26 16:56 ` amit sehas
2024-09-26 17:03 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-26 16:56 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Nishant Verma, users
Below is the lscpu output that was requested; it appears to suggest an 8 vCPU thread setup ... if I am reading it correctly:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
BogoMIPS: 4999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 4 MiB (4 instances)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Unknown: Dependent on hypervisor status
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Reg file data sampling: Not affected
Retbleed: Vulnerable
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Srbds: Not affected
Tsx async abort: Not affected
On Thursday, September 26, 2024 at 05:32:52 AM PDT, amit sehas <cun23@yahoo.com> wrote:
Simply reordering the launch of different threads brings back a lot of the lost performance; this is clear evidence that some CPU threads are more predisposed to context switches than others.
This is a thread scheduling issue at the CPU level, as we expected. In a previous exchange someone suggested that utilizing rte_thread_set_priority with RTE_THREAD_PRIORITY_REALTIME_CRITICAL is not a good idea.
We should be able to prioritize some threads over the other threads ... since we are utilizing rte_eal_remote_launch, one would think that such functionality should be part of the library ...
Any ideas, folks?
regards
On Tuesday, September 24, 2024 at 01:47:05 PM PDT, amit sehas <cun23@yahoo.com> wrote:
Thanks for the suggestions, so this is a database server which is doing lots of stuff; not every thread is heavily involved in DPDK packet processing. As a result, the guidelines for attaining the most DPDK performance are applicable to only a few threads.
In this particular issue we are specifically looking at CPU scheduling of threads that are primarily heavily processing database queries. These threads, from our measurements, are not being uniformly scheduled on the CPU ...
This is our primary concern; since we utilized rte_eal_remote_launch to spawn the threads, we are wondering if there are any options in this API that will allow us to more uniformly allocate the CPU to the threads that are critical ...
regards
On Tuesday, September 24, 2024 at 09:38:16 AM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Tue, 24 Sep 2024 14:40:49 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> Thanks for your response, and thanks for your input on the set_priority,
>
> The best guess we have at this point is that this is not a DPDK performance issue. This is an issue with some threads running into more context switches than the others and hence not getting the same slice of the CPU. We are certain that this is not a DPDK performance issue; the code
> is uniformly slow in one thread versus the other, and the threads are doing a very large amount of work, including accessing databases. The threads in question are not really doing packet processing as much as other work.
>
> So this is certainly not a DPDK performance issue. This is an issue of kernel threads not being scheduled properly or, in the worst case, the cores running at different frequencies (which is quite unlikely on the AWS Xeons we are running this on).
>
> If you are asking for the DPDK config files to check for a DPDK-related performance issue, then we are quite certain the issue is not with DPDK performance ...
> On Mon, Sep 23, 2024 at 10:06 PM amit sehas <cun23@yahoo.com> wrote:
> > Thanks for your response, I am not sure I understand your question ... we have our product that utilizes DPDK ... the commands are just our server commands and parameters ... and the lscpu is the hyperthreaded 8-thread Xeon instance in AWS ...
The rules of getting performance in DPDK:
- use DPDK threads (pinned) for datapath
- use isolated CPUs for those DPDK threads
- do not do any system calls
- avoid floating point
You can use tracing tools like strace or BPF to see what the thread is doing.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-24 20:47 ` amit sehas
@ 2024-09-26 12:32 ` amit sehas
2024-09-26 16:56 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-26 12:32 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Nishant Verma, users
Simply reordering the launch of different threads brings back a lot of the lost performance; this is clear evidence that some CPU threads are more predisposed to context switches than others.
This is a thread scheduling issue at the CPU level, as we expected. In a previous exchange someone suggested that utilizing rte_thread_set_priority with RTE_THREAD_PRIORITY_REALTIME_CRITICAL is not a good idea.
We should be able to prioritize some threads over the other threads ... since we are utilizing rte_eal_remote_launch, one would think that such functionality should be part of the library ...
Any ideas, folks?
regards
On Tuesday, September 24, 2024 at 01:47:05 PM PDT, amit sehas <cun23@yahoo.com> wrote:
Thanks for the suggestions, so this is a database server which is doing lots of stuff; not every thread is heavily involved in DPDK packet processing. As a result, the guidelines for attaining the most DPDK performance are applicable to only a few threads.
In this particular issue we are specifically looking at CPU scheduling of threads that are primarily heavily processing database queries. These threads, from our measurements, are not being uniformly scheduled on the CPU ...
This is our primary concern; since we utilized rte_eal_remote_launch to spawn the threads, we are wondering if there are any options in this API that will allow us to more uniformly allocate the CPU to the threads that are critical ...
regards
On Tuesday, September 24, 2024 at 09:38:16 AM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Tue, 24 Sep 2024 14:40:49 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> Thanks for your response, and thanks for your input on the set_priority,
>
> The best guess we have at this point is that this is not a DPDK performance issue. This is an issue with some threads running into more context switches than the others and hence not getting the same slice of the CPU. We are certain that this is not a DPDK performance issue; the code
> is uniformly slow in one thread versus the other, and the threads are doing a very large amount of work, including accessing databases. The threads in question are not really doing packet processing as much as other work.
>
> So this is certainly not a DPDK performance issue. This is an issue of kernel threads not being scheduled properly or, in the worst case, the cores running at different frequencies (which is quite unlikely on the AWS Xeons we are running this on).
>
> If you are asking for the DPDK config files to check for a DPDK-related performance issue, then we are quite certain the issue is not with DPDK performance ...
> On Mon, Sep 23, 2024 at 10:06 PM amit sehas <cun23@yahoo.com> wrote:
> > Thanks for your response, I am not sure I understand your question ... we have our product that utilizes DPDK ... the commands are just our server commands and parameters ... and the lscpu is the hyperthreaded 8-thread Xeon instance in AWS ...
The rules of getting performance in DPDK:
- use DPDK threads (pinned) for datapath
- use isolated CPUs for those DPDK threads
- do not do any system calls
- avoid floating point
You can use tracing tools like strace or BPF to see what the thread is doing.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-24 16:38 ` Stephen Hemminger
@ 2024-09-24 20:47 ` amit sehas
2024-09-26 12:32 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-24 20:47 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Nishant Verma, users
Thanks for the suggestions, so this is a database server which is doing lots of stuff; not every thread is heavily involved in DPDK packet processing. As a result, the guidelines for attaining the most DPDK performance are applicable to only a few threads.
In this particular issue we are specifically looking at CPU scheduling of threads that are primarily heavily processing database queries. These threads, from our measurements, are not being uniformly scheduled on the CPU ...
This is our primary concern; since we utilized rte_eal_remote_launch to spawn the threads, we are wondering if there are any options in this API that will allow us to more uniformly allocate the CPU to the threads that are critical ...
regards
On Tuesday, September 24, 2024 at 09:38:16 AM PDT, Stephen Hemminger <stephen@networkplumber.org> wrote:
On Tue, 24 Sep 2024 14:40:49 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> Thanks for your response, and thanks for your input on the set_priority,
>
> The best guess we have at this point is that this is not a DPDK performance issue. This is an issue with some threads running into more context switches than the others and hence not getting the same slice of the CPU. We are certain that this is not a DPDK performance issue; the code
> is uniformly slow in one thread versus the other, and the threads are doing a very large amount of work, including accessing databases. The threads in question are not really doing packet processing as much as other work.
>
> So this is certainly not a DPDK performance issue. This is an issue of kernel threads not being scheduled properly or, in the worst case, the cores running at different frequencies (which is quite unlikely on the AWS Xeons we are running this on).
>
> If you are asking for the DPDK config files to check for a DPDK-related performance issue, then we are quite certain the issue is not with DPDK performance ...
> On Mon, Sep 23, 2024 at 10:06 PM amit sehas <cun23@yahoo.com> wrote:
> > Thanks for your response, I am not sure I understand your question ... we have our product that utilizes DPDK ... the commands are just our server commands and parameters ... and the lscpu is the hyperthreaded 8-thread Xeon instance in AWS ...
The rules of getting performance in DPDK:
- use DPDK threads (pinned) for datapath
- use isolated CPUs for those DPDK threads
- do not do any system calls
- avoid floating point
You can use tracing tools like strace or BPF to see what the thread is doing.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-24 14:40 ` amit sehas
@ 2024-09-24 16:38 ` Stephen Hemminger
2024-09-24 20:47 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2024-09-24 16:38 UTC (permalink / raw)
To: amit sehas; +Cc: Nishant Verma, users
On Tue, 24 Sep 2024 14:40:49 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> Thanks for your response, and thanks for your input on the set_priority,
>
> The best guess we have at this point is that this is not a DPDK performance issue. This is an issue with some threads running into more context switches than the others and hence not getting the same slice of the CPU. We are certain that this is not a DPDK performance issue; the code
> is uniformly slow in one thread versus the other, and the threads are doing a very large amount of work, including accessing databases. The threads in question are not really doing packet processing as much as other work.
>
> So this is certainly not a DPDK performance issue. This is an issue of kernel threads not being scheduled properly or, in the worst case, the cores running at different frequencies (which is quite unlikely on the AWS Xeons we are running this on).
>
> If you are asking for the DPDK config files to check for a DPDK-related performance issue, then we are quite certain the issue is not with DPDK performance ...
> On Mon, Sep 23, 2024 at 10:06 PM amit sehas <cun23@yahoo.com> wrote:
> > Thanks for your response, I am not sure I understand your question ... we have our product that utilizes DPDK ... the commands are just our server commands and parameters ... and the lscpu is the hyperthreaded 8-thread Xeon instance in AWS ...
The rules of getting performance in DPDK:
- use DPDK threads (pinned) for datapath
- use isolated CPUs for those DPDK threads
- do not do any system calls
- avoid floating point
You can use tracing tools like strace or BPF to see what the thread is doing.
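For example (a sketch; attach only briefly, since ptrace stops slow the target, and <pid>/<tid> are placeholders):

    strace -c -f -p <pid>           # Ctrl-C after a while: per-syscall counts
    strace -f -p <tid> -o w.trace   # raw syscall log for one worker thread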
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
[not found] ` <CAHhCjUFjqobchJ79z0BLLRXrLZdb2QyVPM6fbji6T7jpiKLa2Q@mail.gmail.com>
@ 2024-09-24 14:40 ` amit sehas
2024-09-24 16:38 ` Stephen Hemminger
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-24 14:40 UTC (permalink / raw)
To: Nishant Verma, users
Thanks for your response, and thanks for your input on the set_priority.
The best guess we have at this point is that this is not a DPDK performance issue. This is an issue with some threads running into more context switches than the others and hence not getting the same slice of the CPU. We are certain that this is not a DPDK performance issue; the code
is uniformly slow in one thread versus the other, and the threads are doing a very large amount of work, including accessing databases. The threads in question are not really doing packet processing as much as other work.
So this is certainly not a DPDK performance issue. This is an issue of kernel threads not being scheduled properly or, in the worst case, the cores running at different frequencies (which is quite unlikely on the AWS Xeons we are running this on).
If you are asking for the DPDK config files to check for a DPDK-related performance issue, then we are quite certain the issue is not with DPDK performance ...
regards
On Tuesday, September 24, 2024 at 06:23:39 AM PDT, Nishant Verma <vnish11@gmail.com> wrote:
I assume you are using some variant of Linux, so execute the command "lscpu" and provide the output.
Also share the command or config file that tells the application which cores to use and how many memory channels and ports.
Thanks.
Regards,
Nishant Verma
On Mon, Sep 23, 2024 at 10:06 PM amit sehas <cun23@yahoo.com> wrote:
> Thanks for your response, I am not sure I understand your question ... we have our product that utilizes DPDK ... the commands are just our server commands and parameters ... and the lscpu is the hyperthreaded 8-thread Xeon instance in AWS ...
>
> regards
>
>
>
>
>
>
> On Monday, September 23, 2024 at 06:14:16 PM PDT, Nishant Verma <vnish11@gmail.com> wrote:
>
>
>
>
>
> Can you share the output of lscpu and the command you are using to execute the app?
>
> .
>
> Regards,
> Nishant Verma
>
>
> On Mon, Sep 23, 2024 at 7:17 PM amit sehas <cun23@yahoo.com> wrote:
>> Thanks for the responses, this is on AWS, which is utilizing Xeon with hyperthreading. Not utilizing hyperthreading is not an option.
>>
>> After trying a few things, I am narrowing down on the following approach:
>>
>> Only for the critical threads, we could utilize rte_thread_set_priority() with RTE_THREAD_PRIORITY_REALTIME_CRITICAL.
>>
>> However, this API requires an rte_thread_t parameter; if we utilize rte_eal_remote_launch(), we are not provided with this parameter.
>> I am searching through the code to see if there is an API where I can obtain the rte_thread_t for the current thread that was launched with rte_eal_remote_launch().
>>
>> regards
>>
>>
>>
>>
>>
>>
>> On Monday, September 23, 2024 at 03:18:11 PM PDT, Nishant Verma <vnish11@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Also make sure all the cores you are using are physical cores, not logical cores.
>> Secondly, check your core isolation options and apply them accordingly.
>>
>>
>> .
>>
>> Regards,
>> Nishant Verma
>>
>>
>> On Mon, Sep 23, 2024 at 6:04 PM Wisam Jaddo <wisamm@nvidia.com> wrote:
>>> Hello Amit,
>>>
>>>> -----Original Message-----
>>>> From: amit sehas <cun23@yahoo.com>
>>>> Sent: Monday, September 23, 2024 11:57 PM
>>>> To: users@dpdk.org
>>>> Subject: core performance
>>>>
>>>> We are seeing different DPDK threads (launched via rte_eal_remote_launch())
>>>> demonstrate very different performance.
>>>>
>>>> After placing counters all over the code, we realize that some threads are
>>>> uniformly slow; in other words, there is no application-level issue that is
>>>> throttling one thread over the other. We come to the conclusion that either
>>>> the cores on which they are running are not at the same frequency (which
>>>> seems doubtful) or the threads are not getting a chance to execute on the
>>>> cores uniformly.
>>>>
>>>>
>>>>
>>>> It seems that isolcpus has been deprecated in recent versions of Linux.
>>>>
>>>> What is the recommended approach to prevent the kernel from utilizing some
>>>> CPU threads for anything other than the threads that are launched on them?
>>>
>>> If you are wishing to run each thread on a separate core, try to use rte_eal_mp_remote_launch()
>>> instead of rte_eal_remote_launch(), make sure that your CPU is isolated, and you are passing the correct
>>> cores that were isolated to your app using -c, -l.
>>>
>>>
>>>>
>>>>
>>>>
>>>> Is there some API in DPDK which also helps us determine which CPU core a
>>>> thread is pinned to?
>>>>
>>>> I did not find any code in DPDK which actually performs pinning of a
>>>> thread to a CPU core.
>>>>
>>>>
>>>>
>>>> In our case it is more or less certain that the different threads are
>>>> simply not getting the same CPU core time; as a result, some are
>>>> demonstrating higher throughput than others ...
>>>>
>>>>
>>>>
>>>> How do we fix this?
>>>
>>> BRs,
>>> Wisam Jaddo
>>>
>>
>>
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-23 23:17 ` amit sehas
2024-09-24 1:14 ` Nishant Verma
@ 2024-09-24 13:25 ` Stephen Hemminger
1 sibling, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2024-09-24 13:25 UTC (permalink / raw)
To: amit sehas; +Cc: Wisam Jaddo, Nishant Verma, users
On Mon, 23 Sep 2024 23:17:54 +0000 (UTC)
amit sehas <cun23@yahoo.com> wrote:
> Only for the critical threads, we could utilize rte_thread_set_priority() with RTE_THREAD_PRIORITY_REALTIME_CRITICAL.
Really bad idea on Linux.
Realtime priority will cause kernel starvation.
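If you must experiment anyway, a hedged sketch with chrt (util-linux) that at least inspects before changing anything; the starvation risk remains:

    chrt -p <tid>        # show the thread's current policy and priority
    chrt -r -p 1 <tid>   # SCHED_RR at the lowest realtime priority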
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-23 23:17 ` amit sehas
@ 2024-09-24 1:14 ` Nishant Verma
[not found] ` <2025533199.11789856.1727143607670@mail.yahoo.com>
2024-09-24 13:25 ` Stephen Hemminger
1 sibling, 1 reply; 19+ messages in thread
From: Nishant Verma @ 2024-09-24 1:14 UTC (permalink / raw)
To: amit sehas; +Cc: Wisam Jaddo, users
Can you share the output of lscpu and the command you are using to execute the app?
.
Regards,
Nishant Verma
On Mon, Sep 23, 2024 at 7:17 PM amit sehas <cun23@yahoo.com> wrote:
> Thanks for the responses, this is on AWS, which is utilizing Xeon with
> hyperthreading. Not utilizing hyperthreading is not an option.
>
> After trying a few things, I am narrowing down on the following approach:
>
> Only for the critical threads, we could utilize rte_thread_set_priority()
> with RTE_THREAD_PRIORITY_REALTIME_CRITICAL.
>
> However, this API requires an rte_thread_t parameter; if we utilize
> rte_eal_remote_launch(), we are not provided with this parameter.
> I am searching through the code to see if there is an API where I can
> obtain the rte_thread_t for the current thread that was launched with
> rte_eal_remote_launch().
>
> regards
>
>
>
>
>
>
> On Monday, September 23, 2024 at 03:18:11 PM PDT, Nishant Verma <
> vnish11@gmail.com> wrote:
>
>
>
>
>
> Also make sure all the cores you are using are physical cores, not logical
> cores.
> Secondly, check your core isolation options and apply them accordingly.
>
>
> .
>
> Regards,
> Nishant Verma
>
>
> On Mon, Sep 23, 2024 at 6:04 PM Wisam Jaddo <wisamm@nvidia.com> wrote:
> > Hello Amit,
> >
> >> -----Original Message-----
> >> From: amit sehas <cun23@yahoo.com>
> >> Sent: Monday, September 23, 2024 11:57 PM
> >> To: users@dpdk.org
> >> Subject: core performance
> >>
> >> We are seeing different DPDK threads (launched via rte_eal_remote_launch())
> >> demonstrate very different performance.
> >>
> >> After placing counters all over the code, we realize that some threads are
> >> uniformly slow; in other words, there is no application-level issue that is
> >> throttling one thread over the other. We come to the conclusion that either
> >> the cores on which they are running are not at the same frequency (which
> >> seems doubtful) or the threads are not getting a chance to execute on the
> >> cores uniformly.
> >>
> >>
> >>
> >> It seems that isolcpus has been deprecated in recent versions of Linux.
> >>
> >> What is the recommended approach to prevent the kernel from utilizing some
> >> CPU threads for anything other than the threads that are launched on them?
> >
> > If you are wishing to run each thread on a separate core, try to use
> > rte_eal_mp_remote_launch() instead of rte_eal_remote_launch(), make sure
> > that your CPU is isolated, and you are passing the correct isolated cores
> > to your app using -c, -l.
> >
> >
> >>
> >>
> >>
> >> Is there some API in DPDK which also helps us determine which CPU core a
> >> thread is pinned to?
> >>
> >> I did not find any code in DPDK which actually performs pinning of a
> >> thread to a CPU core.
> >>
> >>
> >>
> >> In our case it is more or less certain that the different threads are
> >> simply not getting the same CPU core time; as a result, some are
> >> demonstrating higher throughput than others ...
> >>
> >>
> >>
> >> How do we fix this?
> >
> > BRs,
> > Wisam Jaddo
> >
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-23 22:17 ` Nishant Verma
@ 2024-09-23 23:17 ` amit sehas
2024-09-24 1:14 ` Nishant Verma
2024-09-24 13:25 ` Stephen Hemminger
0 siblings, 2 replies; 19+ messages in thread
From: amit sehas @ 2024-09-23 23:17 UTC (permalink / raw)
To: Wisam Jaddo, Nishant Verma; +Cc: users
Thanks for the responses, this is on AWS, which is utilizing Xeon with hyperthreading. Not utilizing hyperthreading is not an option.
After trying a few things, I am narrowing down on the following approach:
Only for the critical threads, we could utilize rte_thread_set_priority() with RTE_THREAD_PRIORITY_REALTIME_CRITICAL.
However, this API requires an rte_thread_t parameter; if we utilize rte_eal_remote_launch(), we are not provided with this parameter.
I am searching through the code to see if there is an API where I can obtain the rte_thread_t for the current thread that was launched with rte_eal_remote_launch().
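A minimal sketch of one way around that, assuming a DPDK recent enough to have rte_thread_self(): the launched function looks up its own handle. (Per the warning elsewhere in this thread, treat realtime priority on Linux as a last resort.)

    #include <stdio.h>
    #include <rte_thread.h>

    /* worker body passed to rte_eal_remote_launch() */
    static int
    critical_worker(void *arg)
    {
        (void)arg;
        rte_thread_t self = rte_thread_self();  /* handle of this thread */

        /* hypothetical policy: raise priority for this worker only */
        if (rte_thread_set_priority(self,
                RTE_THREAD_PRIORITY_REALTIME_CRITICAL) != 0)
            printf("could not raise priority; staying at normal priority\n");

        /* ... worker loop ... */
        return 0;
    }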
regards
On Monday, September 23, 2024 at 03:18:11 PM PDT, Nishant Verma <vnish11@gmail.com> wrote:
Also make sure all the cores you are using are physical cores, not logical cores.
Secondly, check your core isolation options and apply them accordingly.
.
Regards,
Nishant Verma
On Mon, Sep 23, 2024 at 6:04 PM Wisam Jaddo <wisamm@nvidia.com> wrote:
> Hello Amit,
>
>> -----Original Message-----
>> From: amit sehas <cun23@yahoo.com>
>> Sent: Monday, September 23, 2024 11:57 PM
>> To: users@dpdk.org
>> Subject: core performance
>>
>> We are seeing different DPDK threads (launched via rte_eal_remote_launch())
>> demonstrate very different performance.
>>
>> After placing counters all over the code, we realize that some threads are
>> uniformly slow; in other words, there is no application-level issue that is
>> throttling one thread over the other. We come to the conclusion that either
>> the cores on which they are running are not at the same frequency (which
>> seems doubtful) or the threads are not getting a chance to execute on the
>> cores uniformly.
>>
>>
>>
>> It seems that isolcpus has been deprecated in recent versions of Linux.
>>
>> What is the recommended approach to prevent the kernel from utilizing some
>> CPU threads for anything other than the threads that are launched on them?
>
> If you are wishing to run each thread on a separate core, try to use rte_eal_mp_remote_launch()
> instead of rte_eal_remote_launch(), make sure that your CPU is isolated, and you are passing the correct
> cores that were isolated to your app using -c, -l.
>
>
>>
>>
>>
>> Is there some API in DPDK which also helps us determine which CPU core a
>> thread is pinned to?
>>
>> I did not find any code in DPDK which actually performs pinning of a thread to
>> a CPU core.
>>
>>
>>
>> In our case it is more or less certain that the different threads are simply not
>> getting the same CPU core time; as a result, some are demonstrating higher
>> throughput than others ...
>>
>>
>>
>> How do we fix this?
>
> BRs,
> Wisam Jaddo
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: core performance
2024-09-23 21:56 ` Wisam Jaddo
@ 2024-09-23 22:17 ` Nishant Verma
2024-09-23 23:17 ` amit sehas
0 siblings, 1 reply; 19+ messages in thread
From: Nishant Verma @ 2024-09-23 22:17 UTC (permalink / raw)
To: Wisam Jaddo; +Cc: amit sehas, users
Also make sure all the cores you are using are physical cores, not logical
cores.
Secondly, check your core isolation options and apply them accordingly.
.
Regards,
Nishant Verma
On Mon, Sep 23, 2024 at 6:04 PM Wisam Jaddo <wisamm@nvidia.com> wrote:
> Hello Amit,
>
> > -----Original Message-----
> > From: amit sehas <cun23@yahoo.com>
> > Sent: Monday, September 23, 2024 11:57 PM
> > To: users@dpdk.org
> > Subject: core performance
> >
> > We are seeing different DPDK threads (launched via rte_eal_remote_launch())
> > demonstrate very different performance.
> >
> > After placing counters all over the code, we realize that some threads are
> > uniformly slow; in other words, there is no application-level issue that is
> > throttling one thread over the other. We come to the conclusion that either
> > the cores on which they are running are not at the same frequency (which
> > seems doubtful) or the threads are not getting a chance to execute on the
> > cores uniformly.
> >
> >
> >
> > It seems that isolcpus has been deprecated in recent versions of Linux.
> >
> > What is the recommended approach to prevent the kernel from utilizing some
> > CPU threads for anything other than the threads that are launched on them?
>
> If you are wishing to run each thread on a separate core, try to use
> rte_eal_mp_remote_launch() instead of rte_eal_remote_launch(), make sure
> that your CPU is isolated, and you are passing the correct isolated cores
> to your app using -c, -l.
>
>
> >
> >
> >
> > Is there some API in DPDK which also helps us determine which CPU core a
> > thread is pinned to?
> >
> > I did not find any code in DPDK which actually performs pinning of a
> > thread to a CPU core.
> >
> >
> >
> > In our case it is more or less certain that the different threads are
> > simply not getting the same CPU core time; as a result, some are
> > demonstrating higher throughput than others ...
> >
> >
> >
> > How do we fix this?
>
> BRs,
> Wisam Jaddo
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: core performance
2024-09-23 20:56 ` amit sehas
@ 2024-09-23 21:56 ` Wisam Jaddo
2024-09-23 22:17 ` Nishant Verma
0 siblings, 1 reply; 19+ messages in thread
From: Wisam Jaddo @ 2024-09-23 21:56 UTC (permalink / raw)
To: amit sehas, users
Hello Amit,
> -----Original Message-----
> From: amit sehas <cun23@yahoo.com>
> Sent: Monday, September 23, 2024 11:57 PM
> To: users@dpdk.org
> Subject: core performance
>
> We are seeing different DPDK threads (launched via rte_eal_remote_launch())
> demonstrate very different performance.
>
> After placing counters all over the code, we realize that some threads are
> uniformly slow; in other words, there is no application-level issue that is
> throttling one thread over the other. We come to the conclusion that either
> the cores on which they are running are not at the same frequency (which
> seems doubtful) or the threads are not getting a chance to execute on the
> cores uniformly.
>
>
>
> It seems that isolcpus has been deprecated in recent versions of Linux.
>
> What is the recommended approach to prevent the kernel from utilizing some
> CPU threads for anything other than the threads that are launched on them?
If you are wishing to run each thread on a separate core, try to use rte_eal_mp_remote_launch()
instead of rte_eal_remote_launch(), make sure that your CPU is isolated, and you are passing the correct
cores that were isolated to your app using -c, -l.
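A minimal sketch of that pattern (assuming DPDK 20.11+ naming; worker_main is a placeholder for the application's loop, and error handling is trimmed):

    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_debug.h>
    #include <rte_launch.h>
    #include <rte_lcore.h>

    static int
    worker_main(void *arg)
    {
        (void)arg;
        printf("worker on lcore %u\n", rte_lcore_id());
        /* ... per-core work loop ... */
        return 0;
    }

    int
    main(int argc, char **argv)
    {
        /* e.g. started as: ./app -l 1,2,3 with cores 1-3 isolated */
        if (rte_eal_init(argc, argv) < 0)
            rte_panic("cannot init EAL\n");

        /* run worker_main on every worker lcore; skip the main lcore */
        rte_eal_mp_remote_launch(worker_main, NULL, SKIP_MAIN);
        rte_eal_mp_wait_lcore();
        return rte_eal_cleanup();
    }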
>
>
>
> Is there some API in DPDK which also helps us determine which CPU core a
> thread is pinned to?
>
> I did not find any code in DPDK which actually performs pinning of a
> thread to a CPU core.
>
>
>
> In our case it is more or less certain that the different threads are
> simply not getting the same CPU core time; as a result, some are
> demonstrating higher throughput than others ...
>
>
>
> How do we fix this?
BRs,
Wisam Jaddo
^ permalink raw reply [flat|nested] 19+ messages in thread
* core performance
[not found] <1987164393.11670398.1727125003663.ref@mail.yahoo.com>
@ 2024-09-23 20:56 ` amit sehas
2024-09-23 21:56 ` Wisam Jaddo
0 siblings, 1 reply; 19+ messages in thread
From: amit sehas @ 2024-09-23 20:56 UTC (permalink / raw)
To: users
We are seeing different DPDK threads (launched via rte_eal_remote_launch()) demonstrate very different performance.
After placing counters all over the code, we realize that some threads are uniformly slow; in other words, there is no application-level issue that is throttling one thread over the other. We come to the conclusion that either the cores on which they are running are not at the same frequency (which seems doubtful) or the threads are not getting a chance to execute on the cores uniformly.
It seems that isolcpus has been deprecated in recent versions of Linux.
What is the recommended approach to prevent the kernel from utilizing some CPU threads for anything other than the threads that are launched on them?
Is there some API in DPDK which also helps us determine which CPU core a thread is pinned to?
I did not find any code in DPDK which actually performs pinning of a thread to a CPU core.
In our case it is more or less certain that the different threads are simply not getting the same CPU core time; as a result, some are demonstrating higher throughput than others ...
How do we fix this?
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-09-30 17:31 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <595544330.11681349.1727123476579.ref@mail.yahoo.com>
2024-09-23 20:31 ` core performance amit sehas
2024-09-30 15:57 ` Stephen Hemminger
2024-09-30 17:27 ` Stephen Hemminger
2024-09-30 17:31 ` amit sehas
[not found] <1987164393.11670398.1727125003663.ref@mail.yahoo.com>
2024-09-23 20:56 ` amit sehas
2024-09-23 21:56 ` Wisam Jaddo
2024-09-23 22:17 ` Nishant Verma
2024-09-23 23:17 ` amit sehas
2024-09-24 1:14 ` Nishant Verma
[not found] ` <2025533199.11789856.1727143607670@mail.yahoo.com>
[not found] ` <CAHhCjUFjqobchJ79z0BLLRXrLZdb2QyVPM6fbji6T7jpiKLa2Q@mail.gmail.com>
2024-09-24 14:40 ` amit sehas
2024-09-24 16:38 ` Stephen Hemminger
2024-09-24 20:47 ` amit sehas
2024-09-26 12:32 ` amit sehas
2024-09-26 16:56 ` amit sehas
2024-09-26 17:03 ` amit sehas
2024-09-27 3:03 ` Stephen Hemminger
2024-09-27 3:13 ` amit sehas
2024-09-27 3:23 ` amit sehas
2024-09-24 13:25 ` Stephen Hemminger