DPDK patches and discussions
* [dpdk-dev] Performance impact of "declaring" more CPU cores
@ 2019-10-24 17:32 Tom Barbette
  2019-10-25 17:35 ` David Christensen
From: Tom Barbette @ 2019-10-24 17:32 UTC (permalink / raw)
  To: dev

Hi all,

We're experiencing a very strange problem. The code of our application
is modified to always use 8 cores. However, when running with "-l
0-MAXCPU", with MAXCPU varying between 7 and 15 (therefore "allocating"
more *unused* cores), the performance of the application changes. The
difference can be multiple Gbps at very high speed (100G), and it is
not linear: declaring more cores does not consistently increase or
degrade performance.
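
Concretely, the only thing that changes between runs is the -l range
(binary name hypothetical):

    ./app -l 0-7  ...  # 8 lcores declared, 8 used by the application
    ./app -l 0-15 ...  # 16 lcores declared, still only 8 used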

An example can be seen here:
https://kth.box.com/s/v5v1hyidd51ebd7b8ixmcw513lfqwj4a . Note that the
error bars (10 runs per point) show it is not "global" variance that
creates the difference between core counts. Once you end up in a
"class" of performance, you stay there even when changing some
parameters (such as the number of cores you actually use). From a
research perspective this is problematic, as we cannot trust that
results with X cores are indeed a certain percentage better than with
Y cores: the difference could be due to this problem, not to what we
"improved" in the application.

The application is a NAT and a firewall, but we could observe this with
other VNFs, at different scales. The link is saturated at 100G with
real packets (average size 1040B).

We have Mellanox ConnectX-5 NICs and DPDK 19.02. This could be observed
on both Skylake and Cascade Lake machines. The example is a
single-socket machine with 8 cores and HT, but this also happened on a
NUMA machine, and when varying only among the 18 cores of a single
socket.

The only useful observation we made is that when we are in a "bad case", 
the LLC has more cache misses.

We could not really reproduce this with an XL710, but at 40G the
problem is barely visible even with the Mellanox NICs, so that path is
not very conclusive. We do not have other 100G NICs (hurry up
Intel! :p). Similarly, reproducing with testpmd is hard, as the problem
really shows with something like 8 cores, and we need the load to be
less than the NIC wire speed, but high enough to observe the effect...

To rule out that problem, we hardcoded the number of cores to be used 
everywhere inside the application. So except for DPDK internals, all 
resources are allocated for 8 cores.

And that is still the best lead. Could it be that we simply get
lucky/unlucky buffer allocations for "something per lcore", so that
packets evict each other because of the limited associativity of the
cache? At those speeds, failure of DDIO is death...
Is there a way to fix/verify the memory allocation? --base-virtaddr
did not help, though. ASLR is disabled (enabling it does not change
anything).
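
For what it's worth, here is a rough, untested sketch of what I mean by
"verifying" the allocation: dump the addresses of the pool's objects
together with an approximate LLC set index, and look for uneven set
usage. The real Intel LLC slice hash is undocumented, so this is a
heuristic at best; the pool name and set count below are made up:

    #include <stdio.h>
    #include <stdint.h>
    #include <rte_mempool.h>

    #define LINE_SHIFT 6    /* 64-byte cache lines */
    #define NUM_SETS 2048   /* made-up value: adjust to your LLC */

    /* Callback for rte_mempool_obj_iter(): print each object's address
     * and the cache set it would map to under a simple indexing model. */
    static void
    dump_obj(struct rte_mempool *mp, void *opaque, void *obj, unsigned idx)
    {
        (void)mp; (void)opaque;
        uintptr_t a = (uintptr_t)obj;
        printf("obj %u: %p -> set %lu\n", idx, obj,
               (unsigned long)((a >> LINE_SHIFT) & (NUM_SETS - 1)));
    }

    /* After pool creation ("mbuf_pool" is a made-up name):
     * rte_mempool_obj_iter(rte_mempool_lookup("mbuf_pool"),
     *                      dump_obj, NULL);
     */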


Thanks for the help,

Tom


* Re: [dpdk-dev] Performance impact of "declaring" more CPU cores
  2019-10-24 17:32 [dpdk-dev] Performance impact of "declaring" more CPU cores Tom Barbette
@ 2019-10-25 17:35 ` David Christensen
  2019-10-30 17:20   ` Tom Barbette
From: David Christensen @ 2019-10-25 17:35 UTC (permalink / raw)
  To: Tom Barbette, dev

> The only useful observation we made is that when we are in a "bad case", 
> the LLC has more cache misses.

Have you looked closely at the CPU topology on your platform? Can you
provide some examples here of what you're seeing?  The hwloc package is
very useful in visualizing how your logical cores map to CPU cache.
There may be benefit in more strategically selecting the lcores you use
to reduce LLC cache misses.
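
For instance, an (untested) hwloc 2.x sketch like the following prints
which L3 each hardware thread sits under; lstopo gives you the same
information graphically:

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* For each PU (hardware thread), find its L3 ancestor. */
        int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
        for (int i = 0; i < n; i++) {
            hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
            hwloc_obj_t l3 =
                hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_L3CACHE, pu);
            printf("PU P#%u -> L3 #%u\n", pu->os_index,
                   l3 ? l3->logical_index : 0);
        }
        hwloc_topology_destroy(topo);
        return 0;
    }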

Dave


* Re: [dpdk-dev] Performance impact of "declaring" more CPU cores
  2019-10-25 17:35 ` David Christensen
@ 2019-10-30 17:20   ` Tom Barbette
From: Tom Barbette @ 2019-10-30 17:20 UTC (permalink / raw)
  To: David Christensen, dev

Thanks for your comment. The raw number of cache misses is simply
higher in almost every function. While hwloc is indeed useful, the
assignment is exactly the same in all cases. What we do is declare
more or fewer *unused* cores with "-l". The used ones run in the same
place, with the same resources and the same configuration.

The only thing that may change is what DPDK does with those unused
cores: e.g., allocating more unused per-core caches, and thereby
shifting the allocation of other caches, buffers, etc. for the used
cores, leading to more "unlucky" alignment and more contention.
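
For instance, the per-lcore caches of the mbuf pool are exactly the
kind of per-core resource I have in mind. A minimal sketch of the knob
we could vary to test this (pool name and sizes made up):

    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* cache_size sizes the per-lcore object cache of the mempool;
     * 0 disables the per-lcore caches entirely, which is one way to
     * test whether they are involved. All numbers here are made up. */
    static struct rte_mempool *
    make_pool_no_cache(void)
    {
        return rte_pktmbuf_pool_create("test_pool", /* made-up name */
                                       65535,       /* number of mbufs */
                                       0,           /* per-lcore cache size */
                                       0,           /* private data size */
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       rte_socket_id());
    }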

I'm trying to reproduce this with the smallest possible modification of
testpmd, so that other people can observe it.

Thanks,

Tom

