Date: Wed, 27 Jan 2016 12:30:23 -0500
From: Neil Horman
To: "Tan, Jianfeng"
Cc: dev@dpdk.org, yuanhan.liu@intel.com
Subject: Re: [dpdk-dev] [RFC] eal: add cgroup-aware resource self discovery
Message-ID: <20160127173022.GA14166@hmsreliant.think-freely.org>
In-Reply-To: <56A8B1D3.7000201@intel.com>
References: <1453661393-85704-1-git-send-email-jianfeng.tan@intel.com>
 <20160125134636.GA29690@hmsreliant.think-freely.org>
 <56A6D85A.6030400@intel.com>
 <20160126141907.GA20685@hmsreliant.think-freely.org>
 <56A8B1D3.7000201@intel.com>

On Wed, Jan 27, 2016 at 08:02:27PM +0800, Tan, Jianfeng wrote:
> Hi Neil,
>
> On 1/26/2016 10:19 PM, Neil Horman wrote:
> >On Tue, Jan 26, 2016 at 10:22:18AM +0800, Tan, Jianfeng wrote:
> >>Hi Neil,
> >>
> >>On 1/25/2016 9:46 PM, Neil Horman wrote:
> >>>On Mon, Jan 25, 2016 at 02:49:53AM +0800, Jianfeng Tan wrote:
> >>...
> >>>>--
> >>>>2.1.4
> >>>>
> >>>>
> >>>This doesn't make a whole lot of sense, for several reasons:
> >>>
> >>>1) Applications, as a general rule, shouldn't be interrogating the
> >>>cgroups interface at all.
> >>The main reason to do this in DPDK is that DPDK obtains resource
> >>information from sysfs and proc, which are not well containerized so
> >>far. And DPDK pre-allocates resources instead of allocating them
> >>gradually on demand.
> >>
> >Not disagreeing with this, just suggesting that:
> >
> >1) Interrogating cgroups really isn't the best way to collect that
> >information.
> >2) Pre-allocating those resources isn't particularly wise without some
> >mechanism to reallocate them, as resource constraints can change
> >(consider your cpuset getting rewritten).
>
> In the case of reallocation:
> For cpuset, DPDK panics during initialization if set_affinity fails, but
> after that, I believe a cpuset rewrite will not cause any problem.

Yes, that seems reasonable, but I think you need to update
rte_thread_set_affinity to not assume that success from
pthread_setaffinity_np means that all cpus in the provided mask are
available. That is to say, cpusetp is subsequently stored in the lcore
information after the set, but it may not reflect the actual working set
of processors. You should follow a successful set with a call to
pthread_getaffinity_np to retrieve the actual working cpuset.

As for subsequent changes to the cpuset, I'm not sure how you want to
handle that. I would think you might want to run a check periodically, or
allow a SIGHUP or some other signal to trigger a rescan of your working
cpuset, so as to keep the application in sync with the system.
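
Roughly the pattern I have in mind, as an untested sketch rather than a
patch against the real rte_thread_set_affinity (the helper name here is
made up):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Request the cpus in 'wanted', then read back what the thread really
 * got.  'effective' can be a subset of 'wanted' if cgroups, isolcpus or
 * a parent taskset have already restricted us.  Hypothetical helper,
 * not eal code. */
static int set_and_verify_affinity(pthread_t tid, const cpu_set_t *wanted,
				   cpu_set_t *effective)
{
	if (pthread_setaffinity_np(tid, sizeof(*wanted), wanted) != 0)
		return -1;
	/* success above does not mean every cpu in 'wanted' is usable;
	 * re-read the mask so the lcore bookkeeping reflects reality */
	if (pthread_getaffinity_np(tid, sizeof(*effective), effective) != 0)
		return -1;
	return 0;
}

The lcore config would then store 'effective' rather than the mask the
caller asked for.
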
> For memory, a running application uses 2G of hugepages; then the admin
> decreases the hugetlb cgroup limit to 1G. The application will not get
> killed, unless it tries to access more hugepages (I'll double check
> this).
>
No, the semantics should be identical to malloc/mmap (if you use the
alloc_hugepages api or the mmap api). You should get a NULL return or
some other non-fatal indicator if you allocate more than is available.

> So another way to address this problem is to add an option so that DPDK
> tries its best to allocate those resources, and if that fails, it just
> posts a warning and uses the resources it did get, instead of
> panicking. What do you think?
>
Yes, that makes sense (see the rough sketch at the end of this mail).

>
> >
> >>>2) Cgroups aren't the only way in which a cpuset or memoryset can be
> >>>restricted (the isolcpus command line argument, or a taskset on a
> >>>parent process for instance, but there are several others).
> >>Yes, I agree. To enable that, I'd like to design the new API for
> >>resource self discovery in a flexible way. A parameter "type" is used
> >>to specify the discovery method. In addition, I'm considering adding
> >>a callback function pointer so that users can write their own
> >>resource discovery functions.
> >>
> >Why? You don't need an API for this, or if you really want one, it can
> >be very generic if you use POSIX apis to gather the information. What
> >you have here is going to be very linux specific, and will need
> >reimplementing for BSD or other operating systems. To use the cpuset
> >example, instead of reading and parsing the mask files in the cgroup
> >filesystem to find your task and its corresponding mask, just call
> >sched_setaffinity with an all f's mask, then call sched_getaffinity.
> >The returned mask will be all the cpus your process is allowed to
> >execute on, taking into account every limiting filter the system you
> >are running on offers.

> Yes, it makes sense on the cpu side.
>
> >
> >There are similar OS-level POSIX apis for most resources out there.
> >You really don't need to dig through cgroups just to learn what some
> >of those resources are.
> >
> >>>Instead of trying to figure out what cpuset is valid for your
> >>>process by interrogating the cgroups hierarchy, you should instead
> >>>follow the prescribed method of calling sched_getaffinity after
> >>>calling sched_setaffinity. That will give you the canonical cpuset
> >>>that you are executing on, taking all cpuset filters into account
> >>>(including cgroups and any other restrictions). It's far simpler as
> >>>well, as it doesn't require a ton of file/string processing.
> >>Yes, this way is much better for cpuset discovery. But is there such
> >>a syscall for hugepages?
> >>
> >In what capacity? Interrogating how many hugepages you have, or which
> >node they are affined to? Capacity would require reading the requisite
> >proc file, as there's no posix api for this resource. Node affinity
> >can be implied by setting the numa policy of the dpdk process and then
> >writing to /proc/nr_hugepages, as the kernel will attempt to
> >distribute hugepages evenly among the nodes allowed by the task's numa
> >policy configuration.

> For memory affinity, I believe the existing way of reading
> /proc/self/pagemap already handles the problem. What I was asking is
> how much memory (or hugepages in Linux's case) can be used. By the way,
> what is /proc/nr_hugepages?
>
For affinity, you can parse /proc/self/pagemap or any number of other
procfiles, but again, doing so is going to be very OS specific, and
doesn't get you much in terms of resource management.
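
(For reference, the pagemap lookup mentioned above boils down to something
like the sketch below -- simplified and Linux-specific, and note that
newer kernels hide the frame number from unprivileged readers:)

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Translate a virtual address to a physical one via /proc/self/pagemap.
 * Simplified version of the kind of lookup dpdk does to learn the
 * physical layout of its hugepages; returns 0 on any failure. */
static uint64_t virt2phys(const void *va)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	off_t off = ((uintptr_t)va / pgsz) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
		entry = 0;
	close(fd);
	if (!(entry & (1ULL << 63)))	/* page not present */
		return 0;
	/* bits 0-54 hold the page frame number */
	return (entry & ((1ULL << 55) - 1)) * pgsz + ((uintptr_t)va % pgsz);
}
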
/proc/self/pagemap only tells you where the pages reside now.
/proc/nr_hugepages is the proc tunable that lets you allocate/reallocate
hugepages.

> >
> >That said, I would advise that you strongly consider not exporting
> >hugepages as a resource, as:
> >
> >a) Applications generally don't need to know that they are using
> >hugepages, and so they don't need to know where said hugepages live;
> >they just allocate memory via your allocation api and you give them
> >something appropriate.

> But the allocation api provider, the DPDK library, needs to know
> whether it's using hugepages or not.
>
Right, but your purpose was to expose this library to applications. I'm
saying you really don't need to expose such a library API to
applications. If you just want to use it internally in dpdk, that's fine.

> >b) Hugepages are a resource that is very specific to Linux, and to X86
> >Linux at that. Some OSes implement similar resources, but they may
> >have very different semantics. And other arches may or may not
> >implement various forms of compound paging at all. As the DPDK expands
> >to support more OSes and arches, it would be nice to ensure that the
> >programming surfaces you expose have a broader level of support.

> That's why I put the current implementation in
> lib/librte_eal/linuxapp/. And the new API uses the terms cores and
> memory, which are very generic IMO. In Linux's context, memory is
> interpreted as hugepages (maybe not strictly correct, because DPDK can
> also be used with 4K memory). For other OSes, we could add similar
> limitations in their semantics.
>
> Thanks,
> Jianfeng
>
> >
> >Neil
> >
> >>Thanks,
> >>Jianfeng
> >>
> >>>Neil
> >>>
> >>
> >
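
P.S. To put the "warn instead of panic" idea above in concrete terms, this
is roughly the shape I'd expect the allocation path to take. It's an
illustrative sketch only -- it uses an anonymous MAP_HUGETLB mapping,
while eal actually maps hugetlbfs-backed files:

#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Try to reserve up to 'want' bytes of hugepage memory, halving the
 * request on failure instead of aborting.  'hugesz' is the huge page
 * size in bytes.  Returns NULL (and *got == 0) if nothing at all was
 * available. */
static void *reserve_hugepages(size_t want, size_t hugesz, size_t *got)
{
	want -= want % hugesz;			/* whole hugepages only */
	while (want >= hugesz) {
		void *p = mmap(NULL, want, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
			       -1, 0);
		if (p != MAP_FAILED) {
			*got = want;
			return p;
		}
		/* retry with less rather than panicking */
		want = want / 2 - (want / 2) % hugesz;
	}
	fprintf(stderr, "warning: no hugepage memory available\n");
	*got = 0;
	return NULL;
}

The caller can then compare *got against what it originally asked for and
log a warning if it had to settle for less.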