From: "Tan, Jianfeng"
To: Neil Horman
Cc: dev@dpdk.org, yuanhan.liu@intel.com
Subject: Re: [dpdk-dev] [RFC] eal: add cgroup-aware resource self discovery
Date: Wed, 27 Jan 2016 20:02:27 +0800
Message-ID: <56A8B1D3.7000201@intel.com>
In-Reply-To: <20160126141907.GA20685@hmsreliant.think-freely.org>
References: <1453661393-85704-1-git-send-email-jianfeng.tan@intel.com>
 <20160125134636.GA29690@hmsreliant.think-freely.org>
 <56A6D85A.6030400@intel.com>
 <20160126141907.GA20685@hmsreliant.think-freely.org>

Hi Neil,

On 1/26/2016 10:19 PM, Neil Horman wrote:
> On Tue, Jan 26, 2016 at 10:22:18AM +0800, Tan, Jianfeng wrote:
>> Hi Neil,
>>
>> On 1/25/2016 9:46 PM, Neil Horman wrote:
>>> On Mon, Jan 25, 2016 at 02:49:53AM +0800, Jianfeng Tan wrote:
>> ...
>>>> --
>>>> 2.1.4
>>>>
>>>>
>>> This doesn't make a whole lot of sense, for several reasons:
>>>
>>> 1) Applications, as a general rule shouldn't be interrogating the cgroups
>>> interface at all.
>> The main reason to do this in DPDK is that DPDK obtains resource information
>> from sysfs and proc, which are not well containerized so far. And DPDK
>> pre-allocates resources instead of allocating them gradually on demand.
>>
> Not disagreeing with this, just suggesting that:
>
> 1) Interrogating cgroups really isn't the best way to collect that information
> 2) Pre-allocating those resources isn't particularly wise without some mechanism
> to reallocate it, as resource constraints can change (consider your cpuset
> getting rewritten)

On reallocation: for cpuset, DPDK panics during initialization if
set_affinity fails, but I believe a cpuset rewrite after that point will not
cause any problem. For memory, suppose a running application uses 2G of
hugepages and the admin then lowers the hugetlb cgroup limit to 1G: the
application will not get killed unless it tries to access more hugepages
(I'll double check this).

So another way to address this problem is to add an option that makes DPDK try
its best to allocate those resources and, if that fails, post a warning and
continue with whatever was allocated instead of panicking. What do you think?

>
>>> 2) Cgroups aren't the only way in which a cpuset or memoryset can be restricted
>>> (the isolcpus command line argument, or a taskset on a parent process for
>>> instance, but there are several others).
>> Yes, I agree. To enable that, I'd like to design the new API for resource self
>> discovery in a flexible way. A parameter "type" is used to specify the
>> discovery method. In addition, I'm considering adding a callback
>> function pointer so that users can write their own resource discovery
>> functions.
>>
> Why? You don't need an API for this, or if you really want one, it can be very
> generic if you use POSIX apis to gather the information. What you have here is
> going to be very linux specific, and will need reimplementing for BSD or other
> operating systems.
> To use the cpuset example, instead of reading and parsing
> the mask files in the cgroup filesystem module to find your task and
> corresponding mask, just call sched_setaffinity with an all f's mask, then call
> sched_getaffinity. The returned mask will be all the cpus your process is
> allowed to execute on, taking into account every limiting filter the system you
> are running on offers.

Yes, that makes sense on the CPU side.

>
> There are similar OS level POSIX apis for most resources out there. You really
> don't need to dig through cgroups just to learn what some of those resources are.
>
>>> Instead of trying to figure out what cpuset is valid for your process by
>>> interrogating the cgroups hierarchy, you should follow the prescribed
>>> method of calling sched_getaffinity after calling sched_setaffinity. That will
>>> give you the canonical cpuset that you are executing on, taking all cpuset
>>> filters into account (including cgroups and any other restrictions). It's far
>>> simpler as well, as it doesn't require a ton of file/string processing.
>> Yes, this way is much better for cpuset discovery. But is there such a
>> syscall for hugepages?
>>
> In what capacity? Interrogating how many hugepages you have, or to what node
> they are affined to? Capacity would require reading the requisite proc file, as
> there's no posix api for this resource. Node affinity can be implied by setting
> the numa policy of the dpdk and then writing to /proc/nr_hugepages, as the
> kernel will attempt to distribute hugepages evenly among the tasks' numa policy
> configuration.

For memory affinity, I believe the existing approach of reading
/proc/self/pagemap already handles the problem. What I was asking is how much
memory (or how many hugepages, in Linux's case) can be used. By the way, what
is /proc/nr_hugepages?
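
To make the cpuset suggestion above concrete, here is a minimal sketch of the
"set a wide mask, then read the mask back" approach. It is written in Python
only because its os module wraps the same Linux calls; inside EAL this would be
C calling sched_setaffinity()/sched_getaffinity() directly. The function name
discover_allowed_cpus is hypothetical and is not part of DPDK or of the RFC
patch. Both calls are Linux-specific, and the first one changes the caller's
affinity as a side effect.

```python
import os


def discover_allowed_cpus(pid=0):
    """Hypothetical helper: return the set of CPU ids `pid` may run on (Linux)."""
    # Ask for every CPU the system reports; the kernel intersects the request
    # with what the process is permitted to use (e.g. its cpuset).
    wanted = range(os.cpu_count() or 1)
    try:
        os.sched_setaffinity(pid, wanted)
    except OSError:
        # Widening failed (for example, none of the requested CPUs is
        # permitted); the current mask is still a usable answer.
        pass
    # Read back the mask that is actually in effect.
    return os.sched_getaffinity(pid)
```

Calling discover_allowed_cpus() with no argument queries the calling process.
No cgroup files are parsed, which is the simplification being argued for.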
>
> That said, I would advise that you strongly consider not exporting hugepages as
> a resource, as:
>
> a) Applications generally don't need to know that they are using hugepages, and
> so they don't need to know where said hugepages live, they just allocate memory
> via your allocation api and you give them something appropriate

But the provider of the allocation API, the DPDK library, needs to know whether
it is using hugepages or not.

> b) Hugepages are a resource that are very specific to Linux, and to X86 Linux at
> that. Some OS implement similar resources, but they may have very different
> semantics. And other Arches may or may not implement various forms of compound
> paging at all. As the DPDK expands to support more OS'es and arches, it would
> be nice to ensure that the programming surfaces that you expose have a more
> broad level of support.

That's why I put the current implementation in lib/librte_eal/linuxapp/. And the
new API uses the terms "cores" and "memory", which are very generic IMO. In
Linux's context, memory is interpreted as hugepages (maybe not entirely correct,
because DPDK can also be used with 4K pages). For other OSes, we could add
similar limits in terms of their own semantics.

Thanks,
Jianfeng

>
> Neil
>
>> Thanks,
>> Jianfeng
>>
>>> Neil
>>>
>>
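
As an illustration of the "capacity would require reading the requisite proc
file" point and of interpreting generic "memory" as hugepages on Linux, the
sketch below reads the hugepage counters from /proc/meminfo. The function names
are hypothetical and are not taken from the RFC. Note the limitation that
motivates this thread: these counters are system-wide and cover only the default
hugepage size, so they do not reflect a hugetlb cgroup limit.

```python
def parse_hugepage_info(meminfo_text):
    """Hypothetical helper: extract hugepage counters from /proc/meminfo text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            # Counters are bare integers; Hugepagesize carries a "kB" suffix.
            info[key] = int(rest.split()[0])
    return info


def free_hugepage_bytes(info):
    """Free hugepage memory in bytes, or 0 if the counters are missing."""
    return info.get("HugePages_Free", 0) * info.get("Hugepagesize", 0) * 1024


def read_hugepage_info(path="/proc/meminfo"):
    """Read and parse the live counters (system-wide, not cgroup-aware)."""
    with open(path) as f:
        return parse_hugepage_info(f.read())
```

A best-effort allocator of the kind proposed earlier in this mail could compare
its request against free_hugepage_bytes(read_hugepage_info()) and warn instead
of panicking when less is available.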