From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Rybchenko
To: "Burakov, Anatoly"
Subject: Re: [dpdk-dev] [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
Date: Mon, 30 Apr 2018 16:08:56 +0300
Message-ID: <278b60f9-1cee-fede-2af9-71fadc8dddb5@solarflare.com>
In-Reply-To: <40b42df8-cc22-e529-e85c-c41edc36a0ff@intel.com>
List-Id: DPDK patches and discussions

On 04/30/2018 02:31 PM, Burakov, Anatoly wrote:
> On 28-Apr-18 10:38 AM, Andrew Rybchenko wrote:
>> On 04/25/2018 01:36 PM, Anatoly Burakov wrote:
>>> The original implementation used flock() locks, but was later
>>> switched to using fcntl() locks for page locking, because fcntl()
>>> locks allow locking parts of a file. That is useful for
>>> single-file segments mode, where locking the entire file isn't as
>>> useful because we still need to grow and shrink it.
>>>
>>> However, according to fcntl()'s Ubuntu manpage [1], the semantics
>>> of fcntl() locks have a giant oversight:
>>>
>>>    This interface follows the completely stupid semantics of System
>>>    V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
>>>    locks associated with a file for a given process are removed
>>>    when any file descriptor for that file is closed by that
>>>    process. This semantic means that applications must be aware of
>>>    any files that a subroutine library may access.
>>>
>>> Basically, closing *any* fd of a file with an fcntl() lock (which
>>> we do because we don't want to leak fds) drops the lock completely.
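
The fcntl() pitfall quoted above can be reproduced in a few lines. A
minimal sketch (hypothetical /tmp path, error handling omitted; an
illustration of the manpage semantics, not DPDK code):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/fcntl-demo", O_CREAT | O_RDWR, 0600);
        struct flock fl = {
            .l_type = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start = 0,
            .l_len = 0,              /* length 0 locks the whole file */
        };
        fcntl(fd, F_SETLK, &fl);     /* exclusive record lock via fd */

        /* Some unrelated code (e.g. a library) opens and closes the
         * same file... */
        int other = open("/tmp/fcntl-demo", O_RDONLY);
        close(other);                /* ...and this close() silently
                                      * releases the lock taken via fd. */
        return 0;
    }

flock() locks, by contrast, are tied to the open file description, so
closing an unrelated descriptor for the same file leaves them intact.
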
>>> So, in this commit, we will be reverting back to using flock()
>>> locks everywhere. However, that still leaves the problem of locking
>>> parts of a memseg list file in single-file segments mode, and we
>>> will be solving it by creating a separate lock file per page and
>>> tracking those with flock().
>>>
>>> We will also be removing all of this tailq business and replacing
>>> it with a simple array - saving a few bytes is not worth the extra
>>> hassle of dealing with pointers and potential memory allocation
>>> failures. Also, remove the tailq lock since it is not needed -
>>> these fd lists are per-process, and within a given process, it is
>>> always only one thread handling access to hugetlbfs.
>>>
>>> So, the first process to allocate a segment will create a lockfile
>>> and put a shared lock on it. When we're shrinking the page file, we
>>> will try to take out a write lock on that lockfile, which will fail
>>> if any other process is holding onto the lockfile as well. This
>>> way, we can know whether we can shrink the segment file. Also, if
>>> no other locks are found in the lock list for a given memseg list,
>>> the memseg list fd is automatically closed.
>>>
>>> One other thing to note is that, according to flock()'s Ubuntu
>>> manpage [2], upgrading a lock from shared to exclusive is
>>> implemented by dropping and reacquiring the lock, which is not
>>> atomic and would therefore create race conditions. So, before
>>> performing operations in hugetlbfs, we will take out a write lock
>>> on the hugetlbfs directory, so that only one process at a time can
>>> perform hugetlbfs operations.
>>>
>>> [1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
>>> [2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
>>>
>>> Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
>>> Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
>>> Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
>>> Fixes: 2a04139f66b4 ("eal: add single file segments option")
>>> Cc: anatoly.burakov@intel.com
>>>
>>> Signed-off-by: Anatoly Burakov
>>> Acked-by: Bruce Richardson
>>
>> We have a problem with the changeset if the EAL option -m or
>> --socket-mem is used: EAL initialization hangs just after
>> "EAL: Probing VFIO support...". strace points to flock(7, LOCK_EX
>> blocking. List of file descriptors:
>>
>> # ls /proc/25452/fd -l
>> total 0
>> lrwx------ 1 root root 64 Apr 28 10:34 0 -> /dev/pts/0
>> lrwx------ 1 root root 64 Apr 28 10:34 1 -> /dev/pts/0
>> lrwx------ 1 root root 64 Apr 28 10:32 2 -> /dev/pts/0
>> lrwx------ 1 root root 64 Apr 28 10:34 3 -> /run/.rte_config
>> lrwx------ 1 root root 64 Apr 28 10:34 4 -> socket:[154166]
>> lrwx------ 1 root root 64 Apr 28 10:34 5 -> socket:[154158]
>> lr-x------ 1 root root 64 Apr 28 10:34 6 -> /dev/hugepages
>> lr-x------ 1 root root 64 Apr 28 10:34 7 -> /dev/hugepages
>>
>> I guess the problem is that there are two fds open on
>> /dev/hugepages and it hangs on the second one.
>>
>> Ideas how to solve it?
>>
>> Andrew.
>
> Hi Andrew,
>
> Please try the following patch:
>
> http://dpdk.org/dev/patchwork/patch/39166/
>
> This should fix the issue.

Hi Anatoly,

yes, it fixes the issue.

Thanks,
Andrew.
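
For reference, the hang reported above is consistent with flock()'s
per-open-file-description semantics: locks obtained through two
independent open() calls on the same path conflict even within a
single process, so the second flock(7, LOCK_EX) blocks against a lock
the process itself already holds via fd 6. A minimal sketch of the
self-deadlock (hypothetical path; LOCK_NB makes it fail visibly
instead of hanging):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>

    int main(void)
    {
        /* Two independent open() calls create two open file
         * descriptions, so their flock() locks conflict even within
         * one process. */
        int fd1 = open("/tmp/flock-demo", O_CREAT | O_RDONLY, 0600);
        int fd2 = open("/tmp/flock-demo", O_RDONLY);

        flock(fd1, LOCK_EX);                  /* acquired */
        if (flock(fd2, LOCK_EX | LOCK_NB) == -1)
            perror("flock(fd2)");             /* EWOULDBLOCK: without
                                               * LOCK_NB this call would
                                               * block forever */
        return 0;
    }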