DPDK patches and discussions
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: David Christensen <drc@linux.vnet.ibm.com>, dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
Date: Fri, 1 May 2020 10:06:28 +0100
Message-ID: <58df8aa5-e9b5-7f9a-2aee-fcb19b6dea04@intel.com>
In-Reply-To: <6763793c-265b-c5cf-228a-b2c177574c84@linux.vnet.ibm.com>

On 30-Apr-20 6:36 PM, David Christensen wrote:
> 
> 
> On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
>> On 30-Apr-20 12:29 AM, David Christensen wrote:
>>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>>> size in response to every new memory allocation. This is potentially
>>> dangerous because all existing mappings need to be unmapped/remapped in
>>> order to resize the DMA window, leaving hardware holding IOVA addresses
>>> that are not properly prepared for DMA.  The new SPAPR code statically
>>> assigns the DMA window size on first use, using the largest physical
>>> memory address when IOVA=PA and the base_virtaddr + physical memory size
>>> when IOVA=VA.  As a result, memory will only be unmapped when
>>> specifically requested.
>>>
>>> Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
>>> ---
>>
>> Hi David,
>>
>> I haven't yet looked at the code in detail (will do so later), but 
>> some general comments and questions below.
>>
>>> +        /*
>>> +         * Read "System RAM" in /proc/iomem:
>>> +         * 00000000-1fffffffff : System RAM
>>> +         * 200000000000-201fffffffff : System RAM
>>> +         */
>>> +        FILE *fd = fopen(proc_iomem, "r");
>>> +        if (fd == NULL) {
>>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>>> +            return -1;
>>> +        }
>>
>> A quick check on my machines shows that when cat'ing /proc/iomem as 
>> non-root, you get zeroes everywhere, which leads me to believe that 
>> you have to be root to get anything useful out of /proc/iomem. Since 
>> one of the major selling points of VFIO is the ability to run as 
>> non-root, depending on iomem kind of defeats the purpose a bit.
> 
> I observed the same thing on my system during development.  I didn't see 
> anything that precluded support for RTE_IOVA_PA in the VFIO code.  Are 
> you suggesting that I should explicitly not support that configuration? 
> If you're attempting to use RTE_IOVA_PA then you're already required to 
> run as root, so there shouldn't be an issue accessing this

Oh, right, forgot about that. That's OK then.
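
For illustration only, a minimal sketch of the /proc/iomem lookup being
discussed. The helper name highest_system_ram_addr() is hypothetical and
this is not the code from the patch. It assumes the "start-end : System
RAM" line format shown in the quoted comment above, and returns 0 when no
usable range is found, which is also what an unprivileged read would
produce since all addresses then read as zero:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the highest end address of any
 * "System RAM" range in an iomem-style stream, or 0 if none is found.
 */
static uint64_t
highest_system_ram_addr(FILE *f)
{
	char line[256];
	uint64_t max_end = 0;

	while (fgets(line, sizeof(line), f) != NULL) {
		uint64_t start, end;

		/* only top-level or nested "System RAM" entries matter */
		if (strstr(line, "System RAM") == NULL)
			continue;
		/* lines look like "00000000-1fffffffff : System RAM" */
		if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) != 2)
			continue;
		if (end > max_end)
			max_end = end;
	}
	return max_end;
}
```

A caller would treat a return value of 0 as "nothing useful could be
read" and fail the IOVA=PA window setup instead of sizing a window from
it.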

> 
>>> +        return 0;
>>> +
>>> +    } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>>> +        /* Set the DMA window to base_virtaddr + system memory size */
>>> +        const char proc_meminfo[] = "/proc/meminfo";
>>> +        const char str_memtotal[] = "MemTotal:";
>>> +        int memtotal_len = sizeof(str_memtotal) - 1;
>>> +        char buffer[256];
>>> +        uint64_t size = 0;
>>> +
>>> +        FILE *fd = fopen(proc_meminfo, "r");
>>> +        if (fd == NULL) {
>>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>>> +            return -1;
>>> +        }
>>> +        while (fgets(buffer, sizeof(buffer), fd)) {
>>> +            if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>>> +                size = rte_str_to_size(&buffer[memtotal_len]);
>>> +                break;
>>> +            }
>>> +        }
>>> +        fclose(fd);
>>> +
>>> +        if (size == 0) {
>>> +            RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>>> +                "in file %s\n", proc_meminfo);
>>> +            return -1;
>>> +        }
>>> +
>>> +        RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>>> +        /* if no base virtual address is configured use 4GB */
>>> +        spapr_dma_win_len = rte_align64pow2(size +
>>> +            (internal_config.base_virtaddr > 0 ?
>>> +            (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>>> +        rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>>
>> I'm not sure of the algorithm for "memory size" here.
>>
>> Technically, DPDK can reserve memory segments anywhere in the VA space 
>> allocated by memseg lists. That space may be far bigger than system 
>> memory (on a typical Intel server board you'd see 128GB of VA space 
>> preallocated even though the machine itself might only have, say, 16GB 
>> of RAM installed). The same applies to any other arch running on 
>> Linux, so the window needs to cover at least RTE_MIN(base_virtaddr, 
>> lowest memseglist VA address) and up to highest memseglist VA address. 
>> That's not even mentioning the fact that the user may register 
>> external memory for DMA which may cause the window to be of 
>> insufficient size to cover said external memory.
>>
>> I also think that in general, "system memory" metric is ill suited for 
>> measuring VA space, because unlike system memory, the VA space is 
>> sparse and can therefore span *a lot* of address space even though in 
>> reality it may actually use very little physical memory.
> 
> I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
> 
> VmallocTotal:   549755813888 kB
> 
> I tested it with 1GB hugepages and it works, need to check with 2M as 
> well.  If there's no alternative for sizing the window based on 
> available system parameters then I have another option which creates a 
> new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X 
> where X is configured on the EAL command-line (--iova-base, --iova-len). 
>   I use these command-line values to create a static window.
> 

A whole new IOVA mode, while being a cleaner solution, would require a 
lot of testing, and it doesn't really solve the external memory problem, 
because we're still reliant on the user to provide IOVA addresses. 
Perhaps something akin to VA/IOVA address reservation would solve the 
problem, but again, lots of changes and testing, all for a comparatively 
narrow use case.

The vmalloc area seems big enough (512 terabytes on your machine, 32 
terabytes on mine), so it'll probably be OK. I'd settle for:

1) start at base_virtaddr OR the lowest memseg list address, whichever
is lower
2) end at lowest addr + VmallocTotal OR the highest memseg list addr,
whichever is higher
3) a check in the user DMA map function that warns/returns an error
whenever there is an attempt to map an address for DMA that doesn't fit
into the DMA window
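
The three points above could be sketched roughly as follows. This is an
illustration under stated assumptions, not EAL code: struct dma_window,
compute_window() and window_contains() are hypothetical names, a
base_virtaddr of 0 is taken to mean "not configured", msl_highest_end is
an exclusive end address, and the length is rounded up to a power of two
in the same spirit as the rte_align64pow2() call in the patch:

```c
#include <stdbool.h>
#include <stdint.h>

struct dma_window {
	uint64_t start; /* lowest address covered by the window */
	uint64_t len;   /* window length in bytes, 0 on overflow */
};

/* Round up to the next power of two; returns 0 if that overflows. */
static uint64_t
round_up_pow2(uint64_t v)
{
	uint64_t p = 1;

	while (p != 0 && p < v)
		p <<= 1;
	return p;
}

static struct dma_window
compute_window(uint64_t base_virtaddr, uint64_t msl_lowest,
		uint64_t msl_highest_end, uint64_t vmalloc_total)
{
	struct dma_window w;
	uint64_t end;

	/* 1) start at base_virtaddr (if set) or the lowest memseg list */
	w.start = msl_lowest;
	if (base_virtaddr != 0 && base_virtaddr < msl_lowest)
		w.start = base_virtaddr;

	/* 2) end at start + VmallocTotal or the highest memseg list end */
	end = (vmalloc_total > UINT64_MAX - w.start) ?
		UINT64_MAX : w.start + vmalloc_total;
	if (msl_highest_end > end)
		end = msl_highest_end;

	w.len = round_up_pow2(end - w.start);
	return w;
}

/* 3) check to run before mapping user-provided memory for DMA */
static bool
window_contains(const struct dma_window *w, uint64_t addr, uint64_t len)
{
	if (addr < w->start || len > w->len)
		return false;
	/* written this way so that addr + len cannot overflow */
	return addr - w->start <= w->len - len;
}
```

The user DMA map path would call window_contains() first and log a
warning or return an error when it fails, rather than attempting to
resize the window.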

I think that would be the best approach. Thoughts?

> Dave


-- 
Thanks,
Anatoly
