DPDK patches and discussions
* [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management
@ 2020-03-30  4:10 Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
                   ` (9 more replies)
  0 siblings, 10 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev; +Cc: Dmitry Malloy (MESHCHANINOV), Dmitry Kozlyuk

This RFC implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* IOVA is always PA, obtained through a kernel-mode driver.
* No 32-bit support (presumably there is no demand for it).
* No multi-process support. Note that without --in-memory, EAL MM
  will create files in the current working (runtime) directory.
* No-huge mode for testing when no IOVA is available. IOVAs could be
  obtained using Address Windowing Extensions, but demand for this is
  doubtful (see the example after this list).

Roadmap for Windows [1] proposes that memory management (MM) should be
implemented in basic and advanced stages:

1. Basic MM must be sufficient for core libraries and network PMDs.
2. Advanced MM could address features missing from the basic stage.

Advanced memory management discussion is out of scope of this RFC.
Windows community calls suggest it will be focused on security and IOMMU
support. I will post a separate thread with background and suggestions.
Cc'ing Dmitry Malloy (MESHCHANINOV) nevertheless.

Because netUIO is not yet committed to dpdk-kmods, the first commit
introduces a new simple driver, virt2phys. It will almost certainly
become a part of netUIO once it is available for patches. Until then, it
must be installed according to documentation provided with the patch.
User-mode code locates the driver interface by GUID, so transition from
virt2phys to netUIO should not require changes to DPDK.
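
For reference, a user-mode translation could look as follows (a minimal
sketch, not part of the patch: error handling is omitted, the GUID and
IOCTL come from the virt2phys.h header below, and setupapi.lib must be
linked):

    #include <stdlib.h>
    #include <windows.h>
    #include <setupapi.h>
    #include <initguid.h>
    #include "virt2phys.h"

    static PHYSICAL_ADDRESS
    translate(PVOID virt)
    {
        HDEVINFO list;
        SP_DEVICE_INTERFACE_DATA dev;
        PSP_DEVICE_INTERFACE_DETAIL_DATA detail;
        DWORD size, bytes;
        HANDLE device;
        PHYSICAL_ADDRESS phys = { .QuadPart = 0 };

        /* Locate the driver interface by its GUID. */
        list = SetupDiGetClassDevs(&GUID_DEVINTERFACE_VIRT2PHYS,
                NULL, NULL, DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
        dev.cbSize = sizeof(dev);
        SetupDiEnumDeviceInterfaces(list, NULL,
                &GUID_DEVINTERFACE_VIRT2PHYS, 0, &dev);

        /* Get the device path (query size first, then the data). */
        SetupDiGetDeviceInterfaceDetail(list, &dev, NULL, 0, &size, NULL);
        detail = malloc(size);
        detail->cbSize = sizeof(*detail);
        SetupDiGetDeviceInterfaceDetail(list, &dev, detail, size,
                NULL, NULL);

        /* Open the device and request a translation. */
        device = CreateFile(detail->DevicePath, 0, 0, NULL,
                OPEN_EXISTING, 0, NULL);
        DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
                &virt, sizeof(virt), &phys, sizeof(phys), &bytes, NULL);

        CloseHandle(device);
        free(detail);
        SetupDiDestroyDeviceInfoList(list);
        return phys;
    }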

Hugepages allocation on Windows requires some privilege setup. Please
refer to documentation provided in the "initialize hugepage info" patch.

New EAL public functions for memory mapping are introduced. Their
implementations for Linux and FreeBSD are identical. The upcoming patch
series reorganizing EAL directories will help to eliminate this
duplication [2].

Windows MM duplicates quite a lot of code from Linux EAL:

* eal_memalloc_alloc_seg_bulk
* eal_memalloc_free_seg_bulk
* calc_num_pages_per_socket
* rte_eal_hugepage_init

Input is needed on whether this should be left as-is to evolve
independently, or whether some code common to memory hot-plug should be
factored out. This duplication may be reduced naturally when advanced MM
is implemented.

Notes on checkpatch warnings:

* No space after comma / no space before closing parenthesis in
  macros---definitely a false positive; it is unclear how to suppress it.

* Issues from imported BSD code---probably should be ignored?

* Checkpatch is not run against dpdk-kmods (Windows drivers).

[1]: http://core.dpdk.org/roadmap/windows/
[2]: https://patchwork.dpdk.org/project/dpdk/list/?series=9070

Dmitry Kozlyuk (8):
  eal/windows: do not expose private EAL facilities
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal/windows: fix rte_page_sizes with Clang on Windows
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: implement basic memory management

 config/meson.build                            |   12 +-
 doc/guides/windows_gsg/build_dpdk.rst         |   20 -
 doc/guides/windows_gsg/index.rst              |    1 +
 doc/guides/windows_gsg/run_apps.rst           |   47 +
 lib/librte_eal/common/eal_common_fbarray.c    |   57 +-
 lib/librte_eal/common/eal_common_memory.c     |   50 +-
 lib/librte_eal/common/eal_private.h           |  116 +-
 lib/librte_eal/common/include/rte_memory.h    |   69 +
 lib/librte_eal/common/malloc_heap.c           |    1 +
 lib/librte_eal/freebsd/eal/eal.c              |   40 +
 lib/librte_eal/freebsd/eal/eal_memory.c       |  118 +-
 lib/librte_eal/linux/eal/eal.c                |   40 +
 lib/librte_eal/linux/eal/eal_memory.c         |  117 ++
 lib/librte_eal/meson.build                    |    4 +
 lib/librte_eal/rte_eal_exports.def            |  119 ++
 lib/librte_eal/rte_eal_version.map            |    4 +
 lib/librte_eal/windows/eal/eal.c              |  152 +++
 lib/librte_eal/windows/eal/eal_hugepages.c    |  108 ++
 lib/librte_eal/windows/eal/eal_lcore.c        |  187 ++-
 lib/librte_eal/windows/eal/eal_memalloc.c     |  423 ++++++
 lib/librte_eal/windows/eal/eal_memory.c       | 1166 +++++++++++++++++
 lib/librte_eal/windows/eal/eal_mp.c           |  103 ++
 lib/librte_eal/windows/eal/eal_thread.c       |    1 +
 lib/librte_eal/windows/eal/eal_windows.h      |  113 ++
 lib/librte_eal/windows/eal/include/pthread.h  |    2 +
 lib/librte_eal/windows/eal/include/rte_os.h   |   48 +-
 .../windows/eal/include/rte_virt2phys.h       |   34 +
 .../windows/eal/include/rte_windows.h         |   43 +
 .../windows/eal/include/sys/queue.h           |  663 +++++++++-
 lib/librte_eal/windows/eal/include/unistd.h   |    3 +
 lib/librte_eal/windows/eal/meson.build        |   15 +
 31 files changed, 3626 insertions(+), 250 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal/eal_mp.c
 create mode 100644 lib/librte_eal/windows/eal/eal_windows.h
 create mode 100644 lib/librte_eal/windows/eal/include/rte_virt2phys.h
 create mode 100644 lib/librte_eal/windows/eal/include/rte_windows.h

-- 
2.25.1



* [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  6:58   ` Jerin Jacob
  2020-04-10  1:45   ` Ranjit Menon
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 2/9] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev; +Cc: Dmitry Malloy (MESHCHANINOV), Dmitry Kozlyuk

This patch is for dpdk-kmods tree.

This driver supports Windows EAL memory management by translating
current process virtual addresses to physical addresses (IOVA).
A standalone virt2phys driver allows using DPDK without a PMD and
provides a reference implementation. UIO drivers might also implement
the virt2phys interface, rendering this separate driver unnecessary.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 windows/README.rst                          |  79 +++++++
 windows/virt2phys/virt2phys.c               | 129 +++++++++++
 windows/virt2phys/virt2phys.h               |  34 +++
 windows/virt2phys/virt2phys.inf             |  85 ++++++++
 windows/virt2phys/virt2phys.sln             |  27 +++
 windows/virt2phys/virt2phys.vcxproj         | 228 ++++++++++++++++++++
 windows/virt2phys/virt2phys.vcxproj.filters |  36 ++++
 7 files changed, 618 insertions(+)
 create mode 100644 windows/README.rst
 create mode 100755 windows/virt2phys/virt2phys.c
 create mode 100755 windows/virt2phys/virt2phys.h
 create mode 100755 windows/virt2phys/virt2phys.inf
 create mode 100755 windows/virt2phys/virt2phys.sln
 create mode 100755 windows/virt2phys/virt2phys.vcxproj
 create mode 100755 windows/virt2phys/virt2phys.vcxproj.filters

diff --git a/windows/README.rst b/windows/README.rst
new file mode 100644
index 0000000..84506fa
--- /dev/null
+++ b/windows/README.rst
@@ -0,0 +1,79 @@
+Developing Windows Drivers
+==========================
+
+Prerequisites
+-------------
+
+Building Windows Drivers is only possible on Windows.
+
+1. Visual Studio 2019 Community or Professional Edition
+2. Windows Driver Kit (WDK) for Windows 10, version 1903
+
+Follow the official instructions to obtain all of the above:
+https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
+
+
+Build the Drivers
+-----------------
+
+Build from Visual Studio
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Open a solution (``*.sln``) with Visual Studio and build it (Ctrl+Shift+B).
+
+
+Build from Command-Line
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Run "Developer Command Prompt for VS 2019" from the Start menu.
+
+Navigate to the solution directory (with ``*.sln``), then run:
+
+    msbuild
+
+To build a particular combination of configuration and platform:
+
+    msbuild -p:Configuration=Debug;Platform=x64
+
+
+Install the Drivers
+-------------------
+
+Disable Driver Signature Enforcement
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, Windows prohibits installing and loading drivers without a
+digital signature obtained from Microsoft (read more: `Driver Signing`_).
+For development, signature enforcement may be disabled as follows.
+
+In Elevated Command Prompt:
+
+    bcdedit -set loadoptions DISABLE_INTEGRITY_CHECKS
+    bcdedit -set TESTSIGNING ON
+    shutdown -r -t 0
+
+Upon reboot, an overlay message should appear on the desktop, informing
+you that Windows is in test mode, which means unsigned drivers can be loaded.
+
+.. _Driver Signing: https://docs.microsoft.com/en-us/windows-hardware/drivers/install/driver-signing
+
+
+Install, List, and Uninstall Drivers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The driver package is by default located in a subdirectory of its source
+tree, e.g. ``x64\Debug\virt2phys\virt2phys`` (note two levels of ``virt2phys``).
+
+To install the driver and bind associated devices to it:
+
+    pnputil /add-driver x64\Debug\virt2phys\virt2phys\virt2phys.inf /install
+
+A graphical confirmation to load an unsigned driver will still appear.
+
+To list installed drivers:
+
+    pnputil /enum-drivers
+
+To remove the driver package and to uninstall its devices:
+
+    pnputil /delete-driver oem2.inf /uninstall
diff --git a/windows/virt2phys/virt2phys.c b/windows/virt2phys/virt2phys.c
new file mode 100755
index 0000000..6c494d4
--- /dev/null
+++ b/windows/virt2phys/virt2phys.c
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <ntddk.h>
+#include <wdf.h>
+#include <wdmsec.h>
+#include <initguid.h>
+
+#include "virt2phys.h"
+
+DRIVER_INITIALIZE DriverEntry;
+EVT_WDF_DRIVER_DEVICE_ADD virt2phys_driver_EvtDeviceAdd;
+EVT_WDF_IO_IN_CALLER_CONTEXT virt2phys_device_EvtIoInCallerContext;
+
+NTSTATUS
+DriverEntry(
+	IN PDRIVER_OBJECT driver_object, IN PUNICODE_STRING registry_path)
+{
+	WDF_DRIVER_CONFIG config;
+	WDF_OBJECT_ATTRIBUTES attributes;
+	NTSTATUS status;
+
+	PAGED_CODE();
+
+	WDF_DRIVER_CONFIG_INIT(&config, virt2phys_driver_EvtDeviceAdd);
+	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
+	status = WdfDriverCreate(
+			driver_object, registry_path,
+			&attributes, &config, WDF_NO_HANDLE);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDriverCreate() failed, status=%08x\n", status));
+	}
+
+	return status;
+}
+
+_Use_decl_annotations_
+NTSTATUS
+virt2phys_driver_EvtDeviceAdd(
+	WDFDRIVER driver, PWDFDEVICE_INIT init)
+{
+	WDF_OBJECT_ATTRIBUTES attributes;
+	WDFDEVICE device;
+	NTSTATUS status;
+
+	UNREFERENCED_PARAMETER(driver);
+
+	PAGED_CODE();
+
+	WdfDeviceInitSetIoType(
+		init, WdfDeviceIoNeither);
+	WdfDeviceInitSetIoInCallerContextCallback(
+		init, virt2phys_device_EvtIoInCallerContext);
+
+	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
+
+	status = WdfDeviceCreate(&init, &attributes, &device);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDeviceCreate() failed, status=%08x\n", status));
+		return status;
+	}
+
+	status = WdfDeviceCreateDeviceInterface(
+			device, &GUID_DEVINTERFACE_VIRT2PHYS, NULL);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDeviceCreateDeviceInterface() failed, "
+			"status=%08x\n", status));
+		return status;
+	}
+
+	return STATUS_SUCCESS;
+}
+
+_Use_decl_annotations_
+VOID
+virt2phys_device_EvtIoInCallerContext(
+	IN WDFDEVICE device, IN WDFREQUEST request)
+{
+	WDF_REQUEST_PARAMETERS params;
+	ULONG code;
+	PVOID *virt;
+	PHYSICAL_ADDRESS *phys;
+	size_t size;
+	NTSTATUS status;
+
+	UNREFERENCED_PARAMETER(device);
+
+	PAGED_CODE();
+
+	WDF_REQUEST_PARAMETERS_INIT(&params);
+	WdfRequestGetParameters(request, &params);
+
+	if (params.Type != WdfRequestTypeDeviceControl) {
+		KdPrint(("bogus request type=%u\n", params.Type));
+		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
+		return;
+	}
+
+	code = params.Parameters.DeviceIoControl.IoControlCode;
+	if (code != IOCTL_VIRT2PHYS_TRANSLATE) {
+		KdPrint(("bogus IO control code=%lu\n", code));
+		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
+		return;
+	}
+
+	status = WdfRequestRetrieveInputBuffer(
+			request, sizeof(*virt), (PVOID *)&virt, &size);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfRequestRetrieveInputBuffer() failed, "
+			"status=%08x\n", status));
+		WdfRequestComplete(request, status);
+		return;
+	}
+
+	status = WdfRequestRetrieveOutputBuffer(
+		request, sizeof(*phys), &phys, &size);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
+			"status=%08x\n", status));
+		WdfRequestComplete(request, status);
+		return;
+	}
+
+	*phys = MmGetPhysicalAddress(*virt);
+
+	WdfRequestCompleteWithInformation(
+		request, STATUS_SUCCESS, sizeof(*phys));
+}
diff --git a/windows/virt2phys/virt2phys.h b/windows/virt2phys/virt2phys.h
new file mode 100755
index 0000000..4bb2b4a
--- /dev/null
+++ b/windows/virt2phys/virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
diff --git a/windows/virt2phys/virt2phys.inf b/windows/virt2phys/virt2phys.inf
new file mode 100755
index 0000000..e8adaac
--- /dev/null
+++ b/windows/virt2phys/virt2phys.inf
@@ -0,0 +1,85 @@
+; SPDX-License-Identifier: BSD-3-Clause
+; Copyright (c) 2020 Dmitry Kozlyuk
+
+[Version]
+Signature = "$WINDOWS NT$"
+Class = %ClassName%
+ClassGuid = {78A1C341-4539-11d3-B88D-00C04FAD5171}
+Provider = %ManufacturerName%
+CatalogFile = virt2phys.cat
+DriverVer =
+
+[DestinationDirs]
+DefaultDestDir = 12
+virt2phys_Device_CoInstaller_CopyFiles = 11
+
+; ================= Class section =====================
+
+[ClassInstall32]
+Addreg = virt2phys_ClassReg
+
+[virt2phys_ClassReg]
+HKR,,,0,%ClassName%
+HKR,,Icon,,-5
+
+[SourceDisksNames]
+1 = %DiskName%,,,""
+
+[SourceDisksFiles]
+virt2phys.sys  = 1,,
+WdfCoInstaller$KMDFCOINSTALLERVERSION$.dll = 1
+
+;*****************************************
+; Install Section
+;*****************************************
+
+[Manufacturer]
+%ManufacturerName%=Standard,NT$ARCH$
+
+[Standard.NT$ARCH$]
+%virt2phys.DeviceDesc%=virt2phys_Device, Root\virt2phys
+
+[virt2phys_Device.NT]
+CopyFiles = Drivers_Dir
+
+[Drivers_Dir]
+virt2phys.sys
+
+;-------------- Service installation
+[virt2phys_Device.NT.Services]
+AddService = virt2phys,%SPSVCINST_ASSOCSERVICE%, virt2phys_Service_Inst
+
+; -------------- virt2phys driver install sections
+[virt2phys_Service_Inst]
+DisplayName    = %virt2phys.SVCDESC%
+ServiceType    = 1 ; SERVICE_KERNEL_DRIVER
+StartType      = 3 ; SERVICE_DEMAND_START
+ErrorControl   = 1 ; SERVICE_ERROR_NORMAL
+ServiceBinary  = %12%\virt2phys.sys
+
+;
+;--- virt2phys_Device Coinstaller installation ------
+;
+
+[virt2phys_Device.NT.CoInstallers]
+AddReg = virt2phys_Device_CoInstaller_AddReg
+CopyFiles = virt2phys_Device_CoInstaller_CopyFiles
+
+[virt2phys_Device_CoInstaller_AddReg]
+HKR,,CoInstallers32,0x00010000, "WdfCoInstaller$KMDFCOINSTALLERVERSION$.dll,WdfCoInstaller"
+
+[virt2phys_Device_CoInstaller_CopyFiles]
+WdfCoInstaller$KMDFCOINSTALLERVERSION$.dll
+
+[virt2phys_Device.NT.Wdf]
+KmdfService = virt2phys, virt2phys_wdfsect
+[virt2phys_wdfsect]
+KmdfLibraryVersion = $KMDFVERSION$
+
+[Strings]
+SPSVCINST_ASSOCSERVICE = 0x00000002
+ManufacturerName = "Dmitry Kozlyuk"
+ClassName = "Kernel bypass"
+DiskName = "virt2phys Installation Disk"
+virt2phys.DeviceDesc = "Virtual to physical address translator"
+virt2phys.SVCDESC = "virt2phys Service"
diff --git a/windows/virt2phys/virt2phys.sln b/windows/virt2phys/virt2phys.sln
new file mode 100755
index 0000000..0f5ecdc
--- /dev/null
+++ b/windows/virt2phys/virt2phys.sln
@@ -0,0 +1,27 @@
+
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio Version 16
+VisualStudioVersion = 16.0.29613.14
+MinimumVisualStudioVersion = 10.0.40219.1
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "virt2phys", "virt2phys.vcxproj", "{0EEF826B-9391-43A8-A722-BDD6F6115137}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|x64 = Debug|x64
+		Release|x64 = Release|x64
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.ActiveCfg = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Build.0 = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Deploy.0 = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.ActiveCfg = Release|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Build.0 = Release|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Deploy.0 = Release|x64
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+	GlobalSection(ExtensibilityGlobals) = postSolution
+		SolutionGuid = {845012FB-4471-4A12-A1C4-FF7E05C40E8E}
+	EndGlobalSection
+EndGlobal
diff --git a/windows/virt2phys/virt2phys.vcxproj b/windows/virt2phys/virt2phys.vcxproj
new file mode 100755
index 0000000..fa51916
--- /dev/null
+++ b/windows/virt2phys/virt2phys.vcxproj
@@ -0,0 +1,228 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup Label="ProjectConfigurations">
+    <ProjectConfiguration Include="Debug|Win32">
+      <Configuration>Debug</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|Win32">
+      <Configuration>Release</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|ARM">
+      <Configuration>Debug</Configuration>
+      <Platform>ARM</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|ARM">
+      <Configuration>Release</Configuration>
+      <Platform>ARM</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|ARM64">
+      <Configuration>Debug</Configuration>
+      <Platform>ARM64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|ARM64">
+      <Configuration>Release</Configuration>
+      <Platform>ARM64</Platform>
+    </ProjectConfiguration>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="virt2phys.c" />
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="virt2phys.h" />
+  </ItemGroup>
+  <ItemGroup>
+    <Inf Include="virt2phys.inf" />
+  </ItemGroup>
+  <PropertyGroup Label="Globals">
+    <ProjectGuid>{0EEF826B-9391-43A8-A722-BDD6F6115137}</ProjectGuid>
+    <TemplateGuid>{497e31cb-056b-4f31-abb8-447fd55ee5a5}</TemplateGuid>
+    <TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
+    <MinimumVisualStudioVersion>12.0</MinimumVisualStudioVersion>
+    <Configuration>Debug</Configuration>
+    <Platform Condition="'$(Platform)' == ''">Win32</Platform>
+    <RootNamespace>virt2phys</RootNamespace>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
+  <ImportGroup Label="ExtensionSettings">
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <PropertyGroup Label="UserMacros" />
+  <PropertyGroup />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <ClCompile>
+      <WppEnabled>false</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+    <Link>
+      <AdditionalDependencies>$(DDK_LIB_PATH)wdmsec.lib;%(AdditionalDependencies)</AdditionalDependencies>
+    </Link>
+    <Inf>
+      <TimeStamp>0.1</TimeStamp>
+    </Inf>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemGroup>
+    <FilesToPackage Include="$(TargetPath)" />
+  </ItemGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
+  <ImportGroup Label="ExtensionTargets">
+  </ImportGroup>
+</Project>
\ No newline at end of file
diff --git a/windows/virt2phys/virt2phys.vcxproj.filters b/windows/virt2phys/virt2phys.vcxproj.filters
new file mode 100755
index 0000000..0fe65fc
--- /dev/null
+++ b/windows/virt2phys/virt2phys.vcxproj.filters
@@ -0,0 +1,36 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup>
+    <Filter Include="Source Files">
+      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
+      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
+    </Filter>
+    <Filter Include="Header Files">
+      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>
+      <Extensions>h;hpp;hxx;hm;inl;inc;xsd</Extensions>
+    </Filter>
+    <Filter Include="Resource Files">
+      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>
+      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms</Extensions>
+    </Filter>
+    <Filter Include="Driver Files">
+      <UniqueIdentifier>{8E41214B-6785-4CFE-B992-037D68949A14}</UniqueIdentifier>
+      <Extensions>inf;inv;inx;mof;mc;</Extensions>
+    </Filter>
+  </ItemGroup>
+  <ItemGroup>
+    <Inf Include="virt2phys.inf">
+      <Filter>Driver Files</Filter>
+    </Inf>
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="virt2phys.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="virt2phys.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+  </ItemGroup>
+</Project>
-- 
2.25.1



* [dpdk-dev] [RFC PATCH 2/9] eal/windows: do not expose private EAL facilities
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 3/9] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Thomas Monjalon, Anand Rawat

The goal of rte_os.h is to mitigate OS differences for EAL users.
In Windows EAL, rte_os.h did excessive things:

1. It included platform SDK headers (windows.h, etc). Those files are
   huge, require specific inclusion order, and are generally unused by
   the code including rte_os.h. Declarations from platform SDK may
   break otherwise platform-independent code, e.g. min, max, ERROR.

2. It included pthread.h, which is clearly not always required.

3. It defined functions private to Windows EAL.

Reorganize Windows EAL includes in the following way:

1. Create rte_windows.h to properly import Windows-specific facilities.
   Primary users are bus drivers, tests, and external applications
   (a usage sketch follows this list).

2. Remove platform SDK includes from rte_os.h to prevent breaking
   otherwise portable code by including rte_os.h on Windows.
   Copy necessary definitions to avoid including those headers.

3. Remove pthread.h include from rte_os.h.

4. Move declarations private to Windows EAL into eal_windows.h.
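
With this split, code that needs Win32 API includes rte_windows.h
explicitly and can use the error-logging helper it defines (a
hypothetical fragment, not part of the patch):

    #include <rte_log.h>
    #include <rte_windows.h>

    /* Reserve address space; log GetLastError() with context on failure. */
    static void *
    reserve_region(size_t size)
    {
        void *addr = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);

        if (addr == NULL)
            RTE_LOG_WIN32_ERR("VirtualAlloc(size=%zu)", size);
        return addr;
    }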

Fixes: 428eb983f5f7 ("eal: add OS specific header file")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal/eal.c              |  2 +
 lib/librte_eal/windows/eal/eal_lcore.c        |  2 +
 lib/librte_eal/windows/eal/eal_thread.c       |  1 +
 lib/librte_eal/windows/eal/eal_windows.h      | 29 ++++++++++++
 lib/librte_eal/windows/eal/include/pthread.h  |  2 +
 lib/librte_eal/windows/eal/include/rte_os.h   | 44 ++++++-------------
 .../windows/eal/include/rte_windows.h         | 41 +++++++++++++++++
 lib/librte_eal/windows/eal/meson.build        |  1 +
 8 files changed, 91 insertions(+), 31 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal/eal_windows.h
 create mode 100644 lib/librte_eal/windows/eal/include/rte_windows.h

diff --git a/lib/librte_eal/windows/eal/eal.c b/lib/librte_eal/windows/eal/eal.c
index e4b50df3b..2cf7a04ef 100644
--- a/lib/librte_eal/windows/eal/eal.c
+++ b/lib/librte_eal/windows/eal/eal.c
@@ -18,6 +18,8 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_windows.h"
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
diff --git a/lib/librte_eal/windows/eal/eal_lcore.c b/lib/librte_eal/windows/eal/eal_lcore.c
index b3a6c63af..82ee45413 100644
--- a/lib/librte_eal/windows/eal/eal_lcore.c
+++ b/lib/librte_eal/windows/eal/eal_lcore.c
@@ -2,12 +2,14 @@
  * Copyright(c) 2019 Intel Corporation
  */
 
+#include <pthread.h>
 #include <stdint.h>
 
 #include <rte_common.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_windows.h"
 
 /* global data structure that contains the CPU map */
 static struct _wcpu_map {
diff --git a/lib/librte_eal/windows/eal/eal_thread.c b/lib/librte_eal/windows/eal/eal_thread.c
index 9e4bbaa08..e149199a6 100644
--- a/lib/librte_eal/windows/eal/eal_thread.c
+++ b/lib/librte_eal/windows/eal/eal_thread.c
@@ -14,6 +14,7 @@
 #include <eal_thread.h>
 
 #include "eal_private.h"
+#include "eal_windows.h"
 
 RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
 RTE_DEFINE_PER_LCORE(unsigned int, _socket_id) = (unsigned int)SOCKET_ID_ANY;
diff --git a/lib/librte_eal/windows/eal/eal_windows.h b/lib/librte_eal/windows/eal/eal_windows.h
new file mode 100644
index 000000000..fadd676b2
--- /dev/null
+++ b/lib/librte_eal/windows/eal/eal_windows.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#ifndef _EAL_WINDOWS_H_
+#define _EAL_WINDOWS_H_
+
+/**
+ * @file Facilities private to Windows EAL
+ */
+
+#include <rte_windows.h>
+
+/**
+ * Create a map of processors and cores on the system.
+ */
+void eal_create_cpu_map(void);
+
+/**
+ * Create a thread.
+ *
+ * @param thread
+ *   The location to store the thread id if successful.
+ * @return
+ *   0 for success, -1 if the thread is not created.
+ */
+int eal_thread_create(pthread_t *thread);
+
+#endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/eal/include/pthread.h b/lib/librte_eal/windows/eal/include/pthread.h
index b9dd18e56..cfd53f0b8 100644
--- a/lib/librte_eal/windows/eal/include/pthread.h
+++ b/lib/librte_eal/windows/eal/include/pthread.h
@@ -5,6 +5,8 @@
 #ifndef _PTHREAD_H_
 #define _PTHREAD_H_
 
+#include <stdint.h>
+
 /**
  * This file is required to support the common code in eal_common_proc.c,
  * eal_common_thread.c and common\include\rte_per_lcore.h as Microsoft libc
diff --git a/lib/librte_eal/windows/eal/include/rte_os.h b/lib/librte_eal/windows/eal/include/rte_os.h
index e1e0378e6..510e39e03 100644
--- a/lib/librte_eal/windows/eal/include/rte_os.h
+++ b/lib/librte_eal/windows/eal/include/rte_os.h
@@ -8,20 +8,18 @@
 /**
 * This header should contain any function/macro definition
  * which are not supported natively or named differently in the
- * Windows OS. Functions will be added in future releases.
+ * Windows OS. It must not include Windows-specific headers.
  */
 
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-#include <windows.h>
-#include <basetsd.h>
-#include <pthread.h>
-#include <stdio.h>
-
-/* limits.h replacement */
-#include <stdlib.h>
+/* limits.h replacement, value as in <windows.h> */
 #ifndef PATH_MAX
 #define PATH_MAX _MAX_PATH
 #endif
@@ -31,8 +29,6 @@ extern "C" {
 /* strdup is deprecated in Microsoft libc and _strdup is preferred */
 #define strdup(str) _strdup(str)
 
-typedef SSIZE_T ssize_t;
-
 #define strtok_r(str, delim, saveptr) strtok_s(str, delim, saveptr)
 
 #define index(a, b)     strchr(a, b)
@@ -40,22 +36,14 @@ typedef SSIZE_T ssize_t;
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
-/**
- * Create a thread.
- * This function is private to EAL.
- *
- * @param thread
- *   The location to store the thread id if successful.
- * @return
- *   0 for success, -1 if the thread is not created.
- */
-int eal_thread_create(pthread_t *thread);
+/* cpu_set macros implementation */
+#define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
+#define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
+#define RTE_CPU_FILL(set) CPU_FILL(set)
+#define RTE_CPU_NOT(dst, src) CPU_NOT(dst, src)
 
-/**
- * Create a map of processors and cores on the system.
- * This function is private to EAL.
- */
-void eal_create_cpu_map(void);
+/* as in <windows.h> */
+typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
 static inline int
@@ -86,12 +74,6 @@ asprintf(char **buffer, const char *format, ...)
 }
 #endif /* RTE_TOOLCHAIN_GCC */
 
-/* cpu_set macros implementation */
-#define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
-#define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
-#define RTE_CPU_FILL(set) CPU_FILL(set)
-#define RTE_CPU_NOT(dst, src) CPU_NOT(dst, src)
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/windows/eal/include/rte_windows.h b/lib/librte_eal/windows/eal/include/rte_windows.h
new file mode 100644
index 000000000..ed6e4c148
--- /dev/null
+++ b/lib/librte_eal/windows/eal/include/rte_windows.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#ifndef _RTE_WINDOWS_H_
+#define _RTE_WINDOWS_H_
+
+/**
+ * @file Windows-specific facilities
+ *
+ * This file should be included by DPDK libraries and applications
+ * that need access to Windows API. It includes platform SDK headers
+ * in compatible order with proper options and defines error-handling macros.
+ */
+
+/* Disable excessive libraries. */
+#ifndef WIN32_LEAN_AND_MEAN
+#define WIN32_LEAN_AND_MEAN
+#endif
+
+/* Must come first. */
+#include <windows.h>
+
+#include <basetsd.h>
+#include <psapi.h>
+
+/* Have GUIDs defined. */
+#ifndef INITGUID
+#define INITGUID
+#endif
+#include <initguid.h>
+
+/**
+ * Log GetLastError() with context, usually a Win32 API function and arguments.
+ */
+#define RTE_LOG_WIN32_ERR(...) \
+	RTE_LOG(DEBUG, EAL, RTE_FMT("GetLastError()=%lu: " \
+		RTE_FMT_HEAD(__VA_ARGS__,) "\n", GetLastError(), \
+		RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#endif /* _RTE_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/eal/meson.build b/lib/librte_eal/windows/eal/meson.build
index 2a062c365..21cd84459 100644
--- a/lib/librte_eal/windows/eal/meson.build
+++ b/lib/librte_eal/windows/eal/meson.build
@@ -6,6 +6,7 @@ eal_inc += include_directories('include')
 env_objs = []
 env_headers = files(
 	'include/rte_os.h',
+	'include/rte_windows.h',
 )
 common_sources = files(
 	'../../common/eal_common_bus.c',
-- 
2.25.1



* [dpdk-dev] [RFC PATCH 3/9] eal/windows: improve CPU and NUMA node detection
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 2/9] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 4/9] eal/windows: initialize hugepage info Dmitry Kozlyuk
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Anand Rawat, Jeff Shaw

1. Map CPU cores to their respective NUMA nodes as reported by system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add EAL private function to map DPDK socket ID to NUMA node number.

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal/eal_lcore.c   | 185 ++++++++++++++---------
 lib/librte_eal/windows/eal/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal/eal_lcore.c b/lib/librte_eal/windows/eal/eal_lcore.c
index 82ee45413..c11f37de3 100644
--- a/lib/librte_eal/windows/eal/eal_lcore.c
+++ b/lib/librte_eal/windows/eal/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				GetLastError());
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* A NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e.g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited to 32 or 64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() is not yet available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal/eal_windows.h b/lib/librte_eal/windows/eal/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal/eal_windows.h
+++ b/lib/librte_eal/windows/eal/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.1



* [dpdk-dev] [RFC PATCH 4/9] eal/windows: initialize hugepage info
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (2 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 3/9] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 5/9] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic

Add hugepages discovery ("large pages" in Windows terminology)
and update documentation for required privilege setup.
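
For a quick self-check of the privilege setup, large-page allocation can
be exercised with plain Win32 calls (a standalone sketch, not part of the
patch; it assumes SeLockMemoryPrivilege has already been enabled for the
process token, as hugepage_claim_privilege() below does):

    #include <stdio.h>
    #include <windows.h>

    int
    main(void)
    {
        SIZE_T page_size = GetLargePageMinimum();
        void *mem;

        if (page_size == 0) {
            fprintf(stderr, "large pages not supported\n");
            return 1;
        }

        /* Fails with ERROR_PRIVILEGE_NOT_HELD (1314) if the privilege
         * is missing or a logoff is still pending.
         */
        mem = VirtualAlloc(NULL, page_size,
            MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
        if (mem == NULL) {
            fprintf(stderr, "allocation failed, error %lu\n",
                GetLastError());
            return 1;
        }

        printf("allocated a %llu-byte large page\n",
            (unsigned long long)page_size);
        VirtualFree(mem, 0, MEM_RELEASE);
        return 0;
    }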

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                         |   2 +
 doc/guides/windows_gsg/build_dpdk.rst      |  20 ----
 doc/guides/windows_gsg/index.rst           |   1 +
 doc/guides/windows_gsg/run_apps.rst        |  47 +++++++++
 lib/librte_eal/windows/eal/eal.c           |  14 +++
 lib/librte_eal/windows/eal/eal_hugepages.c | 108 +++++++++++++++++++++
 lib/librte_eal/windows/eal/meson.build     |   1 +
 7 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal/eal_hugepages.c

diff --git a/config/meson.build b/config/meson.build
index abedd76f2..73cf69814 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -263,6 +263,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminology) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open the *Local Security Policy* snap-in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. The privilege is applied upon next logon. In particular, if it has been
+   granted to the current user, a logoff is required before it is available.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal/eal.c b/lib/librte_eal/windows/eal/eal.c
index 2cf7a04ef..a84b6147a 100644
--- a/lib/librte_eal/windows/eal/eal.c
+++ b/lib/librte_eal/windows/eal/eal.c
@@ -18,8 +18,11 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -242,6 +245,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information.");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal/eal_hugepages.c b/lib/librte_eal/windows/eal/eal_hugepages.c
new file mode 100644
index 000000000..b099d13f9
--- /dev/null
+++ b/lib/librte_eal/windows/eal/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available in Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem in Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal/meson.build b/lib/librte_eal/windows/eal/meson.build
index 21cd84459..8b407c9ae 100644
--- a/lib/librte_eal/windows/eal/meson.build
+++ b/lib/librte_eal/windows/eal/meson.build
@@ -22,6 +22,7 @@ common_sources = files(
 )
 env_sources = files('eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_thread.c',
 	'getopt.c',
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [RFC PATCH 5/9] eal: introduce internal wrappers for file operations
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (3 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 4/9] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  7:04   ` Jerin Jacob
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 6/9] eal: introduce memory management wrappers Dmitry Kozlyuk
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Bruce Richardson, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

EAL common code uses file locking and truncation. Introduce
OS-independent wrappers in order to support both POSIX and Windows:

* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

The wrappers follow POSIX semantics, but the interface is not POSIX,
so that it can be made cleaner, e.g. by not mixing the locking
operation with the behavior on conflict.

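For illustration, a caller in EAL common code could combine the two
wrappers as follows (a minimal sketch; "config_path" and struct
shared_config are hypothetical, and both wrappers set rte_errno on
failure):

    int fd = open(config_path, O_CREAT | O_RDWR, 0600); /* hypothetical */
    if (fd < 0)
        return -1;
    /* Reserve space for the shared data, then take an exclusive lock,
     * failing immediately if another process already holds it.
     */
    if (eal_file_truncate(fd, sizeof(struct shared_config)) < 0 ||
            eal_file_lock(fd, RTE_FLOCK_EXCLUSIVE, RTE_FLOCK_RETURN) < 0) {
        close(fd);
        return -1;
    }
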
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>

---
 lib/librte_eal/common/eal_private.h | 45 ++++++++++++++++
 lib/librte_eal/freebsd/eal/eal.c    | 40 ++++++++++++++
 lib/librte_eal/linux/eal/eal.c      | 40 ++++++++++++++
 lib/librte_eal/windows/eal/eal.c    | 83 +++++++++++++++++++++++++++++
 4 files changed, 208 insertions(+)

diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index ddcfbe2e4..0130571e8 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -443,4 +443,49 @@ rte_option_usage(void);
 uint64_t
 eal_get_baseaddr(void);
 
+/** File locking operation. */
+enum rte_flock_op {
+	RTE_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	RTE_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	RTE_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum rte_flock_mode {
+	RTE_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	RTE_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by the flock(2) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_lock(int fd, enum rte_flock_op op, enum rte_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by the POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/eal/eal.c b/lib/librte_eal/freebsd/eal/eal.c
index 6ae37e7e6..4bbcc7ab7 100644
--- a/lib/librte_eal/freebsd/eal/eal.c
+++ b/lib/librte_eal/freebsd/eal/eal.c
@@ -697,6 +697,46 @@ static void rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum rte_flock_op op, enum rte_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == RTE_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case RTE_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case RTE_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case RTE_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
 /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
diff --git a/lib/librte_eal/linux/eal/eal.c b/lib/librte_eal/linux/eal/eal.c
index 9530ee55f..d75c162c1 100644
--- a/lib/librte_eal/linux/eal/eal.c
+++ b/lib/librte_eal/linux/eal/eal.c
@@ -956,6 +956,46 @@ is_iommu_enabled(void)
 	return n > 2;
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum rte_flock_op op, enum rte_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == RTE_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case RTE_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case RTE_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case RTE_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
 /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
diff --git a/lib/librte_eal/windows/eal/eal.c b/lib/librte_eal/windows/eal/eal.c
index a84b6147a..4932185ec 100644
--- a/lib/librte_eal/windows/eal/eal.c
+++ b/lib/librte_eal/windows/eal/eal.c
@@ -224,6 +224,89 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	/* SetFilePointer() only moves the file pointer;
+	 * SetEndOfFile() actually resizes the file.
+	 */
+	if (!SetEndOfFile(handle)) {
+		RTE_LOG_WIN32_ERR("SetEndOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum rte_flock_op op, enum rte_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == RTE_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == RTE_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum rte_flock_op op, enum rte_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case RTE_FLOCK_EXCLUSIVE:
+	case RTE_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case RTE_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
+
  /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [RFC PATCH 6/9] eal: introduce memory management wrappers
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (4 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 5/9] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  7:31   ` Thomas Monjalon
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 7/9] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

System memory management is implemented differently for POSIX and
Windows. Introduce wrapper functions for operations used across DPDK:

* rte_mem_map()
  Create memory mapping for a regular file or a page file (swap).
  This supports mapping to a reserved memory region even on Windows.

* rte_mem_unmap()
  Remove mapping created with rte_mem_map().

* rte_get_page_size()
  Obtain default system page size.

* rte_mem_lock()
  Make arbitrary-sized memory region non-swappable.

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive.

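For example, a page of scratch memory can be obtained and released
using only these wrappers (a minimal sketch; error handling is elided,
each wrapper sets rte_errno on failure):

    size_t sz = rte_get_page_size();
    void *va = rte_mem_map(NULL, sz, RTE_PROT_READ | RTE_PROT_WRITE,
            RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
    if (rte_mem_lock(va, sz) == 0)
        memset(va, 0, sz); /* now resident and non-swappable */
    rte_mem_unmap(va, sz);
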
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                         |  10 +-
 lib/librte_eal/common/eal_private.h        |  71 +++-
 lib/librte_eal/common/include/rte_memory.h |  63 +++
 lib/librte_eal/freebsd/eal/eal_memory.c    | 117 ++++++
 lib/librte_eal/linux/eal/eal_memory.c      | 117 ++++++
 lib/librte_eal/rte_eal_exports.def         |   4 +
 lib/librte_eal/rte_eal_version.map         |   4 +
 lib/librte_eal/windows/eal/eal.c           |   6 +
 lib/librte_eal/windows/eal/eal_memory.c    | 433 +++++++++++++++++++++
 lib/librte_eal/windows/eal/eal_windows.h   |  51 +++
 lib/librte_eal/windows/eal/meson.build     |   1 +
 11 files changed, 871 insertions(+), 6 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal/eal_memory.c

diff --git a/config/meson.build b/config/meson.build
index 73cf69814..295425742 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -256,14 +256,20 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it from advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
 	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 0130571e8..2e5e4312a 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,16 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum rte_mem_reserve_flags {
+	/** Reserve hugepages. */
+	RTE_RESERVE_HUGEPAGES = 1 << 0,
+	/** Fail if requested address is not available. */
+	RTE_RESERVE_EXACT_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +226,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to eal_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -232,8 +243,8 @@ int rte_eal_check_module(const char *module_name);
 #define EAL_VIRTUAL_AREA_UNMAP (1 << 2)
 /**< immediately unmap reserved virtual area. */
 void *
-eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
+	int flags, enum rte_mem_reserve_flags reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -488,4 +499,56 @@ int eal_file_lock(int fd, enum rte_flock_op op, enum rte_flock_mode mode);
  */
 int eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address. The system may not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options.
+ * @return
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *eal_mem_reserve(void *requested_addr, size_t size,
+	enum rte_mem_reserve_flags flags);
+
+/**
+ * Allocate a contiguous chunk of virtual memory.
+ *
+ * Use eal_mem_free() to free allocated memory.
+ *
+ * @param size
+ *  Number of bytes to allocate.
+ * @param page_size
+ *  If non-zero, memory must be allocated in hugepages
+ *  of the specified size. The @code size @endcode parameter
+ *  must then be a multiple of the largest hugepage size requested.
+ * @return
+ *  Address of allocated memory or NULL on failure (rte_errno is set).
+ */
+void *eal_mem_alloc(size_t size, enum rte_page_sizes page_size);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If @code virt @endcode and @code size @endcode describe a part of the
+ * reserved region, only this part of the region is freed (accurately
+ * up to the system page size). If @code virt @endcode points to allocated
+ * memory, @code size @endcode must match the one specified on allocation.
+ * The behavior is undefined if the memory pointed by @code virt @endcode
+ * is obtained from another source than listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void eal_mem_free(void *virt, size_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 3d8d0bd69..1742fde9a 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -85,6 +85,69 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * Memory protection flags.
+ */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,   /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/**
+ * Memory mapping additional flags.
+ *
+ * On Linux and FreeBSD, each flag is semantically equivalent
+ * to the OS-specific mmap(3) flag with the same or similar name.
+ * On Windows, POSIX and MAP_ANONYMOUS semantics are followed.
+ */
+enum rte_map_flags {
+	/** Changes of mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/** Fail if requested address cannot be taken. */
+	RTE_MAP_FIXED = 1 << 3
+};
+
+/**
+ * OS-independent implementation of POSIX mmap(3)
+ * with MAP_ANONYMOUS Linux/FreeBSD extension.
+ */
+__rte_experimental
+void *rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_experimental
+int rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Positive page size in bytes.
+ */
+__rte_experimental
+int rte_get_page_size(void);
+
+/**
+ * Lock region in physical memory and prevent it from swapping.
+ *
+ * @param virt
+ *   The virtual address.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ */
+__rte_experimental
+int rte_mem_lock(const void *virt, size_t size);
+
 /**
  * Lock page in physical memory and prevent from swapping.
  *
diff --git a/lib/librte_eal/freebsd/eal/eal_memory.c b/lib/librte_eal/freebsd/eal/eal_memory.c
index a97d8f0f0..bcceba636 100644
--- a/lib/librte_eal/freebsd/eal/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal/eal_memory.c
@@ -534,3 +534,120 @@ rte_eal_memseg_init(void)
 			memseg_primary_init() :
 			memseg_secondary_init();
 }
+
+/* Forward declarations: the definitions appear later in this file. */
+static void *mem_map(void *requested_addr, size_t size, int prot,
+	int flags, int fd, size_t offset);
+static int mem_unmap(void *virt, size_t size);
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum rte_mem_reserve_flags flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & RTE_RESERVE_HUGEPAGES)
+		sys_flags |= MAP_HUGETLB;
+	if (flags & RTE_RESERVE_EXACT_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_READ, sys_flags, -1, 0);
+}
+
+void *
+eal_mem_alloc(size_t size, enum rte_page_sizes page_size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+
+	if (page_size != 0) {
+		/* as per mmap() manpage, the huge page size is encoded
+		 * as its log2 shifted by MAP_HUGE_SHIFT
+		 */
+		int page_flag = rte_log2_u64(page_size) << MAP_HUGE_SHIFT;
+		flags |= MAP_HUGETLB | page_flag;
+	}
+
+	return mem_map(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+static int
+mem_rte_to_sys_prot(enum rte_mem_prot prot)
+{
+	int sys_prot = 0;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(ERR, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	int sys_prot = 0;
+	int sys_flags = 0;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FIXED)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+int
+rte_get_page_size(void)
+{
+	return getpagesize();
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	return mlock(virt, size);
+}
diff --git a/lib/librte_eal/linux/eal/eal_memory.c b/lib/librte_eal/linux/eal/eal_memory.c
index 7a9c97ff8..72205580a 100644
--- a/lib/librte_eal/linux/eal/eal_memory.c
+++ b/lib/librte_eal/linux/eal/eal_memory.c
@@ -2479,3 +2479,120 @@ rte_eal_memseg_init(void)
 #endif
 			memseg_secondary_init();
 }
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(ERR, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum rte_mem_reserve_flags flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & RTE_RESERVE_HUGEPAGES)
+		sys_flags |= MAP_HUGETLB;
+	if (flags & RTE_RESERVE_EXACT_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_READ, sys_flags, -1, 0);
+}
+
+void *
+eal_mem_alloc(size_t size, enum rte_page_sizes page_size)
+{
+	int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+
+	if (page_size != 0) {
+		/* as per mmap() manpage, the huge page size is encoded
+		 * as its log2 shifted by MAP_HUGE_SHIFT
+		 */
+		int page_flag = rte_log2_u64(page_size) << MAP_HUGE_SHIFT;
+		flags |= MAP_HUGETLB | page_flag;
+	}
+
+	return mem_map(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+static int
+mem_rte_to_sys_prot(enum rte_mem_prot prot)
+{
+	int sys_prot = 0;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	int sys_prot = 0;
+	int sys_flags = 0;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FIXED)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+int
+rte_get_page_size(void)
+{
+	return getpagesize();
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	return mlock(virt, size);
+}
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..bacf9a107 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -5,5 +5,9 @@ EXPORTS
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
 	rte_eal_remote_launch
+	rte_get_page_size
 	rte_log
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_unmap
 	rte_vlog
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index f9ede5b41..07128898f 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -337,5 +337,9 @@ EXPERIMENTAL {
 	rte_thread_is_intr;
 
 	# added in 20.05
+	rte_get_page_size;
 	rte_log_can_log;
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_unmap;
 };
diff --git a/lib/librte_eal/windows/eal/eal.c b/lib/librte_eal/windows/eal/eal.c
index 4932185ec..98afd8a68 100644
--- a/lib/librte_eal/windows/eal/eal.c
+++ b/lib/librte_eal/windows/eal/eal.c
@@ -339,6 +339,12 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal/eal_memory.c b/lib/librte_eal/windows/eal/eal_memory.c
new file mode 100644
index 000000000..f8b312d7c
--- /dev/null
+++ b/lib/librte_eal/windows/eal/eal_memory.c
@@ -0,0 +1,433 @@
+#include <io.h>
+
+#include <rte_errno.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterMax,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* The pointer is named exactly as the function, so that user code
+ * does not depend on whether it is resolved at compile time or
+ * located dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	OSVERSIONINFO info;
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	/* IsWindows10OrGreater() may also be unavailable. */
+	memset(&info, 0, sizeof(info));
+	info.dwOSVersionInfoSize = sizeof(info);
+	GetVersionEx(&info);
+
+	/* Checking for Windows 10+ will also detect Windows Server 2016+.
+	 * Do not abort, because Windows may report a false version depending
+	 * on the executable manifest, compatibility mode, etc.
+	 */
+	if (info.dwMajorVersion < 10)
+		RTE_LOG(DEBUG, EAL, "Windows 10+ or Windows Server 2016+ "
+			"is required for advanced memory features\n");
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")",
+			library_name, function);
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* no VirtualAlloc2() */
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static int
+win32_alloc_error_to_errno(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		return 0;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		return ENOMEM;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		return EINVAL;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum rte_mem_reserve_flags flags)
+{
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & RTE_RESERVE_HUGEPAGES) {
+		RTE_LOG(ERR, EAL, "Hugepage reservation is not supported\n");
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		rte_errno = win32_alloc_error_to_errno(GetLastError());
+		return NULL;
+	}
+
+	if ((flags & RTE_RESERVE_EXACT_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFree(virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFree()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc(size_t size, enum rte_page_sizes page_size)
+{
+	void *addr;
+
+	if (page_size != 0)
+		return eal_mem_alloc_socket(size, SOCKET_ID_ANY);
+
+	addr = VirtualAlloc(
+		NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (addr == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc()");
+		rte_errno = ENOMEM;
+	}
+	return addr;
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+		if (VirtualQuery(requested_addr, &info, sizeof(info)) == 0) {
+			RTE_LOG_WIN32_ERR("VirtualQuery()");
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) &&
+			!VirtualFree(requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+				"<split placeholder>)", requested_addr, size);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	if (requested_addr != NULL)
+		flags |= MEM_REPLACE_PLACEHOLDER;
+
+	addr = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		int err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, "
+			"<replace placeholder>)", requested_addr, size);
+		rte_errno = win32_alloc_error_to_errno(err);
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, ...)", addr, size);
+		return -1;
+	}
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if the region must be in reserved state but is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	if (VirtualQuery(addr, &info, sizeof(info)) == 0) {
+		RTE_LOG_WIN32_ERR("VirtualQuery()");
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFree(addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+			"MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)", addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFree(addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	/* Note: CreateFileMapping() returns NULL on failure,
+	 * not INVALID_HANDLE_VALUE.
+	 */
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* TODO: there is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (!virt) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FIXED) && (virt != requested_addr)) {
+		BOOL ret = UnmapViewOfFile(virt);
+		virt = NULL;
+		if (!ret)
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		rte_errno = EINVAL; /* do not leak a raw Win32 error code */
+		return -1;
+	}
+	return 0;
+}
+
+int
+rte_get_page_size(void)
+{
+	SYSTEM_INFO info;
+	GetSystemInfo(&info);
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes `void*`, work around compiler warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock()");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal/eal_windows.h b/lib/librte_eal/windows/eal/eal_windows.h
index 390d2fd66..002d7e4a5 100644
--- a/lib/librte_eal/windows/eal/eal_windows.h
+++ b/lib/librte_eal/windows/eal/eal_windows.h
@@ -36,4 +36,55 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with @ref eal_mem_reserve()
+ * or decommitted from hugepages by @ref eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and @code rte_errno @endcode is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit.
+ *
+ * The @code addr @endcode and @code size @endcode must match
+ * the location and size of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/eal/meson.build b/lib/librte_eal/windows/eal/meson.build
index 8b407c9ae..7b0f03f0a 100644
--- a/lib/librte_eal/windows/eal/meson.build
+++ b/lib/librte_eal/windows/eal/meson.build
@@ -24,6 +24,7 @@ env_sources = files('eal.c',
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memory.c',
 	'eal_thread.c',
 	'getopt.c',
 )
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [RFC PATCH 7/9] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (5 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 6/9] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 8/9] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev; +Cc: Dmitry Malloy (MESHCHANINOV), Dmitry Kozlyuk, Anatoly Burakov

Clang on Windows follows the MS ABI, where enum values are signed
32-bit. Enum rte_page_sizes has members with values of 2^32 and beyond.
EAL cannot use -fno-ms-compatibility because its code is OS-dependent.
The only option is to define these values outside the enum, but this
prohibits using -fstrict-enums. Another consequence is that enum
rte_page_sizes cannot be used to hold a page size, because on Windows
it cannot represent all possible values.

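After this change both spellings stay source-compatible; page sizes
just have to be stored in a plain integer type, e.g. (a minimal
sketch):

    uint64_t page_sz;

    page_sz = RTE_PGSIZE_1G; /* enum member, fits into signed 32 bits */
    page_sz = RTE_PGSIZE_4G; /* macro on Windows + Clang, enum elsewhere */
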
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/include/rte_memory.h | 6 ++++++
 lib/librte_eal/meson.build                 | 1 +
 2 files changed, 7 insertions(+)

diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 1742fde9a..5b0e2d8b5 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -34,8 +34,14 @@ enum rte_page_sizes {
 	RTE_PGSIZE_256M  = 1ULL << 28,
 	RTE_PGSIZE_512M  = 1ULL << 29,
 	RTE_PGSIZE_1G    = 1ULL << 30,
+/* Work around Clang on Windows being limited to 32-bit underlying type. */
+#if !defined(RTE_TOOLCHAIN_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)
 	RTE_PGSIZE_4G    = 1ULL << 32,
 	RTE_PGSIZE_16G   = 1ULL << 34,
+#else
+#define RTE_PGSIZE_4G  (1ULL << 32)
+#define RTE_PGSIZE_16G (1ULL << 34)
+#endif
 };
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 1730d603f..ec80bd6be 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -23,6 +23,7 @@ endif
 if cc.has_header('getopt.h')
 	cflags += ['-DHAVE_GETOPT_H', '-DHAVE_GETOPT', '-DHAVE_GETOPT_LONG']
 endif
+cflags += '-fno-strict-enums'
 sources = common_sources + env_sources
 objs = common_objs + env_objs
 headers = common_headers + env_headers
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [RFC PATCH 8/9] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (6 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 7/9] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 9/9] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
  9 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon

The limited version imported previously lacks at least the SLIST
macros. Import the complete file from FreeBSD, since its license
exception has already been approved by the Technical Board.

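For instance, the SLIST family can now be used as on FreeBSD (a
minimal sketch with a hypothetical element type; requires <stdio.h>):

    struct entry {
        int value;
        SLIST_ENTRY(entry) next;
    };
    SLIST_HEAD(entry_list, entry);

    struct entry_list list = SLIST_HEAD_INITIALIZER(list);
    struct entry e = { .value = 42 };
    struct entry *it;

    SLIST_INSERT_HEAD(&list, &e, next);
    SLIST_FOREACH(it, &list, next)
        printf("%d\n", it->value);
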
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 .../windows/eal/include/sys/queue.h           | 663 ++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/eal/include/sys/queue.h b/lib/librte_eal/windows/eal/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/eal/include/sys/queue.h
+++ b/lib/librte_eal/windows/eal/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [RFC PATCH 9/9] eal/windows: implement basic memory management
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (7 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 8/9] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-03-30  4:10 ` Dmitry Kozlyuk
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
  9 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30  4:10 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Dmitry Kozlyuk, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

Basic memory management supports core libraries and PMDs operating in
IOVA as PA mode. It uses a kernel-mode driver to obtain IOVAs of
hugepages allocated in user mode.
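
For illustration, a minimal user-level sketch (not part of this patch)
of the path this enables. It assumes rte_eal_init() succeeded with
hugepages enabled and the virt2phys driver installed; the function name
is invented for the example, the APIs are existing DPDK ones:

	#include <rte_malloc.h>
	#include <rte_memory.h>

	static rte_iova_t
	buffer_iova_example(void)
	{
		/* EAL heap memory, hugepage-backed unless --no-huge. */
		void *buf = rte_malloc(NULL, 1 << 20, 0);

		if (buf == NULL)
			return RTE_BAD_IOVA;

		/* Resolved through the virt2phys kernel driver;
		 * returns RTE_BAD_IOVA if the driver is missing.
		 */
		return rte_malloc_virt2iova(buf);
	}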

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                            |   2 +-
 lib/librte_eal/common/eal_common_fbarray.c    |  57 +-
 lib/librte_eal/common/eal_common_memory.c     |  50 +-
 lib/librte_eal/common/malloc_heap.c           |   1 +
 lib/librte_eal/freebsd/eal/eal_memory.c       |   1 -
 lib/librte_eal/linux/eal/eal_memory.c         |   2 +-
 lib/librte_eal/meson.build                    |   5 +-
 lib/librte_eal/rte_eal_exports.def            | 119 ++-
 lib/librte_eal/windows/eal/eal.c              |  47 ++
 lib/librte_eal/windows/eal/eal_memalloc.c     | 423 ++++++++++
 lib/librte_eal/windows/eal/eal_memory.c       | 735 +++++++++++++++++-
 lib/librte_eal/windows/eal/eal_mp.c           | 103 +++
 lib/librte_eal/windows/eal/eal_windows.h      |  23 +
 lib/librte_eal/windows/eal/include/rte_os.h   |   4 +
 .../windows/eal/include/rte_virt2phys.h       |  34 +
 .../windows/eal/include/rte_windows.h         |   2 +
 lib/librte_eal/windows/eal/include/unistd.h   |   3 +
 lib/librte_eal/windows/eal/meson.build        |  12 +
 18 files changed, 1557 insertions(+), 66 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal/eal_mp.c
 create mode 100644 lib/librte_eal/windows/eal/include/rte_virt2phys.h

diff --git a/config/meson.build b/config/meson.build
index 295425742..49d208006 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -270,7 +270,7 @@ if is_windows
 		add_project_link_arguments('-lmincore', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 1312f936b..06f8e6055 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,15 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -85,19 +85,16 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
 		/* pass errno up the chain */
 		rte_errno = errno;
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FIXED, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -735,7 +732,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -756,9 +753,12 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		void *new_data = rte_mem_map(
+			data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_FIXED | RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS,
+			fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
 					__func__, strerror(errno));
 			goto fail;
@@ -778,7 +778,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 					__func__, path, strerror(errno));
 			rte_errno = errno;
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, RTE_FLOCK_EXCLUSIVE, RTE_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
 					__func__, path, strerror(errno));
 			rte_errno = EBUSY;
@@ -789,10 +790,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, RTE_FLOCK_SHARED, RTE_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -824,7 +823,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -862,7 +861,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -895,10 +894,8 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, RTE_FLOCK_SHARED, RTE_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -916,7 +913,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -944,8 +941,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -964,7 +960,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -999,8 +995,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1025,7 +1020,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, RTE_FLOCK_EXCLUSIVE, RTE_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,14 +1037,14 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, RTE_FLOCK_SHARED, RTE_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index cc7d54e0c..5d2506184 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,7 +11,6 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
@@ -43,7 +42,7 @@ static uint64_t system_page_sz;
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, enum rte_mem_reserve_flags reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -51,9 +50,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_get_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -97,24 +94,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
-			*size -= page_sz;
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
+			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -124,20 +121,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -153,7 +147,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -171,12 +165,12 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	return aligned_addr;
@@ -532,10 +526,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	int page_size = rte_get_page_size();
+	uintptr_t aligned = (virtual & ~(page_size - 1));
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 842eb9de7..6534c895c 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -729,6 +729,7 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg,
 		if (ret != NULL)
 			return ret;
 	}
+
 	return NULL;
 }
 
diff --git a/lib/librte_eal/freebsd/eal/eal_memory.c b/lib/librte_eal/freebsd/eal/eal_memory.c
index bcceba636..b38cc380b 100644
--- a/lib/librte_eal/freebsd/eal/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal/eal_memory.c
@@ -393,7 +393,6 @@ alloc_va_space(struct rte_memseg_list *msl)
 	return 0;
 }
 
-
 static int
 memseg_primary_init(void)
 {
diff --git a/lib/librte_eal/linux/eal/eal_memory.c b/lib/librte_eal/linux/eal/eal_memory.c
index 72205580a..1d24579b5 100644
--- a/lib/librte_eal/linux/eal/eal_memory.c
+++ b/lib/librte_eal/linux/eal/eal_memory.c
@@ -2518,7 +2518,7 @@ eal_mem_reserve(void *requested_addr, size_t size,
 	if (flags & RTE_RESERVE_EXACT_ADDRESS)
 		sys_flags |= MAP_FIXED;
 
-	return mem_map(requested_addr, size, PROT_READ, sys_flags, -1, 0);
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
 }
 
 void *
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index ec80bd6be..6a6ec0ddb 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -10,6 +10,7 @@ subdir('common') # defines common_sources, common_objs, etc.
 # Now do OS/exec-env specific settings, including building kernel modules
 # The <exec-env>/eal/meson.build file should define env_sources, etc.
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
 subdir(exec_env + '/eal')
 
 allow_experimental_apis = true
@@ -23,7 +24,9 @@ endif
 if cc.has_header('getopt.h')
 	cflags += ['-DHAVE_GETOPT_H', '-DHAVE_GETOPT', '-DHAVE_GETOPT_LONG']
 endif
-cflags += '-fno-strict-enums'
+if cc.get_id() == 'clang'
+	cflags += '-fno-strict-enums'
+endif
 sources = common_sources + env_sources
 objs = common_objs + env_objs
 headers = common_headers + env_headers
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index bacf9a107..854b83bcd 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,13 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
-	rte_get_page_size
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
 	rte_log
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
+	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_get_page_size
 	rte_mem_lock
 	rte_mem_map
 	rte_mem_unmap
-	rte_vlog
diff --git a/lib/librte_eal/windows/eal/eal.c b/lib/librte_eal/windows/eal/eal.c
index 98afd8a68..8e8478823 100644
--- a/lib/librte_eal/windows/eal/eal.c
+++ b/lib/librte_eal/windows/eal/eal.c
@@ -93,6 +93,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -345,6 +363,35 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		rte_eal_init_alert("Cannot open virt2phys driver interface");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal/eal_memalloc.c b/lib/librte_eal/windows/eal/eal_memalloc.c
new file mode 100644
index 000000000..c7c3cf8df
--- /dev/null
+++ b/lib/librte_eal/windows/eal/eal_memalloc.c
@@ -0,0 +1,423 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+#include <rte_windows.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bug check, this should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d\n", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from the OS;
+		 * there is no address hint to enforce.
+		 */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu)\n", requested_addr, alloc_sz);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	eal_mem_decommit(addr, alloc_sz);
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len))
+		return -1;
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info);
+				i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal/eal_memory.c b/lib/librte_eal/windows/eal/eal_memory.c
index f8b312d7c..9c0caca4a 100644
--- a/lib/librte_eal/windows/eal/eal_memory.c
+++ b/lib/librte_eal/windows/eal/eal_memory.c
@@ -1,11 +1,23 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2010-2014 Intel Corporation (functions from Linux EAL)
+ * Copyright (c) 2020 Dmitry Kozlyuk (Windows specifics)
+ */
+
+#include <inttypes.h>
 #include <io.h>
 
 #include <rte_errno.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
 #include "eal_private.h"
 #include "eal_windows.h"
 
+#include <rte_virt2phys.h>
+
 /* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
  * Provide a copy of definitions and code to load it dynamically.
  * Note: definitions are copied verbatim from Microsoft documentation
@@ -120,6 +132,119 @@ eal_mem_win32api_init(void)
 
 #endif /* no VirtualAlloc2() */
 
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Physical addresses are always used as IOVA on Windows if they can be
+ * obtained.
+ */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
 /* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
 static int
 win32_alloc_error_to_errno(DWORD code)
@@ -360,7 +485,7 @@ rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
 		return NULL;
 	}
 
-	/* TODO: there is a race for the requested_addr between mem_free()
+	/* There is a race for the requested_addr between mem_free()
 	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
 	 * region with a mapping in a single operation, but it does not support
 	 * private mappings.
@@ -410,6 +535,16 @@ rte_mem_unmap(void *virt, size_t size)
 	return 0;
 }
 
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* Windows strategy for memory allocation is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless the user provides an address hint.
+	 */
+	return 0;
+}
+
 int
 rte_get_page_size(void)
 {
@@ -431,3 +566,601 @@ rte_mem_lock(const void *virt, size_t size)
 
 	return 0;
 }
+
+#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
+
+static int
+memseg_alloc_list(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = 1; /* mark it as a heap segment */
+
+	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zx kB at socket %i\n",
+			(size_t)page_sz >> 10, socket_id);
+
+	return 0;
+}
+
+static int
+memseg_alloc_va_space(struct rte_memseg_list *msl)
+{
+	uint64_t page_sz;
+	size_t mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, 0);
+	if (addr == NULL) {
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Could not mmap %zu bytes "
+				"at [%p] - use '--" OPT_BASE_VIRTADDR "'"
+				" option\n", mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	return 0;
+}
+
+static int
+memseg_primary_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
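+
+	/*
+	 * worked example (illustrative only, assuming the default limits
+	 * RTE_MAX_MEM_MB=524288, RTE_MAX_MEM_MB_PER_TYPE=65536,
+	 * RTE_MAX_MEMSEG_PER_TYPE=32768, RTE_MAX_MEMSEG_PER_LIST=8192
+	 * and RTE_MAX_MEM_MB_PER_LIST=32768): one socket with only 2M
+	 * pages yields one memory type, capped at min(64G, 512G) = 64G.
+	 * a list holds min(8192, 32768) = 8192 segments, which is
+	 * min(8192 * 2M, 32G) = 16G of memory per list, so this type
+	 * gets min(32768 / 8192, 64G / 16G) = 4 lists of 8192 pages.
+	 */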
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how much segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (memseg_alloc_list(msl, pagesz, n_segs,
+					socket_id, cur_seglist))
+				goto out;
+
+			if (memseg_alloc_va_space(msl)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int
+memseg_secondary_init(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+		return memseg_primary_init();
+	return memseg_secondary_init();
+}
+
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+static int
+calc_num_pages_per_socket(uint64_t *memory,
+		struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used,
+		unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from cpu mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skips if the memory on specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so lets see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int) (internal_config.memory / 0x100000);
+		available = requested - (unsigned int) (total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
+
+/* Limit is checked by the validator itself, nothing left to analyze. */
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+static int
+eal_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		/* also initialize hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; socket_id++)
+		memory[socket_id] = internal_config.socket_mem[socket_id];
+
+	/* calculate final number of pages */
+	if (calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+				if (pages == NULL)
+					return -1;
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket "
+					"limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs, cur_seg;
+	uint64_t page_sz;
+	void *addr;
+	struct rte_fbarray *arr;
+	struct rte_memseg *ms;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	/* create a memseg list */
+	msl = &mcfg->memsegs[0];
+
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = internal_config.memory / page_sz;
+
+	if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
+		sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		return -1;
+	}
+
+	addr = eal_mem_alloc(internal_config.memory, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate %zu bytes\n",
+			internal_config.memory);
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->page_sz = page_sz;
+	msl->socket_id = 0;
+	msl->len = internal_config.memory;
+	msl->heap = 1;
+
+	/* populate memsegs. each memseg is one page long */
+	for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
+		arr = &msl->memseg_arr;
+
+		ms = rte_fbarray_get(arr, cur_seg);
+		ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, cur_seg);
+
+		addr = RTE_PTR_ADD(addr, (size_t)page_sz);
+	}
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal/eal_mp.c b/lib/librte_eal/windows/eal/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, a stub must log a warning instead,
+ * and a comment must document what requires the success emulation.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* The real (non-stub) function succeeds when multi-process is not supported. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* The common memory allocator depends on this function succeeding. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal/eal_windows.h b/lib/librte_eal/windows/eal/eal_windows.h
index 002d7e4a5..4b504a023 100644
--- a/lib/librte_eal/windows/eal/eal_windows.h
+++ b/lib/librte_eal/windows/eal/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,6 +52,13 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
 /**
  * Locate Win32 memory management routines in system libraries.
  *
diff --git a/lib/librte_eal/windows/eal/include/rte_os.h b/lib/librte_eal/windows/eal/include/rte_os.h
index 510e39e03..62805a307 100644
--- a/lib/librte_eal/windows/eal/include/rte_os.h
+++ b/lib/librte_eal/windows/eal/include/rte_os.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define open _open
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
diff --git a/lib/librte_eal/windows/eal/include/rte_virt2phys.h b/lib/librte_eal/windows/eal/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/eal/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
diff --git a/lib/librte_eal/windows/eal/include/rte_windows.h b/lib/librte_eal/windows/eal/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/eal/include/rte_windows.h
+++ b/lib/librte_eal/windows/eal/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/eal/include/unistd.h b/lib/librte_eal/windows/eal/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/eal/include/unistd.h
+++ b/lib/librte_eal/windows/eal/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/eal/meson.build b/lib/librte_eal/windows/eal/meson.build
index 7b0f03f0a..519bb5a3b 100644
--- a/lib/librte_eal/windows/eal/meson.build
+++ b/lib/librte_eal/windows/eal/meson.build
@@ -7,24 +7,36 @@ env_objs = []
 env_headers = files(
 	'include/rte_os.h',
 	'include/rte_windows.h',
+	'include/rte_virt2phys.h',
 )
 common_sources = files(
 	'../../common/eal_common_bus.c',
 	'../../common/eal_common_class.c',
 	'../../common/eal_common_devargs.c',
 	'../../common/eal_common_errno.c',
+	'../../common/eal_common_fbarray.c',
 	'../../common/eal_common_launch.c',
 	'../../common/eal_common_lcore.c',
 	'../../common/eal_common_log.c',
+	'../../common/eal_common_mcfg.c',
+	'../../common/eal_common_memalloc.c',
+	'../../common/eal_common_memory.c',
+	'../../common/eal_common_memzone.c',
 	'../../common/eal_common_options.c',
+	'../../common/eal_common_tailqs.c',
 	'../../common/eal_common_thread.c',
+	'../../common/malloc_elem.c',
+	'../../common/malloc_heap.c',
+	'../../common/rte_malloc.c',
 	'../../common/rte_option.c',
 )
 env_sources = files('eal.c',
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memalloc.c',
 	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'getopt.c',
 )
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-03-30  4:10 ` [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
@ 2020-03-30  6:58   ` Jerin Jacob
  2020-03-30 13:41     ` Dmitry Kozlyuk
  2020-04-10  1:45   ` Ranjit Menon
  1 sibling, 1 reply; 218+ messages in thread
From: Jerin Jacob @ 2020-03-30  6:58 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV)

On Mon, Mar 30, 2020 at 9:40 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> This patch is for dpdk-kmods tree.
>
> This driver supports Windows EAL memory management by translating
> current process virtual addresses to physical addresses (IOVA).
> Standalone virt2phys allows using DPDK without a PMD and provides a
> reference implementation. UIO drivers might also implement the virt2phys
> interface, thus rendering this separate driver unneeded.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  windows/README.rst                          |  79 +++++++
>  windows/virt2phys/virt2phys.c               | 129 +++++++++++
>  windows/virt2phys/virt2phys.h               |  34 +++
>  windows/virt2phys/virt2phys.inf             |  85 ++++++++
>  windows/virt2phys/virt2phys.sln             |  27 +++
>  windows/virt2phys/virt2phys.vcxproj         | 228 ++++++++++++++++++++
>  windows/virt2phys/virt2phys.vcxproj.filters |  36 ++++
>  7 files changed, 618 insertions(+)
>  create mode 100644 windows/README.rst
>  create mode 100755 windows/virt2phys/virt2phys.c
>  create mode 100755 windows/virt2phys/virt2phys.h
>  create mode 100755 windows/virt2phys/virt2phys.inf
>  create mode 100755 windows/virt2phys/virt2phys.sln
>  create mode 100755 windows/virt2phys/virt2phys.vcxproj
>  create mode 100755 windows/virt2phys/virt2phys.vcxproj.filters
>
> diff --git a/windows/README.rst b/windows/README.rst
> new file mode 100644
> index 0000000..84506fa
> --- /dev/null
> +++ b/windows/README.rst
> @@ -0,0 +1,79 @@
> +Developing Windows Drivers
> +==========================
> +
> +Prerequisites
> +-------------
> +
> +Building Windows Drivers is only possible on Windows.
> +
> +1. Visual Studio 2019 Community or Professional Edition
> +2. Windows Driver Kit (WDK) for Windows 10, version 1903

Looks like the WDK license is "Proprietary commercial software".

I am just wondering, can we package this driver in binary form for
end-users whose scope is not to develop Windows DPDK, but to fix any
issues reported by the build CI for Windows due to generic EAL patches
or so?


> +
> +Follow the official instructions to obtain all of the above:
> +https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
> +

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 5/9] eal: introduce internal wrappers for file operations
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 5/9] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-03-30  7:04   ` Jerin Jacob
  0 siblings, 0 replies; 218+ messages in thread
From: Jerin Jacob @ 2020-03-30  7:04 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Bruce Richardson, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

On Mon, Mar 30, 2020 at 9:41 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> EAL common code uses file locking and truncation. Introduce
> OS-independent wrappers in order to support both POSIX and Windows:
>
> * eal_file_lock: lock or unlock an open file.
> * eal_file_truncate: enforce a given size for an open file.
>
> Wrappers follow POSIX semantics, but the interface is not POSIX,
> so that it can be made cleaner, e.g. by not mixing the locking
> operation and the behaviour on conflict.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
>
> ---
>  lib/librte_eal/common/eal_private.h | 45 ++++++++++++++++
>  lib/librte_eal/freebsd/eal/eal.c    | 40 ++++++++++++++
>  lib/librte_eal/linux/eal/eal.c      | 40 ++++++++++++++
>  lib/librte_eal/windows/eal/eal.c    | 83 +++++++++++++++++++++++++++++
>  4 files changed, 208 insertions(+)
>
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index ddcfbe2e4..0130571e8 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -443,4 +443,49 @@ rte_option_usage(void);
>  uint64_t
>  eal_get_baseaddr(void);
>
> +/** File locking operation. */
> +enum rte_flock_op {
> +       RTE_FLOCK_SHARED,    /**< Acquire a shared lock. */
> +       RTE_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
> +       RTE_FLOCK_UNLOCK     /**< Release a previously taken lock. */
> +};
> +
> +/** Behavior on file locking conflict. */
> +enum rte_flock_mode {
> +       RTE_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
> +       RTE_FLOCK_RETURN /**< Return immediately if the file is locked. */
> +};

Avoid using RTE_ for internal symbols. IMO, EAL_FLOCK_* would be
enough for something defined in eal_private.h.
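
That is, a minimal rename sketch (same enum as above, values unchanged,
just the EAL_ prefix):

	/** File locking operation. */
	enum eal_flock_op {
		EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
		EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
		EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
	};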

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 6/9] eal: introduce memory management wrappers
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 6/9] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-03-30  7:31   ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-03-30  7:31 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Anatoly Burakov, Bruce Richardson, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

30/03/2020 06:10, Dmitry Kozlyuk:
> +/**
> + * Lock region in physical memory and prevent it from swapping.
> + *
> + * @param virt
> + *   The virtual address.
> + * @param size
> + *   Size of the region.
> + * @return
> + *   0 on success, negative on error.
> + */
> +__rte_experimental
> +int rte_mem_lock(const void *virt, size_t size);

Please refer to the similar rte_mem_lock_page in the above doxygen comment.
You could use the @see syntax.
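
For example:

 * @return
 *   0 on success, negative on error.
 *
 * @see rte_mem_lock_page
 */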




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-03-30  6:58   ` Jerin Jacob
@ 2020-03-30 13:41     ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-03-30 13:41 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV)

> > +Developing Windows Drivers
> > +==========================
> > +
> > +Prerequisites
> > +-------------
> > +
> > +Building Windows Drivers is only possible on Windows.
> > +
> > +1. Visual Studio 2019 Community or Professional Edition
> > +2. Windows Driver Kit (WDK) for Windows 10, version 1903  
> 
> Looks like the WDK license is "Proprietary commercial software".
> 
> I am just wondering, can we package this driver in binary form for
> end-users whose scope is not to develop Windows DPDK, but to fix any
> issues reported by the build CI for Windows due to generic EAL patches
> or so?

As the cover letter says, this driver should not be required in the end.
For netUIO, Harini announced at the Windows community call on 2020-03-14
that it will be freely available in binary form, hosted by Microsoft.
In general, publishing driver binaries is OK, take virtio for example:
https://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers
Driver signing is a different matter, but CI probably doesn't require it.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-03-30  4:10 ` [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
  2020-03-30  6:58   ` Jerin Jacob
@ 2020-04-10  1:45   ` Ranjit Menon
  2020-04-10  2:50     ` Dmitry Kozlyuk
  1 sibling, 1 reply; 218+ messages in thread
From: Ranjit Menon @ 2020-04-10  1:45 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev; +Cc: Dmitry Malloy (MESHCHANINOV)

On 3/29/2020 9:10 PM, Dmitry Kozlyuk wrote:
> This patch is for dpdk-kmods tree.
> 
> This driver supports Windows EAL memory management by translating
> current process virtual addresses to physical addresses (IOVA).
> Standalone virt2phys allows using DPDK without a PMD and provides a
> reference implementation. UIO drivers might also implement the virt2phys
> interface, thus rendering this separate driver unneeded.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<Snip!>

> +
> +_Use_decl_annotations_
> +VOID
> +virt2phys_device_EvtIoInCallerContext(
> +	IN WDFDEVICE device, IN WDFREQUEST request)
> +{
> +	WDF_REQUEST_PARAMETERS params;
> +	ULONG code;
> +	PVOID *virt;

Should this be PVOID virt; (instead of PVOID *virt)?
If so, changes will be required to the parameters passed in to the
WdfRequestRetrieveInputBuffer() call.

> +	PHYSICAL_ADDRESS *phys;
> +	size_t size;
> +	NTSTATUS status;
> +
> +	UNREFERENCED_PARAMETER(device);
> +
> +	PAGED_CODE();
> +
> +	WDF_REQUEST_PARAMETERS_INIT(&params);
> +	WdfRequestGetParameters(request, &params);
> +
> +	if (params.Type != WdfRequestTypeDeviceControl) {
> +		KdPrint(("bogus request type=%u\n", params.Type));
> +		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
> +		return;
> +	}
> +
> +	code = params.Parameters.DeviceIoControl.IoControlCode;
> +	if (code != IOCTL_VIRT2PHYS_TRANSLATE) {
> +		KdPrint(("bogus IO control code=%lu\n", code));
> +		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
> +		return;
> +	}
> +
> +	status = WdfRequestRetrieveInputBuffer(
> +			request, sizeof(*virt), (PVOID *)&virt, &size);
> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfRequestRetrieveInputBuffer() failed, "
> +			"status=%08x\n", status));
> +		WdfRequestComplete(request, status);
> +		return;
> +	}
> +
> +	status = WdfRequestRetrieveOutputBuffer(
> +		request, sizeof(*phys), &phys, &size);

Better to put a (PVOID *)typecast for &phys here:
	status = WdfRequestRetrieveOutputBuffer(
		request, sizeof(*phys), (PVOID *)&phys, &size);

> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
> +			"status=%08x\n", status));
> +		WdfRequestComplete(request, status);
> +		return;
> +	}
> +
> +	*phys = MmGetPhysicalAddress(*virt);
> +
> +	WdfRequestCompleteWithInformation(
> +		request, STATUS_SUCCESS, sizeof(*phys));
> +}

<Snip!>

Co-installers are no longer required (and discouraged) as per Microsoft. 
So you can remove the lines indicated below from the .inf file.

> diff --git a/windows/virt2phys/virt2phys.inf b/windows/virt2phys/virt2phys.inf
> new file mode 100755
> index 0000000..e8adaac
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.inf
> @@ -0,0 +1,85 @@
> +; SPDX-License-Identifier: BSD-3-Clause
> +; Copyright (c) 2020 Dmitry Kozlyuk
> +
> +[Version]
> +Signature = "$WINDOWS NT$"
> +Class = %ClassName%
> +ClassGuid = {78A1C341-4539-11d3-B88D-00C04FAD5171}
> +Provider = %ManufacturerName%
> +CatalogFile = virt2phys.cat
> +DriverVer =
> +
> +[DestinationDirs]
> +DefaultDestDir = 12
> +virt2phys_Device_CoInstaller_CopyFiles = 11
Remove this line

> +
> +; ================= Class section =====================
> +
> +[ClassInstall32]
> +Addreg = virt2phys_ClassReg
> +
> +[virt2phys_ClassReg]
> +HKR,,,0,%ClassName%
> +HKR,,Icon,,-5
> +
> +[SourceDisksNames]
> +1 = %DiskName%,,,""
> +
> +[SourceDisksFiles]
> +virt2phys.sys  = 1,,
> +WdfCoInstaller$KMDFCOINSTALLERVERSION$.dll = 1
Remove this line

> +
> +;*****************************************
> +; Install Section
> +;*****************************************
> +
> +[Manufacturer]
> +%ManufacturerName%=Standard,NT$ARCH$
> +
> +[Standard.NT$ARCH$]
> +%virt2phys.DeviceDesc%=virt2phys_Device, Root\virt2phys
> +
> +[virt2phys_Device.NT]
> +CopyFiles = Drivers_Dir
> +
> +[Drivers_Dir]
> +virt2phys.sys
> +
> +;-------------- Service installation
> +[virt2phys_Device.NT.Services]
> +AddService = virt2phys,%SPSVCINST_ASSOCSERVICE%, virt2phys_Service_Inst
> +
> +; -------------- virt2phys driver install sections
> +[virt2phys_Service_Inst]
> +DisplayName    = %virt2phys.SVCDESC%
> +ServiceType    = 1 ; SERVICE_KERNEL_DRIVER
> +StartType      = 3 ; SERVICE_DEMAND_START
> +ErrorControl   = 1 ; SERVICE_ERROR_NORMAL
> +ServiceBinary  = %12%\virt2phys.sys
> +

Remove entire co-installer section below
> +;
> +;--- virt2phys_Device Coinstaller installation ------
> +;
> +
> +[virt2phys_Device.NT.CoInstallers]
> +AddReg = virt2phys_Device_CoInstaller_AddReg
> +CopyFiles = virt2phys_Device_CoInstaller_CopyFiles
> +
> +[virt2phys_Device_CoInstaller_AddReg]
> +HKR,,CoInstallers32,0x00010000, "WdfCoInstaller$KMDFCOINSTALLERVERSION$.dll,WdfCoInstaller"
> +
> +[virt2phys_Device_CoInstaller_CopyFiles]
> +WdfCoInstaller$KMDFCOINSTALLERVERSION$.dll
> +
Remove up to here

> +[virt2phys_Device.NT.Wdf]
> +KmdfService = virt2phys, virt2phys_wdfsect
> +[virt2phys_wdfsect]
> +KmdfLibraryVersion = $KMDFVERSION$
> +
> +[Strings]
> +SPSVCINST_ASSOCSERVICE = 0x00000002
> +ManufacturerName = "Dmitry Kozlyuk"
> +ClassName = "Kernel bypass"
> +DiskName = "virt2phys Installation Disk"
> +virt2phys.DeviceDesc = "Virtual to physical address translator"
> +virt2phys.SVCDESC = "virt2phys Service"

<Snip!>

ranjit m.

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-10  1:45   ` Ranjit Menon
@ 2020-04-10  2:50     ` Dmitry Kozlyuk
  2020-04-10  2:59       ` Dmitry Kozlyuk
  2020-04-10 19:39       ` Ranjit Menon
  0 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10  2:50 UTC (permalink / raw)
  To: Ranjit Menon; +Cc: dev, Dmitry Malloy (MESHCHANINOV), Narcisa Ana Maria Vasile

> > +
> > +_Use_decl_annotations_
> > +VOID
> > +virt2phys_device_EvtIoInCallerContext(
> > +	IN WDFDEVICE device, IN WDFREQUEST request)
> > +{
> > +	WDF_REQUEST_PARAMETERS params;
> > +	ULONG code;
> > +	PVOID *virt;  
> 
> Should this be PVOID virt; (instead of PVOID *virt)?
> If so, changes will be required to parameters passed in to
> WdfRequestRetrieveInputBuffer() call.

This should be PVOID *virt (a pointer to an untyped pointer). User-mode
passes a virtual address as a PVOID value; WdfRequestRetrieveInputBuffer()
fills virt with the address of that parameter, so that *virt is the virtual
address user-mode wants to translate into a physical one.
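
For reference, the user-mode side is symmetric; a minimal untested
sketch, where hDevice is a handle to the device interface located
via GUID_DEVINTERFACE_VIRT2PHYS:

	BYTE buffer[4096];
	PVOID virt = buffer;    /* virtual address to translate */
	PHYSICAL_ADDRESS phys;  /* translation result */
	DWORD bytes;

	DeviceIoControl(hDevice, IOCTL_VIRT2PHYS_TRANSLATE,
		&virt, sizeof(virt), &phys, sizeof(phys), &bytes, NULL);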

> 
> > +	PHYSICAL_ADDRESS *phys;
> > +	size_t size;
> > +	NTSTATUS status;
> > +
[snip]
> > +
> > +	status = WdfRequestRetrieveOutputBuffer(
> > +		request, sizeof(*phys), &phys, &size);  
> 
> Better to put a (PVOID *)typecast for &phys here:
> 	status = WdfRequestRetrieveOutputBuffer(
> 		request, sizeof(*phys), (PVOID *)&phys, &size); 

What do you mean? Without a typecast the built-in static analyzer emits a
warning (and all warnings are treated as errors for a driver):

virt2phys.c(108,46): error C2220: the following warning is treated as an error
virt2phys.c(108,46): warning C4047: 'function': 'PVOID *' differs in levels of indirection from 'PVOID **'
virt2phys.c(108,46): warning C4022: 'WdfRequestRetrieveInputBuffer': pointer mismatch for actual parameter 3
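
That is, the analyzer wants the input parameter exactly as in the patch:

	status = WdfRequestRetrieveInputBuffer(
			request, sizeof(*virt), (PVOID *)&virt, &size);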

> > +	if (!NT_SUCCESS(status)) {
> > +		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
> > +			"status=%08x\n", status));
> > +		WdfRequestComplete(request, status);
> > +		return;
> > +	}
> > +
> > +	*phys = MmGetPhysicalAddress(*virt);
> > +
> > +	WdfRequestCompleteWithInformation(
> > +		request, STATUS_SUCCESS, sizeof(*phys));
> > +}  
> 
> <Snip!>
> 
> Co-installers are no longer required (and discouraged) as per Microsoft. 
> So you can remove the lines indicated below from the .inf file.

Thanks, will remove in v2. They were generated by WDK solution template.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-10  2:50     ` Dmitry Kozlyuk
@ 2020-04-10  2:59       ` Dmitry Kozlyuk
  2020-04-10 19:39       ` Ranjit Menon
  1 sibling, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10  2:59 UTC (permalink / raw)
  To: Ranjit Menon; +Cc: dev, Dmitry Malloy (MESHCHANINOV), Narcisa Ana Maria Vasile

> >   
> > > +	PHYSICAL_ADDRESS *phys;
> > > +	size_t size;
> > > +	NTSTATUS status;
> > > +  
> [snip]
> > > +
> > > +	status = WdfRequestRetrieveOutputBuffer(
> > > +		request, sizeof(*phys), &phys, &size);    
> > 
> > Better to put a (PVOID *)typecast for &phys here:
> > 	status = WdfRequestRetrieveOutputBuffer(
> > 		request, sizeof(*phys), (PVOID *)&phys, &size);   
> 
> What do you mean? Without a typecast the built-in static analyzer emits a
> warning (and all warnings are treated as errors for a driver):
> 
> virt2phys.c(108,46): error C2220: the following warning is treated as an error
> virt2phys.c(108,46): warning C4047: 'function': 'PVOID *' differs in levels of indirection from 'PVOID **'
> virt2phys.c(108,46): warning C4022: 'WdfRequestRetrieveInputBuffer': pointer mismatch for actual parameter 3
> 

Never mind, I thought you were referring to the existing typecast in
WdfRequestRetrieveInputBuffer(). Thanks for the suggestion, will add in v2.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 00/10] eal: Windows basic memory management
  2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
                   ` (8 preceding siblings ...)
  2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 9/9] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-04-10 16:43 ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
                     ` (11 more replies)
  9 siblings, 12 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing with IOVA unavailable.


The first commit introduces a new kernel-mode driver, virt2phys.
It translates user-mode virtual addresses into physical addresses.
At the Windows community call on 2020-04-01 it was decided that this driver
can be used for now; later, netUIO may pick up its code/interface or not.


New EAL public functions for memory mapping are introduced
to mitigate OS differences in DPDK libraries and applications
(see the sketch after this list):

* rte_mem_map
* rte_mem_unmap
* rte_mem_lock
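
A sketch of intended usage (rte_mem_lock is declared as in this series;
the rte_mem_map/rte_mem_unmap signatures and the flag names below are
assumptions modeled on mmap/munmap):

	size_t size = 2 * 1024 * 1024;
	void *addr = rte_mem_map(NULL, size, RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
	if (addr == NULL)
		return -1;
	if (rte_mem_lock(addr, size) == 0) {
		/* The region is resident and will not be swapped out. */
	}
	rte_mem_unmap(addr, size);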

To support common MM routines, internal wrappers for low-level
memory reservation and file management are introduced. These changes
affect Linux and FreeBSD EAL. Shared code is placed under the /unix/
subdirectory (suggested by Thomas).

Also, entire <sys/queue.h> is imported from FreeBSD, replacing existing
partial import. There is already a license exception for this file.


Windows MM duplicates quite a lot of code from Linux EAL:

* eal_memalloc_alloc_seg_bulk
* eal_memalloc_free_seg_bulk
* calc_num_pages_per_socket
* rte_eal_hugepage_init

Perhaps this should be left as-is until Windows MM evolves into having
some specific requirements for these parts.


Notes on checkpatch warnings:

* No space after comma / no space before closing paren in macros---
  definitely a false positive, unclear how to suppress this.

* Issues from imported BSD code---probably should be ignored?

* Checkpatch is not run against dpdk-kmods (Windows drivers).

---

There were comments from Mellanox and Microsoft outside of this mailing list.
Still missing input from Anatoly Burakov.

v2:

    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to the /unix/ subdirectory, also factoring out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use the EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.

Dmitry Kozlyuk (9):
  eal/windows: do not expose private EAL facilities
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal: extract common code for memseg list initialization
  eal/windows: fix rte_page_sizes with Clang on Windows
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: implement basic memory management

 config/meson.build                            |   12 +-
 doc/guides/windows_gsg/build_dpdk.rst         |   20 -
 doc/guides/windows_gsg/index.rst              |    1 +
 doc/guides/windows_gsg/run_apps.rst           |   77 ++
 lib/librte_eal/common/eal_common_fbarray.c    |   57 +-
 lib/librte_eal/common/eal_common_memory.c     |  104 +-
 lib/librte_eal/common/eal_private.h           |  134 +-
 lib/librte_eal/common/malloc_heap.c           |    1 +
 lib/librte_eal/common/meson.build             |    9 +
 lib/librte_eal/freebsd/eal_memory.c           |   55 +-
 lib/librte_eal/include/rte_memory.h           |   74 ++
 lib/librte_eal/linux/eal_memory.c             |   68 +-
 lib/librte_eal/meson.build                    |    5 +
 lib/librte_eal/rte_eal_exports.def            |  119 ++
 lib/librte_eal/rte_eal_version.map            |    4 +
 lib/librte_eal/unix/eal.c                     |   47 +
 lib/librte_eal/unix/eal_memory.c              |  112 ++
 lib/librte_eal/unix/meson.build               |    7 +
 lib/librte_eal/windows/eal.c                  |  160 +++
 lib/librte_eal/windows/eal_hugepages.c        |  108 ++
 lib/librte_eal/windows/eal_lcore.c            |  187 +--
 lib/librte_eal/windows/eal_memalloc.c         |  423 ++++++
 lib/librte_eal/windows/eal_memory.c           | 1133 +++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               |  103 ++
 lib/librte_eal/windows/eal_thread.c           |    1 +
 lib/librte_eal/windows/eal_windows.h          |  129 ++
 lib/librte_eal/windows/include/meson.build    |    2 +
 lib/librte_eal/windows/include/pthread.h      |    2 +
 lib/librte_eal/windows/include/rte_os.h       |   48 +-
 .../windows/include/rte_virt2phys.h           |   34 +
 lib/librte_eal/windows/include/rte_windows.h  |   43 +
 lib/librte_eal/windows/include/sys/queue.h    |  663 +++++++++-
 lib/librte_eal/windows/include/unistd.h       |    3 +
 lib/librte_eal/windows/meson.build            |    4 +
 34 files changed, 3599 insertions(+), 350 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/unix/eal.c
 create mode 100644 lib/librte_eal/unix/eal_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/eal_windows.h
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h
 create mode 100644 lib/librte_eal/windows/include/rte_windows.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-13  5:32     ` Ranjit Menon
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
                     ` (10 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk

This driver supports Windows EAL memory management by translating
current process virtual addresses to physical addresses (IOVA).
Standalone virt2phys allows using DPDK without a PMD and provides a
reference implementation.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---

    Note: this patch is for dpdk-kmods tree.

 windows/README.rst                          |  92 ++++++++
 windows/virt2phys/virt2phys.c               | 129 +++++++++++
 windows/virt2phys/virt2phys.h               |  34 +++
 windows/virt2phys/virt2phys.inf             |  64 ++++++
 windows/virt2phys/virt2phys.sln             |  27 +++
 windows/virt2phys/virt2phys.vcxproj         | 228 ++++++++++++++++++++
 windows/virt2phys/virt2phys.vcxproj.filters |  36 ++++
 7 files changed, 610 insertions(+)
 create mode 100644 windows/README.rst
 create mode 100755 windows/virt2phys/virt2phys.c
 create mode 100755 windows/virt2phys/virt2phys.h
 create mode 100755 windows/virt2phys/virt2phys.inf
 create mode 100755 windows/virt2phys/virt2phys.sln
 create mode 100755 windows/virt2phys/virt2phys.vcxproj
 create mode 100755 windows/virt2phys/virt2phys.vcxproj.filters

diff --git a/windows/README.rst b/windows/README.rst
new file mode 100644
index 0000000..e30d4dc
--- /dev/null
+++ b/windows/README.rst
@@ -0,0 +1,92 @@
+Developing Windows Drivers
+==========================
+
+Prerequisites
+-------------
+
+Building Windows Drivers is only possible on Windows.
+
+1. Visual Studio 2019 Community or Professional Edition
+2. Windows Driver Kit (WDK) for Windows 10, version 1903
+
+Follow the official instructions to obtain all of the above:
+https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
+
+
+Build the Drivers
+-----------------
+
+Build from Visual Studio
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Open a solution (``*.sln``) with Visual Studio and build it (Ctrl+Shift+B).
+
+
+Build from Command-Line
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Run *Developer Command Prompt for VS 2019* from the Start menu.
+
+Navigate to the solution directory (with ``*.sln``), then run:
+
+.. code-block:: console
+
+    msbuild
+
+To build a particular combination of configuration and platform:
+
+.. code-block:: console
+
+    msbuild -p:Configuration=Debug;Platform=x64
+
+
+Install the Drivers
+-------------------
+
+Disable Driver Signature Enforcement
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, Windows prohibits installing and loading drivers without a `digital
+signature`_ obtained from Microsoft. For development, signature enforcement may
+be disabled as follows.
+
+In Elevated Command Prompt (from this point, sufficient privileges are
+assumed):
+
+.. code-block:: console
+
+    bcdedit -set loadoptions DISABLE_INTEGRITY_CHECKS
+    bcdedit -set TESTSIGNING ON
+    shutdown -r -t 0
+
+Upon reboot, an overlay message should appear on the desktop informing
+that Windows is in test mode, which means it allows loading unsigned drivers.
+
+.. _digital signature: https://docs.microsoft.com/en-us/windows-hardware/drivers/install/driver-signing
+
+
+Install, List, and Remove Drivers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The driver package is by default located in a subdirectory of its source tree,
+e.g. ``x64\Debug\virt2phys\virt2phys`` (note two levels of ``virt2phys``).
+
+To install the driver and bind associated devices to it:
+
+.. code-block:: console
+
+    pnputil /add-driver x64\Debug\virt2phys\virt2phys\virt2phys.inf /install
+
+A graphical confirmation to load an unsigned driver will still appear.
+
+To list installed drivers:
+
+.. code-block:: console
+
+    pnputil /enum-drivers
+
+To remove the driver package and to uninstall its devices:
+
+.. code-block:: console
+
+    pnputil /delete-driver oem2.inf /uninstall
diff --git a/windows/virt2phys/virt2phys.c b/windows/virt2phys/virt2phys.c
new file mode 100755
index 0000000..e157e9c
--- /dev/null
+++ b/windows/virt2phys/virt2phys.c
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <ntddk.h>
+#include <wdf.h>
+#include <wdmsec.h>
+#include <initguid.h>
+
+#include "virt2phys.h"
+
+DRIVER_INITIALIZE DriverEntry;
+EVT_WDF_DRIVER_DEVICE_ADD virt2phys_driver_EvtDeviceAdd;
+EVT_WDF_IO_IN_CALLER_CONTEXT virt2phys_device_EvtIoInCallerContext;
+
+NTSTATUS
+DriverEntry(
+	IN PDRIVER_OBJECT driver_object, IN PUNICODE_STRING registry_path)
+{
+	WDF_DRIVER_CONFIG config;
+	WDF_OBJECT_ATTRIBUTES attributes;
+	NTSTATUS status;
+
+	PAGED_CODE();
+
+	WDF_DRIVER_CONFIG_INIT(&config, virt2phys_driver_EvtDeviceAdd);
+	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
+	status = WdfDriverCreate(
+			driver_object, registry_path,
+			&attributes, &config, WDF_NO_HANDLE);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDriverCreate() failed, status=%08x\n", status));
+	}
+
+	return status;
+}
+
+_Use_decl_annotations_
+NTSTATUS
+virt2phys_driver_EvtDeviceAdd(
+	WDFDRIVER driver, PWDFDEVICE_INIT init)
+{
+	WDF_OBJECT_ATTRIBUTES attributes;
+	WDFDEVICE device;
+	NTSTATUS status;
+
+	UNREFERENCED_PARAMETER(driver);
+
+	PAGED_CODE();
+
+	WdfDeviceInitSetIoType(
+		init, WdfDeviceIoNeither);
+	WdfDeviceInitSetIoInCallerContextCallback(
+		init, virt2phys_device_EvtIoInCallerContext);
+
+	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
+
+	status = WdfDeviceCreate(&init, &attributes, &device);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDeviceCreate() failed, status=%08x\n", status));
+		return status;
+	}
+
+	status = WdfDeviceCreateDeviceInterface(
+			device, &GUID_DEVINTERFACE_VIRT2PHYS, NULL);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDeviceCreateDeviceInterface() failed, "
+			"status=%08x\n", status));
+		return status;
+	}
+
+	return STATUS_SUCCESS;
+}
+
+_Use_decl_annotations_
+VOID
+virt2phys_device_EvtIoInCallerContext(
+	IN WDFDEVICE device, IN WDFREQUEST request)
+{
+	WDF_REQUEST_PARAMETERS params;
+	ULONG code;
+	PVOID *virt;
+	PHYSICAL_ADDRESS *phys;
+	size_t size;
+	NTSTATUS status;
+
+	UNREFERENCED_PARAMETER(device);
+
+	PAGED_CODE();
+
+	WDF_REQUEST_PARAMETERS_INIT(&params);
+	WdfRequestGetParameters(request, &params);
+
+	if (params.Type != WdfRequestTypeDeviceControl) {
+		KdPrint(("bogus request type=%u\n", params.Type));
+		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
+		return;
+	}
+
+	code = params.Parameters.DeviceIoControl.IoControlCode;
+	if (code != IOCTL_VIRT2PHYS_TRANSLATE) {
+		KdPrint(("bogus IO control code=%lu\n", code));
+		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
+		return;
+	}
+
+	status = WdfRequestRetrieveInputBuffer(
+			request, sizeof(*virt), (PVOID *)&virt, &size);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfRequestRetrieveInputBuffer() failed, "
+			"status=%08x\n", status));
+		WdfRequestComplete(request, status);
+		return;
+	}
+
+	status = WdfRequestRetrieveOutputBuffer(
+		request, sizeof(*phys), (PVOID *)&phys, &size);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
+			"status=%08x\n", status));
+		WdfRequestComplete(request, status);
+		return;
+	}
+
+	*phys = MmGetPhysicalAddress(*virt);
+
+	WdfRequestCompleteWithInformation(
+		request, STATUS_SUCCESS, sizeof(*phys));
+}
diff --git a/windows/virt2phys/virt2phys.h b/windows/virt2phys/virt2phys.h
new file mode 100755
index 0000000..4bb2b4a
--- /dev/null
+++ b/windows/virt2phys/virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
diff --git a/windows/virt2phys/virt2phys.inf b/windows/virt2phys/virt2phys.inf
new file mode 100755
index 0000000..e35765e
--- /dev/null
+++ b/windows/virt2phys/virt2phys.inf
@@ -0,0 +1,64 @@
+; SPDX-License-Identifier: BSD-3-Clause
+; Copyright (c) 2020 Dmitry Kozlyuk
+
+[Version]
+Signature = "$WINDOWS NT$"
+Class = %ClassName%
+ClassGuid = {78A1C341-4539-11d3-B88D-00C04FAD5171}
+Provider = %ManufacturerName%
+CatalogFile = virt2phys.cat
+DriverVer =
+
+[DestinationDirs]
+DefaultDestDir = 12
+
+; ================= Class section =====================
+
+[ClassInstall32]
+Addreg = virt2phys_ClassReg
+
+[virt2phys_ClassReg]
+HKR,,,0,%ClassName%
+HKR,,Icon,,-5
+
+[SourceDisksNames]
+1 = %DiskName%,,,""
+
+[SourceDisksFiles]
+virt2phys.sys  = 1,,
+
+;*****************************************
+; Install Section
+;*****************************************
+
+[Manufacturer]
+%ManufacturerName%=Standard,NT$ARCH$
+
+[Standard.NT$ARCH$]
+%virt2phys.DeviceDesc%=virt2phys_Device, Root\virt2phys
+
+[virt2phys_Device.NT]
+CopyFiles = Drivers_Dir
+
+[Drivers_Dir]
+virt2phys.sys
+
+;-------------- Service installation
+[virt2phys_Device.NT.Services]
+AddService = virt2phys,%SPSVCINST_ASSOCSERVICE%, virt2phys_Service_Inst
+
+; -------------- virt2phys driver install sections
+[virt2phys_Service_Inst]
+DisplayName    = %virt2phys.SVCDESC%
+ServiceType    = 1 ; SERVICE_KERNEL_DRIVER
+StartType      = 3 ; SERVICE_DEMAND_START
+ErrorControl   = 1 ; SERVICE_ERROR_NORMAL
+ServiceBinary  = %12%\virt2phys.sys
+
+[Strings]
+SPSVCINST_ASSOCSERVICE = 0x00000002
+ManufacturerName = "Dmitry Kozlyuk"
+ClassName = "Kernel bypass"
+DiskName = "virt2phys Installation Disk"
+virt2phys.DeviceDesc = "Virtual to physical address translator"
+virt2phys.SVCDESC = "virt2phys Service"
diff --git a/windows/virt2phys/virt2phys.sln b/windows/virt2phys/virt2phys.sln
new file mode 100755
index 0000000..0f5ecdc
--- /dev/null
+++ b/windows/virt2phys/virt2phys.sln
@@ -0,0 +1,27 @@
+
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio Version 16
+VisualStudioVersion = 16.0.29613.14
+MinimumVisualStudioVersion = 10.0.40219.1
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "virt2phys", "virt2phys.vcxproj", "{0EEF826B-9391-43A8-A722-BDD6F6115137}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|x64 = Debug|x64
+		Release|x64 = Release|x64
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.ActiveCfg = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Build.0 = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Deploy.0 = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.ActiveCfg = Release|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Build.0 = Release|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Deploy.0 = Release|x64
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+	GlobalSection(ExtensibilityGlobals) = postSolution
+		SolutionGuid = {845012FB-4471-4A12-A1C4-FF7E05C40E8E}
+	EndGlobalSection
+EndGlobal
diff --git a/windows/virt2phys/virt2phys.vcxproj b/windows/virt2phys/virt2phys.vcxproj
new file mode 100755
index 0000000..fa51916
--- /dev/null
+++ b/windows/virt2phys/virt2phys.vcxproj
@@ -0,0 +1,228 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup Label="ProjectConfigurations">
+    <ProjectConfiguration Include="Debug|Win32">
+      <Configuration>Debug</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|Win32">
+      <Configuration>Release</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|ARM">
+      <Configuration>Debug</Configuration>
+      <Platform>ARM</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|ARM">
+      <Configuration>Release</Configuration>
+      <Platform>ARM</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|ARM64">
+      <Configuration>Debug</Configuration>
+      <Platform>ARM64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|ARM64">
+      <Configuration>Release</Configuration>
+      <Platform>ARM64</Platform>
+    </ProjectConfiguration>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="virt2phys.c" />
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="virt2phys.h" />
+  </ItemGroup>
+  <ItemGroup>
+    <Inf Include="virt2phys.inf" />
+  </ItemGroup>
+  <PropertyGroup Label="Globals">
+    <ProjectGuid>{0EEF826B-9391-43A8-A722-BDD6F6115137}</ProjectGuid>
+    <TemplateGuid>{497e31cb-056b-4f31-abb8-447fd55ee5a5}</TemplateGuid>
+    <TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
+    <MinimumVisualStudioVersion>12.0</MinimumVisualStudioVersion>
+    <Configuration>Debug</Configuration>
+    <Platform Condition="'$(Platform)' == ''">Win32</Platform>
+    <RootNamespace>virt2phys</RootNamespace>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
+  <ImportGroup Label="ExtensionSettings">
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <PropertyGroup Label="UserMacros" />
+  <PropertyGroup />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <ClCompile>
+      <WppEnabled>false</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+    <Link>
+      <AdditionalDependencies>$(DDK_LIB_PATH)wdmsec.lib;%(AdditionalDependencies)</AdditionalDependencies>
+    </Link>
+    <Inf>
+      <TimeStamp>0.1</TimeStamp>
+    </Inf>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemGroup>
+    <FilesToPackage Include="$(TargetPath)" />
+  </ItemGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
+  <ImportGroup Label="ExtensionTargets">
+  </ImportGroup>
+</Project>
\ No newline at end of file
diff --git a/windows/virt2phys/virt2phys.vcxproj.filters b/windows/virt2phys/virt2phys.vcxproj.filters
new file mode 100755
index 0000000..0fe65fc
--- /dev/null
+++ b/windows/virt2phys/virt2phys.vcxproj.filters
@@ -0,0 +1,36 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup>
+    <Filter Include="Source Files">
+      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
+      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
+    </Filter>
+    <Filter Include="Header Files">
+      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>
+      <Extensions>h;hpp;hxx;hm;inl;inc;xsd</Extensions>
+    </Filter>
+    <Filter Include="Resource Files">
+      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>
+      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms</Extensions>
+    </Filter>
+    <Filter Include="Driver Files">
+      <UniqueIdentifier>{8E41214B-6785-4CFE-B992-037D68949A14}</UniqueIdentifier>
+      <Extensions>inf;inv;inx;mof;mc;</Extensions>
+    </Filter>
+  </ItemGroup>
+  <ItemGroup>
+    <Inf Include="virt2phys.inf">
+      <Filter>Driver Files</Filter>
+    </Inf>
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="virt2phys.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="virt2phys.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+  </ItemGroup>
+</Project>
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 02/10] eal/windows: do not expose private EAL facilities
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 03/10] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Anand Rawat, Thomas Monjalon

The goal of rte_os.h is to mitigate OS differences for EAL users.
In Windows EAL, however, rte_os.h went beyond this goal:

1. It included platform SDK headers (windows.h, etc). Those files are
   huge, require specific inclusion order, and are generally unused by
   the code including rte_os.h. Declarations from platform SDK may
   break otherwise platform-independent code, e.g. min, max, ERROR.

2. It included pthread.h, which is clearly not always required.

3. It defined functions private to Windows EAL.

Reorganize Windows EAL includes in the following way:

1. Create rte_windows.h to properly import Windows-specific facilities.
   Primary users are bus drivers, tests, and external applications.

2. Remove platform SDK includes from rte_os.h to prevent breaking
   otherwise portable code by including rte_os.h on Windows.
   Copy necessary definitions to avoid including those headers.

3. Remove pthread.h include from rte_os.h.

4. Move declarations private to Windows EAL into eal_windows.h.
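
For illustration (a sketch, not part of the patch), a Windows-only
EAL source is expected to pick up headers as follows after this change:

    #include <rte_os.h>       /* portable definitions only, no windows.h */
    #include <rte_windows.h>  /* platform SDK in compatible order */
    #include "eal_windows.h"  /* facilities private to Windows EAL */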

Fixes: 428eb983f5f7 ("eal: add OS specific header file")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal.c                 |  2 +
 lib/librte_eal/windows/eal_lcore.c           |  2 +
 lib/librte_eal/windows/eal_thread.c          |  1 +
 lib/librte_eal/windows/eal_windows.h         | 29 +++++++++++++
 lib/librte_eal/windows/include/meson.build   |  1 +
 lib/librte_eal/windows/include/pthread.h     |  2 +
 lib/librte_eal/windows/include/rte_os.h      | 44 ++++++--------------
 lib/librte_eal/windows/include/rte_windows.h | 41 ++++++++++++++++++
 8 files changed, 91 insertions(+), 31 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_windows.h
 create mode 100644 lib/librte_eal/windows/include/rte_windows.h

diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e4b50df3b..2cf7a04ef 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -18,6 +18,8 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_windows.h"
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index b3a6c63af..82ee45413 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -2,12 +2,14 @@
  * Copyright(c) 2019 Intel Corporation
  */
 
+#include <pthread.h>
 #include <stdint.h>
 
 #include <rte_common.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_windows.h"
 
 /* global data structure that contains the CPU map */
 static struct _wcpu_map {
diff --git a/lib/librte_eal/windows/eal_thread.c b/lib/librte_eal/windows/eal_thread.c
index 9e4bbaa08..e149199a6 100644
--- a/lib/librte_eal/windows/eal_thread.c
+++ b/lib/librte_eal/windows/eal_thread.c
@@ -14,6 +14,7 @@
 #include <eal_thread.h>
 
 #include "eal_private.h"
+#include "eal_windows.h"
 
 RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
 RTE_DEFINE_PER_LCORE(unsigned int, _socket_id) = (unsigned int)SOCKET_ID_ANY;
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
new file mode 100644
index 000000000..fadd676b2
--- /dev/null
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#ifndef _EAL_WINDOWS_H_
+#define _EAL_WINDOWS_H_
+
+/**
+ * @file Facilities private to Windows EAL
+ */
+
+#include <rte_windows.h>
+
+/**
+ * Create a map of processors and cores on the system.
+ */
+void eal_create_cpu_map(void);
+
+/**
+ * Create a thread.
+ *
+ * @param thread
+ *   The location to store the thread id if successful.
+ * @return
+ *   0 for success, -1 if the thread is not created.
+ */
+int eal_thread_create(pthread_t *thread);
+
+#endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 7d18dd52f..5fb1962ac 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,4 +5,5 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/pthread.h b/lib/librte_eal/windows/include/pthread.h
index b9dd18e56..cfd53f0b8 100644
--- a/lib/librte_eal/windows/include/pthread.h
+++ b/lib/librte_eal/windows/include/pthread.h
@@ -5,6 +5,8 @@
 #ifndef _PTHREAD_H_
 #define _PTHREAD_H_
 
+#include <stdint.h>
+
 /**
  * This file is required to support the common code in eal_common_proc.c,
  * eal_common_thread.c and common\include\rte_per_lcore.h as Microsoft libc
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index e1e0378e6..510e39e03 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -8,20 +8,18 @@
 /**
  * This is header should contain any function/macro definition
  * which are not supported natively or named differently in the
- * Windows OS. Functions will be added in future releases.
+ * Windows OS. It must not include Windows-specific headers.
  */
 
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-#include <windows.h>
-#include <basetsd.h>
-#include <pthread.h>
-#include <stdio.h>
-
-/* limits.h replacement */
-#include <stdlib.h>
+/* limits.h replacement, value as in <windows.h> */
 #ifndef PATH_MAX
 #define PATH_MAX _MAX_PATH
 #endif
@@ -31,8 +29,6 @@ extern "C" {
 /* strdup is deprecated in Microsoft libc and _strdup is preferred */
 #define strdup(str) _strdup(str)
 
-typedef SSIZE_T ssize_t;
-
 #define strtok_r(str, delim, saveptr) strtok_s(str, delim, saveptr)
 
 #define index(a, b)     strchr(a, b)
@@ -40,22 +36,14 @@ typedef SSIZE_T ssize_t;
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
-/**
- * Create a thread.
- * This function is private to EAL.
- *
- * @param thread
- *   The location to store the thread id if successful.
- * @return
- *   0 for success, -1 if the thread is not created.
- */
-int eal_thread_create(pthread_t *thread);
+/* cpu_set macros implementation */
+#define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
+#define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
+#define RTE_CPU_FILL(set) CPU_FILL(set)
+#define RTE_CPU_NOT(dst, src) CPU_NOT(dst, src)
 
-/**
- * Create a map of processors and cores on the system.
- * This function is private to EAL.
- */
-void eal_create_cpu_map(void);
+/* as in <windows.h> */
+typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
 static inline int
@@ -86,12 +74,6 @@ asprintf(char **buffer, const char *format, ...)
 }
 #endif /* RTE_TOOLCHAIN_GCC */
 
-/* cpu_set macros implementation */
-#define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
-#define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
-#define RTE_CPU_FILL(set) CPU_FILL(set)
-#define RTE_CPU_NOT(dst, src) CPU_NOT(dst, src)
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
new file mode 100644
index 000000000..ed6e4c148
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#ifndef _RTE_WINDOWS_H_
+#define _RTE_WINDOWS_H_
+
+/**
+ * @file Windows-specific facilities
+ *
+ * This file should be included by DPDK libraries and applications
+ * that need access to Windows API. It includes platform SDK headers
+ * in a compatible order with proper options, and defines error-handling macros.
+ */
+
+/* Disable excessive libraries. */
+#ifndef WIN32_LEAN_AND_MEAN
+#define WIN32_LEAN_AND_MEAN
+#endif
+
+/* Must come first. */
+#include <windows.h>
+
+#include <basetsd.h>
+#include <psapi.h>
+
+/* Have GUIDs defined. */
+#ifndef INITGUID
+#define INITGUID
+#endif
+#include <initguid.h>
+
+/**
+ * Log GetLastError() with context, usually a Win32 API function and arguments.
+ */
+#define RTE_LOG_WIN32_ERR(...) \
+	RTE_LOG(DEBUG, EAL, RTE_FMT("GetLastError()=%lu: " \
+		RTE_FMT_HEAD(__VA_ARGS__,) "\n", GetLastError(), \
+		RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#endif /* _RTE_WINDOWS_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 03/10] eal/windows: improve CPU and NUMA node detection
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 04/10] eal/windows: initialize hugepage info Dmitry Kozlyuk
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Jeff Shaw, Anand Rawat

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add EAL private function to map DPDK socket ID to NUMA node number.
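
For illustration (a sketch of the arithmetic used below; the
EAL_PROCESSOR_GROUP_SIZE constant introduced by this patch is 64
on 64-bit Windows, where KAFFINITY is a 64-bit mask):

    /* logical processor at mask bit 5 of processor group 1 */
    unsigned int core_id = 1 * EAL_PROCESSOR_GROUP_SIZE + 5; /* == 69 */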

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				error);
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e. g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited by 32/64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 04/10] eal/windows: initialize hugepage info
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (2 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 03/10] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic

Add hugepages discovery ("large pages" in Windows terminology)
and update documentation for required privilege setup.
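
For illustration (a sketch, not part of the patch), discovery relies
on the Win32 API reporting the minimum large page size:

    SIZE_T large_page_size = GetLargePageMinimum();
    if (large_page_size == 0) {
        /* large pages are not supported on this system */
    }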

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 7 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/config/meson.build b/config/meson.build
index 58421342b..4607655d9 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -263,6 +263,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminology) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap-in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. The privilege is applied upon next logon. In particular, if it has been
+   granted to the current user, a logoff is required before it takes effect.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 2cf7a04ef..63461f51a 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -18,8 +18,11 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -242,6 +245,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..b099d13f9
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available in Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node is available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem in Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 09dd4ab2f..5f118bfe2 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_thread.c',
 	'getopt.c',
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 05/10] eal: introduce internal wrappers for file operations
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (3 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 04/10] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon

EAL common code uses file locking and truncation. Introduce
OS-independent wrappers in order to support both Linux/FreeBSD
and Windows:

* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

Wrappers follow POSIX semantics, but the interface is deliberately
not POSIX, so that it can be made cleaner, e.g. by not mixing the
locking operation with the behavior on conflict.
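
For illustration (a sketch, not part of the patch; fd is assumed to be
an open descriptor), a non-blocking exclusive lock attempt looks like:

    if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0) {
        if (rte_errno == EWOULDBLOCK) {
            /* the file is locked by another process */
        }
    }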

The implementation for Linux and FreeBSD is placed in the "unix"
subdirectory, which is intended for code common to the two. Files should
be named after the OS-subdirectory files from which the code is factored.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_private.h | 45 ++++++++++++++++
 lib/librte_eal/meson.build          |  4 ++
 lib/librte_eal/unix/eal.c           | 47 ++++++++++++++++
 lib/librte_eal/unix/meson.build     |  6 +++
 lib/librte_eal/windows/eal.c        | 83 +++++++++++++++++++++++++++++
 5 files changed, 185 insertions(+)
 create mode 100644 lib/librte_eal/unix/eal.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index ddcfbe2e4..65d61ff13 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -443,4 +443,49 @@ rte_option_usage(void);
 uint64_t
 eal_get_baseaddr(void);
 
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 9d219a0e6..1f89efb88 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal.c b/lib/librte_eal/unix/eal.c
new file mode 100644
index 000000000..a337b59b1
--- /dev/null
+++ b/lib/librte_eal/unix/eal.c
@@ -0,0 +1,47 @@
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..13564838e
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal.c',
+)
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 63461f51a..9dba895e7 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -224,6 +224,89 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	/* SetFilePointer() only moves the file pointer;
+	 * SetEndOfFile() actually resizes the file.
+	 */
+	if (!SetEndOfFile(handle)) {
+		RTE_LOG_WIN32_ERR("SetEndOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
+
  /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 06/10] eal: introduce memory management wrappers
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (4 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-13  7:50     ` Tal Shnaiderman
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
                     ` (5 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Anatoly Burakov,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

System memory management is implemented differently for POSIX and
Windows. Introduce wrapper functions for operations used across DPDK:

* rte_mem_map()
  Create memory mapping for a regular file or a page file (swap).
  This supports mapping to a reserved memory region even on Windows.

* rte_mem_unmap()
  Remove mapping created with rte_mem_map().

* rte_get_page_size()
  Obtain default system page size.

* rte_mem_lock()
  Make arbitrary-sized memory region non-swappable.

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive.
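
For illustration (a sketch, not part of the patch), an anonymous
private mapping of one page is created and destroyed as follows:

    size_t size = rte_get_page_size();
    void *addr = rte_mem_map(NULL, size, RTE_PROT_READ | RTE_PROT_WRITE,
        RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
    if (addr != NULL)
        rte_mem_unmap(addr, size);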

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                   |  10 +-
 lib/librte_eal/common/eal_private.h  |  51 +++-
 lib/librte_eal/include/rte_memory.h  |  68 +++++
 lib/librte_eal/rte_eal_exports.def   |   4 +
 lib/librte_eal/rte_eal_version.map   |   4 +
 lib/librte_eal/unix/eal_memory.c     | 112 +++++++
 lib/librte_eal/unix/meson.build      |   1 +
 lib/librte_eal/windows/eal.c         |   6 +
 lib/librte_eal/windows/eal_memory.c  | 433 +++++++++++++++++++++++++++
 lib/librte_eal/windows/eal_windows.h |  67 +++++
 lib/librte_eal/windows/meson.build   |   1 +
 11 files changed, 753 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c

diff --git a/config/meson.build b/config/meson.build
index 4607655d9..bceb5ef7b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -256,14 +256,20 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it from advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
 	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 65d61ff13..1e89338f2 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,16 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/** Reserve hugepages (support may be limited or missing). */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/** Fail if requested address is not available. */
+	EAL_RESERVE_EXACT_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -232,8 +243,8 @@ int rte_eal_check_module(const char *module_name);
 #define EAL_VIRTUAL_AREA_UNMAP (1 << 2)
 /**< immediately unmap reserved virtual area. */
 void *
-eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
+	int flags, int mmap_flags);
 
 /**
  * Get cpu core_id.
@@ -488,4 +499,40 @@ int eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
  */
 int eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address. The system may not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options.
+ * @returns
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *eal_mem_reserve(void *requested_addr, size_t size,
+	enum eal_mem_reserve_flags flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If @code virt @endcode and @code size @endcode describe a part of the
+ * reserved region, only this part of the region is freed (with system
+ * page size granularity). If @code virt @endcode points to allocated
+ * memory, @code size @endcode must match the size specified on allocation.
+ * The behavior is undefined if the memory pointed to by @code virt @endcode
+ * was obtained from a source other than those listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void eal_mem_free(void *virt, size_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..1b7c3e5df 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -85,6 +85,74 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * Memory protection flags.
+ */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,   /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/**
+ * Memory mapping additional flags.
+ *
+ * In Linux and FreeBSD, each flag is semantically equivalent
+ * to the OS-specific mmap(3) flag with the same or similar name.
+ * In Windows, POSIX and MAP_ANONYMOUS semantics are followed.
+ */
+enum rte_map_flags {
+	/** Changes of mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/** Fail if requested address cannot be taken. */
+	RTE_MAP_FIXED = 1 << 3
+};
+
+/**
+ * OS-independent implementation of POSIX mmap(3)
+ * with MAP_ANONYMOUS Linux/FreeBSD extension.
+ */
+__rte_experimental
+void *rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_experimental
+int rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Positive page size in bytes.
+ */
+__rte_experimental
+int rte_get_page_size(void);
+
+/**
+ * Lock region in physical memory and prevent it from swapping.
+ *
+ * @param virt
+ *   The virtual address.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @note Implementations may require @p virt and @p size to be multiples
+ *       of system page size.
+ * @see rte_get_page_size()
+ * @see rte_mem_lock_page()
+ */
+__rte_experimental
+int rte_mem_lock(const void *virt, size_t size);
+
 /**
  * Lock page in physical memory and prevent from swapping.
  *
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..bacf9a107 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -5,5 +5,9 @@ EXPORTS
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
 	rte_eal_remote_launch
+	rte_get_page_size
 	rte_log
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_unmap
 	rte_vlog
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index f9ede5b41..07128898f 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -337,5 +337,9 @@ EXPERIMENTAL {
 	rte_thread_is_intr;
 
 	# added in 20.05
+	rte_get_page_size;
 	rte_log_can_log;
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_unmap;
 };
diff --git a/lib/librte_eal/unix/eal_memory.c b/lib/librte_eal/unix/eal_memory.c
new file mode 100644
index 000000000..312560b49
--- /dev/null
+++ b/lib/librte_eal/unix/eal_memory.c
@@ -0,0 +1,112 @@
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(ERR, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum eal_mem_reserve_flags flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+#ifdef MAP_HUGETLB
+	if (flags & EAL_RESERVE_HUGEPAGES)
+		sys_flags |= MAP_HUGETLB;
+#endif
+	if (flags & EAL_RESERVE_EXACT_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+static int
+mem_rte_to_sys_prot(enum rte_mem_prot prot)
+{
+	int sys_prot = 0;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	int sys_prot = 0;
+	int sys_flags = 0;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FIXED)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+int
+rte_get_page_size(void)
+{
+	return getpagesize();
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	return mlock(virt, size);
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index 13564838e..50c019a56 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal.c',
+	'eal_memory.c',
 )
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 9dba895e7..cf55b56da 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -339,6 +339,12 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..59606d84c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,433 @@
+#include <io.h>
+
+#include <rte_errno.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterMax,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly after the function, so that code using it does not
+ * depend on whether it is resolved at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	OSVERSIONINFO info;
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	/* IsWindows10OrGreater() may also be unavailable. */
+	memset(&info, 0, sizeof(info));
+	info.dwOSVersionInfoSize = sizeof(info);
+	GetVersionEx(&info);
+
+	/* Checking for Windows 10+ will also detect Windows Server 2016+.
+	 * Do not abort, because Windows may report false version depending
+	 * on executable manifest, compatibility mode, etc.
+	 */
+	if (info.dwMajorVersion < 10)
+		RTE_LOG(DEBUG, EAL, "Windows 10+ or Windows Server 2016+ "
+			"is required for advanced memory features\n");
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* no VirtualAlloc2() */
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static int
+win32_alloc_error_to_errno(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		return 0;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		return ENOMEM;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		return EINVAL;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum eal_mem_reserve_flags flags)
+{
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		RTE_LOG(ERR, EAL, "Hugepage reservation is not supported\n");
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		rte_errno = win32_alloc_error_to_errno(GetLastError());
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_EXACT_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFree(virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFree()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc(size_t size, enum rte_page_sizes page_size)
+{
+	if (page_size != 0)
+		return eal_mem_alloc_socket(size, SOCKET_ID_ANY);
+
+	return VirtualAlloc(
+		NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void*
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+		if (VirtualQuery(requested_addr, &info, sizeof(info)) == 0) {
+			RTE_LOG_WIN32_ERR("VirtualQuery()");
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) &&
+			!VirtualFree(requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+				"<split placeholder>)", requested_addr, size);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	if (requested_addr != NULL)
+		flags |= MEM_REPLACE_PLACEHOLDER;
+
+	addr = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, "
+			"<replace placeholder>)", requested_addr, size);
+		rte_errno = win32_alloc_error_to_errno(err);
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, ...)", addr, size);
+		return -1;
+	}
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if region must be in reserved state but it is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	if (VirtualQuery(addr, &info, sizeof(info)) == 0) {
+		RTE_LOG_WIN32_ERR("VirtualQuery()");
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFree(addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+			"MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)", addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFree(addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
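+/* A minimal usage sketch of the mapping API below, mirroring POSIX mmap()
+ * (illustrative only; an anonymous private mapping):
+ *
+ *	void *va = rte_mem_map(NULL, size, RTE_PROT_READ | RTE_PROT_WRITE,
+ *		RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
+ *	...
+ *	rte_mem_unmap(va, size);
+ */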
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* TODO: there is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			if (!CloseHandle(mapping_handle))
+				RTE_LOG_WIN32_ERR("CloseHandle()");
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (virt == NULL) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		if (!CloseHandle(mapping_handle))
+			RTE_LOG_WIN32_ERR("CloseHandle()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FIXED) && (virt != requested_addr)) {
+		BOOL ret = UnmapViewOfFile(virt);
+		virt = NULL;
+		if (!ret)
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		rte_errno = GetLastError();
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		return -1;
+	}
+	return 0;
+}
+
+int
+rte_get_page_size(void)
+{
+	SYSTEM_INFO info;
+	GetSystemInfo(&info);
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes a non-const void *; cast to avoid a warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock()");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..b202a1aa5 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -36,4 +36,80 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate a contiguous chunk of virtual memory.
+ *
+ * Use eal_mem_free() to free allocated memory.
+ *
+ * @param size
+ *  Number of bytes to allocate.
+ * @param page_size
+ *  If non-zero, memory must be allocated in hugepages of the specified
+ *  size. The @code size @endcode parameter must then be a multiple
+ *  of this page size.
+ * @return
+ *  Address of allocated memory or NULL on failure (rte_errno is set).
+ */
+void *eal_mem_alloc(size_t size, enum rte_page_sizes page_size);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with @ref eal_mem_reserve()
+ * or decommitted from hugepages by @ref eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and @code rte_errno @endcode is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit.
+ *
+ * The @code addr @endcode and @code size @endcode must match the
+ * location and size of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
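+
+/* Illustrative lifecycle of the reserve/commit API above (a sketch,
+ * not a strict contract):
+ *
+ *	void *addr = eal_mem_reserve(NULL, size, 0);
+ *	addr = eal_mem_commit(addr, size, SOCKET_ID_ANY);
+ *	eal_mem_decommit(addr, size);
+ *	eal_mem_free(addr, size);
+ */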
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 5f118bfe2..81d3ee095 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -8,6 +8,7 @@ sources += files(
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memory.c',
 	'eal_thread.c',
 	'getopt.c',
 )
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 07/10] eal: extract common code for memseg list initialization
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (5 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov, Bruce Richardson

All supported OSes create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move the common code into EAL private
functions to reduce duplication.
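
After this change, an OS-specific layer allocates an MSL and reserves VA
space for it roughly as follows (a sketch of the intended usage; error
handling elided):

    if (eal_alloc_memseg_list(msl, page_sz, n_segs, socket_id,
            type_msl_idx, true) < 0)
        return -1;
    if (eal_reserve_memseg_list(msl, 0) < 0)
        return -1;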

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c | 54 ++++++++++++++++++
 lib/librte_eal/common/eal_private.h       | 34 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       | 54 +++---------------
 lib/librte_eal/linux/eal_memory.c         | 68 +++++------------------
 4 files changed, 110 insertions(+), 100 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index cc7d54e0c..d9764681a 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -25,6 +25,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -182,6 +183,59 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_reserve_memseg_list(struct rte_memseg_list *msl,
+		enum eal_mem_reserve_flags flags)
+{
+	uint64_t page_sz;
+	size_t mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
+	if (addr == NULL) {
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	return 0;
+}
+
+int
+eal_alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
+			(size_t)page_sz >> 10, socket_id);
+
+	return 0;
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 1e89338f2..76938e379 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -246,6 +246,42 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
 	int flags, int mmap_flags);
 
+/**
+ * Reserve VA space for a memory segment list.
+ *
+ * @param msl
+ *  Memory segment list with page size defined.
+ * @param flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_reserve_memseg_list(struct rte_memseg_list *msl,
+	enum eal_mem_reserve_flags flags);
+
+/**
+ * Initialize a memory segment list with its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index a97d8f0f0..5174f9cd0 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -336,61 +336,23 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_alloc_memseg_list(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_reserve(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
+	enum eal_mem_reserve_flags flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_reserve_memseg_list(msl, flags);
 }
 
 
@@ -479,7 +441,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_alloc(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..a01a7ce76 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_alloc_memseg_list(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_reserve(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_reserve_memseg_list(msl, 0);
 }
 
 /*
@@ -1009,12 +969,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_alloc(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_reserve(msl) < 0)
 				return -1;
 		}
 	}
@@ -2191,7 +2151,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_alloc(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2160,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_reserve(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2355,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_alloc(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_reserve(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2393,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_reserve(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (6 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov

Clang on Windows follows the MS ABI, where enum values are limited to
2^31-1. Enum rte_page_sizes has members valued above this limit, which
get wrapped to zero, resulting in a compilation error (duplicate values
in the enum). Using the MS ABI is mandatory for Windows EAL to call
Win32 APIs.

Define these values outside of the enum for Clang on Windows only.
This does not affect runtime, because Windows doesn't run on machines
with 4GiB and 16GiB hugepages.
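
For illustration, a minimal reproduction of the problem (names are
hypothetical):

    enum big_sizes {
        SZ_4G  = 1ULL << 32, /* truncated to 0 under MS ABI */
        SZ_16G = 1ULL << 34  /* also 0: duplicate enum value */
    };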

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/include/rte_memory.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 1b7c3e5df..3ec673f51 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -34,8 +34,14 @@ enum rte_page_sizes {
 	RTE_PGSIZE_256M  = 1ULL << 28,
 	RTE_PGSIZE_512M  = 1ULL << 29,
 	RTE_PGSIZE_1G    = 1ULL << 30,
+/* Work around Clang on Windows being limited to 32-bit underlying type. */
+#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)
 	RTE_PGSIZE_4G    = 1ULL << 32,
 	RTE_PGSIZE_16G   = 1ULL << 34,
+#else
+#define RTE_PGSIZE_4G  (1ULL << 32)
+#define RTE_PGSIZE_16G (1ULL << 34)
+#endif
 };
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (7 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon

The limited version imported previously lacks at least the SLIST macros.
Import the complete file from FreeBSD, since its license exception is
already approved by the Technical Board.
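
For example, code like the following could not be built for Windows
before this change (an illustrative fragment; `elem`, `it` and `use()`
are hypothetical):

    struct entry {
        int value;
        SLIST_ENTRY(entry) link;
    };
    SLIST_HEAD(entry_list, entry) head = SLIST_HEAD_INITIALIZER(head);

    SLIST_INSERT_HEAD(&head, elem, link);
    SLIST_FOREACH(it, &head, link)
        use(it->value);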

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (8 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-04-10 16:43   ` Dmitry Kozlyuk
  2020-04-10 22:04     ` Narcisa Ana Maria Vasile
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-10 16:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic, Anatoly Burakov, Bruce Richardson

Basic memory management supports core libraries and PMDs operating in
IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated from user-mode.
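
From an application's perspective, the memory API is unchanged; for
example, querying the IOVA of DPDK memory works as on other platforms
(an illustrative fragment):

    const struct rte_memzone *mz =
        rte_memzone_reserve("example", 1 << 20, SOCKET_ID_ANY, 0);
    if (mz != NULL)
        printf("IOVA: %" PRIx64 "\n", rte_mem_virt2iova(mz->addr));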

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                            |   2 +-
 doc/guides/windows_gsg/run_apps.rst           |  30 +
 lib/librte_eal/common/eal_common_fbarray.c    |  57 +-
 lib/librte_eal/common/eal_common_memory.c     |  50 +-
 lib/librte_eal/common/eal_private.h           |   6 +-
 lib/librte_eal/common/malloc_heap.c           |   1 +
 lib/librte_eal/common/meson.build             |   9 +
 lib/librte_eal/freebsd/eal_memory.c           |   1 -
 lib/librte_eal/meson.build                    |   1 +
 lib/librte_eal/rte_eal_exports.def            | 119 ++-
 lib/librte_eal/windows/eal.c                  |  55 ++
 lib/librte_eal/windows/eal_memalloc.c         | 423 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 702 +++++++++++++++++-
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  23 +
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |   4 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   2 +
 21 files changed, 1561 insertions(+), 67 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/config/meson.build b/config/meson.build
index bceb5ef7b..7b8baa788 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -270,7 +270,7 @@ if is_windows
 		add_project_link_arguments('-lmincore', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..e858cf8c1 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -27,6 +27,36 @@ See `Large-Page Support`_ in MSDN for details.
 .. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
 
 
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+This driver is not signed, so signature checking must be disabled to load it.
+Refer to the documentation in the ``dpdk-kmods`` repository for details on
+system setup, driver build, and installation.
+
+The compiled package, consisting of ``virt2phys.inf``, ``virt2phys.cat``, and
+``virt2phys.sys``, is installed as follows (from an elevated command prompt):
+
+.. code-block:: console
+
+    pnputil /add-driver virt2phys.inf /install
+
+When loaded successfully, the driver is shown in *Device Manager* as a *Virtual
+to physical address translator* device under the *Kernel bypass* category.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
+
+
 Run the ``helloworld`` Example
 ------------------------------
 
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 1312f936b..236db9cb7 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,15 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -85,19 +85,16 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
 		/* pass errno up the chain */
 		rte_errno = errno;
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FIXED, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -735,7 +732,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -756,9 +753,12 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		void *new_data = rte_mem_map(
+			data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_FIXED | RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS,
+			fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
 					__func__, strerror(errno));
 			goto fail;
@@ -778,7 +778,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 					__func__, path, strerror(errno));
 			rte_errno = errno;
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
 					__func__, path, strerror(errno));
 			rte_errno = EBUSY;
@@ -789,10 +790,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -824,7 +823,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -862,7 +861,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -895,10 +894,8 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -916,7 +913,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -944,8 +941,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -964,7 +960,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -999,8 +995,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1025,7 +1020,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,14 +1037,14 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index d9764681a..5c3cf1f75 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,7 +11,6 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
@@ -44,7 +43,7 @@ static uint64_t system_page_sz;
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, enum eal_mem_reserve_flags reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -52,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_get_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -98,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
-			*size -= page_sz;
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
+			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -125,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -154,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -172,12 +166,12 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	return aligned_addr;
@@ -586,10 +580,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	int page_size = rte_get_page_size();
+	uintptr_t aligned = (virtual & ~(page_size - 1));
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 76938e379..59ac41916 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -226,8 +226,8 @@ enum eal_mem_reserve_flags {
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to rte_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -244,7 +244,7 @@ enum eal_mem_reserve_flags {
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
-	int flags, int mmap_flags);
+	int flags, enum eal_mem_reserve_flags reserve_flags);
 
 /**
  * Reserve VA space for a memory segment list.
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 842eb9de7..6534c895c 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -729,6 +729,7 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg,
 		if (ret != NULL)
 			return ret;
 	}
+
 	return NULL;
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 02d9280cc..6dcdcc890 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -9,11 +9,20 @@ if is_windows
 		'eal_common_class.c',
 		'eal_common_devargs.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 		'rte_option.c',
 	)
 	subdir_done()
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5174f9cd0..99bf6ec9e 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -355,7 +355,6 @@ memseg_list_reserve(struct rte_memseg_list *msl)
 	return eal_reserve_memseg_list(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 1f89efb88..1d750f003 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -11,6 +11,7 @@ if not is_windows
 endif
 
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
 subdir(exec_env)
 
 subdir(arch_subdir)
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index bacf9a107..854b83bcd 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,13 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
-	rte_get_page_size
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
 	rte_log
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
+	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_get_page_size
 	rte_mem_lock
 	rte_mem_map
 	rte_mem_unmap
-	rte_vlog
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index cf55b56da..38f17f09c 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -93,6 +93,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -328,6 +346,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.no_shconf == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.no_shconf = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -345,6 +370,36 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..c7c3cf8df
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,423 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+#include <rte_windows.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files on Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files on Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu)\n", requested_addr, alloc_sz);
+			goto error;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	eal_mem_decommit(addr, alloc_sz);
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len))
+		return -1;
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
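+
+/* Callbacks passed to rte_memseg_list_walk() return 0 to continue
+ * iteration, a positive value to stop it successfully, and a negative
+ * value to abort with an error; alloc_seg_walk() above and
+ * free_seg_walk() below follow this contract.
+ */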
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
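+/* Allocate n_segs segments of page_sz bytes on the given socket.
+ * Returns the number of segments actually allocated (which may be fewer
+ * than n_segs unless `exact` is set) or a negative value on failure.
+ */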
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info);
+				i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
index 59606d84c..a9a35b7dc 100644
--- a/lib/librte_eal/windows/eal_memory.c
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -1,11 +1,23 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2010-2014 Intel Corporation (functions from Linux EAL)
+ * Copyright (c) 2020 Dmitry Kozlyuk (Windows specifics)
+ */
+
+#include <inttypes.h>
 #include <io.h>
 
 #include <rte_errno.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
 #include "eal_private.h"
 #include "eal_windows.h"
 
+#include <rte_virt2phys.h>
+
 /* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
  * Provide a copy of definitions and code to load it dynamically.
  * Note: definitions are copied verbatim from Microsoft documentation
@@ -120,6 +132,119 @@ eal_mem_win32api_init(void)
 
 #endif /* no VirtualAlloc2() */
 
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Windows always uses physical addresses as IOVA when they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
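+
+/* Note: alloc_seg() in eal_memalloc.c consults rte_eal_using_phys_addrs()
+ * to decide whether obtaining an IOVA is mandatory for a new segment.
+ */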
+
 /* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
 static int
 win32_alloc_error_to_errno(DWORD code)
@@ -360,7 +485,7 @@ rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
 		return NULL;
 	}
 
-	/* TODO: there is a race for the requested_addr between mem_free()
+	/* There is a race for the requested_addr between mem_free()
 	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
 	 * region with a mapping in a single operation, but it does not support
 	 * private mappings.
@@ -410,6 +535,16 @@ rte_mem_unmap(void *virt, size_t size)
 	return 0;
 }
 
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* Windows strategy for memory allocation is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless user provides an address hint.
+	 */
+	return 0;
+}
+
 int
 rte_get_page_size(void)
 {
@@ -431,3 +566,568 @@ rte_mem_lock(const void *virt, size_t size)
 
 	return 0;
 }
+
+static int
+memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx)
+{
+	return eal_alloc_memseg_list(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
+}
+
+static int
+memseg_list_reserve(struct rte_memseg_list *msl)
+{
+	return eal_reserve_memseg_list(msl, 0);
+}
+
+/*
+ * Remaining code in this file largely duplicates Linux EAL.
+ * Although Windows EAL supports only one hugepage size currently,
+ * code structure and comments are preserved so that changes may be
+ * easily ported until duplication is removed.
+ */
+
+static int
+memseg_primary_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
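+	 * (e.g. with the default RTE_MAX_MEMSEG_LISTS of 128 and four
+	 * memory types, each type may take up to 32 lists)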
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how much segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (memseg_list_alloc(msl, pagesz, n_segs,
+					socket_id, cur_seglist))
+				goto out;
+
+			if (memseg_list_reserve(msl)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int
+memseg_secondary_init(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+		return memseg_primary_init();
+	return memseg_secondary_init();
+}
+
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+static int
+calc_num_pages_per_socket(uint64_t *memory,
+		struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used,
+		unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from cpu mask present
+		 * on each socket.
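+		 * For example, with 8 lcores of which 6 are on socket 0,
+		 * a 1024 MB request splits into 768 MB and 256 MB.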
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skips if the memory on specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so lets see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int) (internal_config.memory / 0x100000);
+		available = requested - (unsigned int) (total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
+
+/* Limit is checked by the validator itself, nothing left to analyze. */
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+static int
+eal_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		/* also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; socket_id++)
+		memory[socket_id] = internal_config.socket_mem[socket_id];
+
+	/* calculate final number of pages */
+	if (calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+				if (pages == NULL)
+					return -1;
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket "
+					"limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs, cur_seg;
+	uint64_t page_sz;
+	void *addr;
+	struct rte_fbarray *arr;
+	struct rte_memseg *ms;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	/* create a memseg list */
+	msl = &mcfg->memsegs[0];
+
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = internal_config.memory / page_sz;
+
+	if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
+		sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		return -1;
+	}
+
+	addr = eal_mem_alloc(internal_config.memory, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate %zu bytes",
+		internal_config.memory);
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->page_sz = page_sz;
+	msl->socket_id = 0;
+	msl->len = internal_config.memory;
+	msl->heap = 1;
+
+	/* populate memsegs. each memseg is one page long */
+	for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
+		arr = &msl->memseg_arr;
+
+		ms = rte_fbarray_get(arr, cur_seg);
+		ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, cur_seg);
+
+		addr = RTE_PTR_ADD(addr, (size_t)page_sz);
+	}
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, the stub must log a warning instead,
+ * and a comment must document which code relies on the emulated success.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Non-stub function succeeds if multi-process is not supported. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* Common memory allocator depends on this function success. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index b202a1aa5..083ab8b93 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,6 +52,13 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
 /**
  * Locate Win32 memory management routines in system libraries.
  *
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..62805a307 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define open _open
+#define close _close
+#define unlink _unlink
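+
+/* The defines above map POSIX file API names to their Microsoft CRT
+ * equivalents declared in <io.h>. */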
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
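+
+/*
+ * Illustrative user-mode usage (a sketch, not part of the interface;
+ * `device` is a handle opened via SetupAPI, as in eal_mem_virt2iova_init()):
+ *
+ *	void *virt = ...;	// virtual address to translate
+ *	LARGE_INTEGER phys;
+ *	DWORD bytes;
+ *	BOOL ok = DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
+ *		&virt, sizeof(virt), &phys, sizeof(phys), &bytes, NULL);
+ */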
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 81d3ee095..2f4fa91a9 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -8,7 +8,9 @@ sources += files(
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memalloc.c',
 	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'getopt.c',
 )
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-10  2:50     ` Dmitry Kozlyuk
  2020-04-10  2:59       ` Dmitry Kozlyuk
@ 2020-04-10 19:39       ` Ranjit Menon
  1 sibling, 0 replies; 218+ messages in thread
From: Ranjit Menon @ 2020-04-10 19:39 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV), Narcisa Ana Maria Vasile

On 4/9/2020 7:50 PM, Dmitry Kozlyuk wrote:
>>> +
>>> +_Use_decl_annotations_
>>> +VOID
>>> +virt2phys_device_EvtIoInCallerContext(
>>> +	IN WDFDEVICE device, IN WDFREQUEST request)
>>> +{
>>> +	WDF_REQUEST_PARAMETERS params;
>>> +	ULONG code;
>>> +	PVOID *virt;
>>
>> Should this be PVOID virt; (instead of PVOID *virt)?
>> If so, changes will be required to parameters passed in to
>> WdfRequestRetrieveInputBuffer() call.
> 
> This should be PVOID *virt (pointer to an untyped pointer). User-mode passes
> a virtual address as a PVOID value, WdfRequestRetrieveInputBuffer() fills
> virt with the address of that parameter, so that *virt is the virtual address
> user-mode wants to translate into a physical one.
> 

Makes sense. Thanks for the explanation, Dmitry.

>>
>>> +	PHYSICAL_ADDRESS *phys;
>>> +	size_t size;
>>> +	NTSTATUS status;
>>> +
> [snip]
>>> +
>>> +	status = WdfRequestRetrieveOutputBuffer(
>>> +		request, sizeof(*phys), &phys, &size);
>>
>> Better to put a (PVOID *)typecast for &phys here:
>> 	status = WdfRequestRetrieveOutputBuffer(
>> 		request, sizeof(*phys), (PVOID *)&phys, &size);
> 
> What do you mean? Without a typecast the built-in static analyzer emits a
> warning (and all warnings are treated as errors for a driver):
> 
> virt2phys.c(108,46): error C2220: the following warning is treated as an error
> virt2phys.c(108,46): warning C4047: 'function': 'PVOID *' differs in levels of indirection from 'PVOID **'
> virt2phys.c(108,46): warning C4022: 'WdfRequestRetrieveInputBuffer': pointer mismatch for actual parameter 3
> 
>>> +	if (!NT_SUCCESS(status)) {
>>> +		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
>>> +			"status=%08x\n", status));
>>> +		WdfRequestComplete(request, status);
>>> +		return;
>>> +	}
>>> +
>>> +	*phys = MmGetPhysicalAddress(*virt);
>>> +
>>> +	WdfRequestCompleteWithInformation(
>>> +		request, STATUS_SUCCESS, sizeof(*phys));
>>> +}
>>
>> <Snip!>
>>
>> Co-installers are no longer required (and discouraged) as per Microsoft.
>> So you can remove the lines indicated below from the .inf file.
> 
> Thanks, will remove in v2. They were generated by WDK solution template.
> 

I see the above changes in v2. Thanks!

ranjit m.

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-04-10 22:04     ` Narcisa Ana Maria Vasile
  2020-04-11  1:16       ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Narcisa Ana Maria Vasile @ 2020-04-10 22:04 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic,
	Anatoly Burakov, Bruce Richardson

On Fri, Apr 10, 2020 at 07:43:42PM +0300, Dmitry Kozlyuk wrote:
> Basic memory management supports core libraries and PMDs operating in
> IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
> IOVAs of hugepages allocated from user-mode.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  config/meson.build                            |   2 +-
>  doc/guides/windows_gsg/run_apps.rst           |  30 +
>  lib/librte_eal/common/eal_common_fbarray.c    |  57 +-
>  lib/librte_eal/common/eal_common_memory.c     |  50 +-
>  lib/librte_eal/common/eal_private.h           |   6 +-
>  lib/librte_eal/common/malloc_heap.c           |   1 +
>  lib/librte_eal/common/meson.build             |   9 +
>  lib/librte_eal/freebsd/eal_memory.c           |   1 -
>  lib/librte_eal/meson.build                    |   1 +
>  lib/librte_eal/rte_eal_exports.def            | 119 ++-
>  lib/librte_eal/windows/eal.c                  |  55 ++
>  lib/librte_eal/windows/eal_memalloc.c         | 423 +++++++++++
>  lib/librte_eal/windows/eal_memory.c           | 702 +++++++++++++++++-
>  lib/librte_eal/windows/eal_mp.c               | 103 +++
>  lib/librte_eal/windows/eal_windows.h          |  23 +
>  lib/librte_eal/windows/include/meson.build    |   1 +
>  lib/librte_eal/windows/include/rte_os.h       |   4 +
>  .../windows/include/rte_virt2phys.h           |  34 +
>  lib/librte_eal/windows/include/rte_windows.h  |   2 +
>  lib/librte_eal/windows/include/unistd.h       |   3 +
>  lib/librte_eal/windows/meson.build            |   2 +
>  21 files changed, 1561 insertions(+), 67 deletions(-)
>  create mode 100644 lib/librte_eal/windows/eal_memalloc.c
>  create mode 100644 lib/librte_eal/windows/eal_mp.c
>  create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h
> 
> diff --git a/config/meson.build b/config/meson.build
> index bceb5ef7b..7b8baa788 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -270,7 +270,7 @@ if is_windows
>  		add_project_link_arguments('-lmincore', language: 'c')
>  	endif
>  
> -	add_project_link_arguments('-ladvapi32', language: 'c')
> +	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
>  endif
>  
>  if get_option('b_lto')
> diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
> index 21ac7f6c1..e858cf8c1 100644
> --- a/doc/guides/windows_gsg/run_apps.rst
> +++ b/doc/guides/windows_gsg/run_apps.rst
> @@ -27,6 +27,36 @@ See `Large-Page Support`_ in MSDN for details.
>  .. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
>  
>  
> +Load virt2phys Driver
> +---------------------
> +
> +Access to physical addresses is provided by a kernel-mode driver, virt2phys.
> +It is mandatory at least for using hardware PMDs, but may also be required
> +for mempools.
> +
> +This driver is not signed, so signature checking must be disabled to load it.
> +Refer to documentation in ``dpdk-kmods`` repository for details on system
> +setup, driver build and installation.
> +
> +The compiled package, consisting of ``virt2phys.inf``, ``virt2phys.cat``,
> +and ``virt2phys.sys``, is installed as follows (from an elevated command line):
> +
> +.. code-block:: console
> +
> +    pnputil /add-driver virt2phys.inf /install
> +
> +When loaded successfully, the driver is shown in *Device Manager* as *Virtual
> +to physical address translator* device under the *Kernel bypass* category.
> +
> +If DPDK is unable to communicate with the driver, a warning is printed
> +on initialization (debug-level logs provide more details):
> +
> +.. code-block:: text
> +
> +    EAL: Cannot open virt2phys driver interface
> +
> +
> +
>  Run the ``helloworld`` Example
>  ------------------------------
>  
> diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
> index 1312f936b..236db9cb7 100644
> --- a/lib/librte_eal/common/eal_common_fbarray.c
> +++ b/lib/librte_eal/common/eal_common_fbarray.c
> @@ -5,15 +5,15 @@
>  #include <fcntl.h>
>  #include <inttypes.h>
>  #include <limits.h>
> -#include <sys/mman.h>
>  #include <stdint.h>
>  #include <errno.h>
> -#include <sys/file.h>
>  #include <string.h>
> +#include <unistd.h>
>  
>  #include <rte_common.h>
> -#include <rte_log.h>
>  #include <rte_errno.h>
> +#include <rte_log.h>
> +#include <rte_memory.h>
>  #include <rte_spinlock.h>
>  #include <rte_tailq.h>
>  
> @@ -85,19 +85,16 @@ resize_and_map(int fd, void *addr, size_t len)
>  	char path[PATH_MAX];
>  	void *map_addr;
>  
> -	if (ftruncate(fd, len)) {
> +	if (eal_file_truncate(fd, len)) {
>  		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
>  		/* pass errno up the chain */
>  		rte_errno = errno;
>  		return -1;
>  	}
>  
> -	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
> -			MAP_SHARED | MAP_FIXED, fd, 0);
> +	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
> +			RTE_MAP_SHARED | RTE_MAP_FIXED, fd, 0);
>  	if (map_addr != addr) {
> -		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
> -		/* pass errno up the chain */
> -		rte_errno = errno;
>  		return -1;
>  	}
>  	return 0;
> @@ -735,7 +732,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  		return -1;
>  	}
>  
> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1) {
>  		free(ma);
>  		return -1;
> @@ -756,9 +753,12 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  
>  	if (internal_config.no_shconf) {
>  		/* remap virtual area as writable */
> -		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
> -				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
> -		if (new_data == MAP_FAILED) {
> +		void *new_data = rte_mem_map(
> +			data, mmap_len,
> +			RTE_PROT_READ | RTE_PROT_WRITE,
> +			RTE_MAP_FIXED | RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS,
> +			fd, 0);
> +		if (new_data == NULL) {
>  			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
>  					__func__, strerror(errno));
>  			goto fail;
> @@ -778,7 +778,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  					__func__, path, strerror(errno));
>  			rte_errno = errno;
>  			goto fail;
> -		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
> +		} else if (eal_file_lock(
> +				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
>  			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
>  					__func__, path, strerror(errno));
>  			rte_errno = EBUSY;
> @@ -789,10 +790,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  		 * still attach to it, but no other process could reinitialize
>  		 * it.
>  		 */
> -		if (flock(fd, LOCK_SH | LOCK_NB)) {
> -			rte_errno = errno;
> +		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
>  			goto fail;
> -		}
>  
>  		if (resize_and_map(fd, data, mmap_len))
>  			goto fail;
> @@ -824,7 +823,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  	return 0;
>  fail:
>  	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>  	if (fd >= 0)
>  		close(fd);
>  	free(ma);
> @@ -862,7 +861,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  		return -1;
>  	}
>  
> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1) {
>  		free(ma);
>  		return -1;
> @@ -895,10 +894,8 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  	}
>  
>  	/* lock the file, to let others know we're using it */
> -	if (flock(fd, LOCK_SH | LOCK_NB)) {
> -		rte_errno = errno;
> +	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
>  		goto fail;
> -	}
>  
>  	if (resize_and_map(fd, data, mmap_len))
>  		goto fail;
> @@ -916,7 +913,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  	return 0;
>  fail:
>  	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>  	if (fd >= 0)
>  		close(fd);
>  	free(ma);
> @@ -944,8 +941,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
>  	 * really do anything about it, things will blow up either way.
>  	 */
>  
> -	size_t page_sz = sysconf(_SC_PAGESIZE);
> -
> +	size_t page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1)
>  		return -1;
>  
> @@ -964,7 +960,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
>  		goto out;
>  	}
>  
> -	munmap(arr->data, mmap_len);
> +	rte_mem_unmap(arr->data, mmap_len);
>  
>  	/* area is unmapped, close fd and remove the tailq entry */
>  	if (tmp->fd >= 0)
> @@ -999,8 +995,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  	 * really do anything about it, things will blow up either way.
>  	 */
>  
> -	size_t page_sz = sysconf(_SC_PAGESIZE);
> -
> +	size_t page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1)
>  		return -1;
>  
> @@ -1025,7 +1020,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  		 * has been detached by all other processes
>  		 */
>  		fd = tmp->fd;
> -		if (flock(fd, LOCK_EX | LOCK_NB)) {
> +		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
>  			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
>  			rte_errno = EBUSY;
>  			ret = -1;
> @@ -1042,14 +1037,14 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  			 * we're still holding an exclusive lock, so drop it to
>  			 * shared.
>  			 */
> -			flock(fd, LOCK_SH | LOCK_NB);
> +			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
>  
>  			ret = -1;
>  			goto out;
>  		}
>  		close(fd);
>  	}
> -	munmap(arr->data, mmap_len);
> +	rte_mem_unmap(arr->data, mmap_len);
>  
>  	/* area is unmapped, remove the tailq entry */
>  	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
> diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
> index d9764681a..5c3cf1f75 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -11,7 +11,6 @@
>  #include <string.h>
>  #include <unistd.h>
>  #include <inttypes.h>
> -#include <sys/mman.h>
>  #include <sys/queue.h>
>  
>  #include <rte_fbarray.h>
> @@ -44,7 +43,7 @@ static uint64_t system_page_sz;
>  #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
>  void *
>  eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags)
> +	size_t page_sz, int flags, enum eal_mem_reserve_flags reserve_flags)
>  {
>  	bool addr_is_hint, allow_shrink, unmap, no_align;
>  	uint64_t map_sz;
> @@ -52,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  	uint8_t try = 0;
>  
>  	if (system_page_sz == 0)
> -		system_page_sz = sysconf(_SC_PAGESIZE);
> -
> -	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
> +		system_page_sz = rte_get_page_size();
>  
>  	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
>  
> @@ -98,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  			return NULL;
>  		}
>  
> -		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
> -				mmap_flags, -1, 0);
> -		if (mapped_addr == MAP_FAILED && allow_shrink)
> -			*size -= page_sz;
> +		mapped_addr = eal_mem_reserve(
> +			requested_addr, (size_t)map_sz, reserve_flags);
> +		if ((mapped_addr == NULL) && allow_shrink)
> +			*size -= page_sz;
>  
> -		if (mapped_addr != MAP_FAILED && addr_is_hint &&
> -		    mapped_addr != requested_addr) {
> +		if ((mapped_addr != NULL) && addr_is_hint &&
> +				(mapped_addr != requested_addr)) {
>  			try++;
>  			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
>  			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
>  				/* hint was not used. Try with another offset */
> -				munmap(mapped_addr, map_sz);
> -				mapped_addr = MAP_FAILED;
> +				eal_mem_free(mapped_addr, *size);
> +				mapped_addr = NULL;
>  				requested_addr = next_baseaddr;
>  			}
>  		}
>  	} while ((allow_shrink || addr_is_hint) &&
> -		 mapped_addr == MAP_FAILED && *size > 0);
> +		(mapped_addr == NULL) && (*size > 0));
>  
>  	/* align resulting address - if map failed, we will ignore the value
>  	 * anyway, so no need to add additional checks.
> @@ -125,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  
>  	if (*size == 0) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
> -			strerror(errno));
> -		rte_errno = errno;
> +			strerror(rte_errno));
>  		return NULL;
> -	} else if (mapped_addr == MAP_FAILED) {
> +	} else if (mapped_addr == NULL) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
> -			strerror(errno));
> -		/* pass errno up the call chain */
> -		rte_errno = errno;
> +			strerror(rte_errno));
>  		return NULL;
>  	} else if (requested_addr != NULL && !addr_is_hint &&
>  			aligned_addr != requested_addr) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
>  			requested_addr, aligned_addr);
> -		munmap(mapped_addr, map_sz);
> +		eal_mem_free(mapped_addr, map_sz);
>  		rte_errno = EADDRNOTAVAIL;
>  		return NULL;
>  	} else if (requested_addr != NULL && addr_is_hint &&
> @@ -154,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  		aligned_addr, *size);
>  
>  	if (unmap) {
> -		munmap(mapped_addr, map_sz);
> +		eal_mem_free(mapped_addr, map_sz);
>  	} else if (!no_align) {
>  		void *map_end, *aligned_end;
>  		size_t before_len, after_len;
> @@ -172,12 +166,12 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  		/* unmap space before aligned mmap address */
>  		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
>  		if (before_len > 0)
> -			munmap(mapped_addr, before_len);
> +			eal_mem_free(mapped_addr, before_len);
>  
>  		/* unmap space after aligned end mmap address */
>  		after_len = RTE_PTR_DIFF(map_end, aligned_end);
>  		if (after_len > 0)
> -			munmap(aligned_end, after_len);
> +			eal_mem_free(aligned_end, after_len);
>  	}
>  
>  	return aligned_addr;
> @@ -586,10 +580,10 @@ rte_eal_memdevice_init(void)
>  int
>  rte_mem_lock_page(const void *virt)
>  {
> -	unsigned long virtual = (unsigned long)virt;
> -	int page_size = getpagesize();
> -	unsigned long aligned = (virtual & ~(page_size - 1));
> -	return mlock((void *)aligned, page_size);
> +	uintptr_t virtual = (uintptr_t)virt;
> +	int page_size = rte_get_page_size();
> +	uintptr_t aligned = (virtual & ~(page_size - 1));
> +	return rte_mem_lock((void *)aligned, page_size);
>  }
>  
>  int
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index 76938e379..59ac41916 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -226,8 +226,8 @@ enum eal_mem_reserve_flags {
>   *   Page size on which to align requested virtual area.
>   * @param flags
>   *   EAL_VIRTUAL_AREA_* flags.
> - * @param mmap_flags
> - *   Extra flags passed directly to mmap().
> + * @param reserve_flags
> + *   Extra flags passed directly to rte_mem_reserve().
>   *
>   * @return
>   *   Virtual area address if successful.
> @@ -244,7 +244,7 @@ enum eal_mem_reserve_flags {
>  /**< immediately unmap reserved virtual area. */
>  void *
>  eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
> -	int flags, int mmap_flags);
> +	int flags, enum eal_mem_reserve_flags reserve_flags);
>  
>  /**
>   * Reserve VA space for a memory segment list.
> diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
> index 842eb9de7..6534c895c 100644
> --- a/lib/librte_eal/common/malloc_heap.c
> +++ b/lib/librte_eal/common/malloc_heap.c
> @@ -729,6 +729,7 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg,
>  		if (ret != NULL)
>  			return ret;
>  	}
> +
>  	return NULL;
>  }
>  
> diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
> index 02d9280cc..6dcdcc890 100644
> --- a/lib/librte_eal/common/meson.build
> +++ b/lib/librte_eal/common/meson.build
> @@ -9,11 +9,20 @@ if is_windows
>  		'eal_common_class.c',
>  		'eal_common_devargs.c',
>  		'eal_common_errno.c',
> +		'eal_common_fbarray.c',
>  		'eal_common_launch.c',
>  		'eal_common_lcore.c',
>  		'eal_common_log.c',
> +		'eal_common_mcfg.c',
> +		'eal_common_memalloc.c',
> +		'eal_common_memory.c',
> +		'eal_common_memzone.c',
>  		'eal_common_options.c',
> +		'eal_common_tailqs.c',
>  		'eal_common_thread.c',
> +		'malloc_elem.c',
> +		'malloc_heap.c',
> +		'rte_malloc.c',
>  		'rte_option.c',
>  	)
>  	subdir_done()
> diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
> index 5174f9cd0..99bf6ec9e 100644
> --- a/lib/librte_eal/freebsd/eal_memory.c
> +++ b/lib/librte_eal/freebsd/eal_memory.c
> @@ -355,7 +355,6 @@ memseg_list_reserve(struct rte_memseg_list *msl)
>  	return eal_reserve_memseg_list(msl, flags);
>  }
>  
> -
>  static int
>  memseg_primary_init(void)
>  {
> diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
> index 1f89efb88..1d750f003 100644
> --- a/lib/librte_eal/meson.build
> +++ b/lib/librte_eal/meson.build
> @@ -11,6 +11,7 @@ if not is_windows
>  endif
>  
>  dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
> +dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
>  subdir(exec_env)
>  
>  subdir(arch_subdir)
> diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
> index bacf9a107..854b83bcd 100644
> --- a/lib/librte_eal/rte_eal_exports.def
> +++ b/lib/librte_eal/rte_eal_exports.def
> @@ -1,13 +1,128 @@
>  EXPORTS
>  	__rte_panic
> +	rte_calloc
> +	rte_calloc_socket
>  	rte_eal_get_configuration
> +	rte_eal_has_hugepages
>  	rte_eal_init
> +	rte_eal_iova_mode
>  	rte_eal_mp_remote_launch
>  	rte_eal_mp_wait_lcore
> +	rte_eal_process_type
>  	rte_eal_remote_launch
> -	rte_get_page_size
> +	rte_eal_tailq_lookup
> +	rte_eal_tailq_register
> +	rte_eal_using_phys_addrs
> +	rte_free
>  	rte_log
> +	rte_malloc
> +	rte_malloc_dump_stats
> +	rte_malloc_get_socket_stats
> +	rte_malloc_set_limit
> +	rte_malloc_socket
> +	rte_malloc_validate
> +	rte_malloc_virt2iova
> +	rte_mcfg_mem_read_lock
> +	rte_mcfg_mem_read_unlock
> +	rte_mcfg_mem_write_lock
> +	rte_mcfg_mem_write_unlock
> +	rte_mcfg_mempool_read_lock
> +	rte_mcfg_mempool_read_unlock
> +	rte_mcfg_mempool_write_lock
> +	rte_mcfg_mempool_write_unlock
> +	rte_mcfg_tailq_read_lock
> +	rte_mcfg_tailq_read_unlock
> +	rte_mcfg_tailq_write_lock
> +	rte_mcfg_tailq_write_unlock
> +	rte_mem_lock_page
> +	rte_mem_virt2iova
> +	rte_mem_virt2phy
> +	rte_memory_get_nchannel
> +	rte_memory_get_nrank
> +	rte_memzone_dump
> +	rte_memzone_free
> +	rte_memzone_lookup
> +	rte_memzone_reserve
> +	rte_memzone_reserve_aligned
> +	rte_memzone_reserve_bounded
> +	rte_memzone_walk
> +	rte_vlog
> +	rte_realloc
> +	rte_zmalloc
> +	rte_zmalloc_socket
> +
> +	rte_mp_action_register
> +	rte_mp_action_unregister
> +	rte_mp_reply
> +	rte_mp_sendmsg
> +
> +	rte_fbarray_attach
> +	rte_fbarray_destroy
> +	rte_fbarray_detach
> +	rte_fbarray_dump_metadata
> +	rte_fbarray_find_contig_free
> +	rte_fbarray_find_contig_used
> +	rte_fbarray_find_idx
> +	rte_fbarray_find_next_free
> +	rte_fbarray_find_next_n_free
> +	rte_fbarray_find_next_n_used
> +	rte_fbarray_find_next_used
> +	rte_fbarray_get
> +	rte_fbarray_init
> +	rte_fbarray_is_used
> +	rte_fbarray_set_free
> +	rte_fbarray_set_used
> +	rte_malloc_dump_heaps
> +	rte_mem_alloc_validator_register
> +	rte_mem_alloc_validator_unregister
> +	rte_mem_check_dma_mask
> +	rte_mem_event_callback_register
> +	rte_mem_event_callback_unregister
> +	rte_mem_iova2virt
> +	rte_mem_virt2memseg
> +	rte_mem_virt2memseg_list
> +	rte_memseg_contig_walk
> +	rte_memseg_list_walk
> +	rte_memseg_walk
> +	rte_mp_request_async
> +	rte_mp_request_sync
> +
> +	rte_fbarray_find_prev_free
> +	rte_fbarray_find_prev_n_free
> +	rte_fbarray_find_prev_n_used
> +	rte_fbarray_find_prev_used
> +	rte_fbarray_find_rev_contig_free
> +	rte_fbarray_find_rev_contig_used
> +	rte_memseg_contig_walk_thread_unsafe
> +	rte_memseg_list_walk_thread_unsafe
> +	rte_memseg_walk_thread_unsafe
> +
> +	rte_malloc_heap_create
> +	rte_malloc_heap_destroy
> +	rte_malloc_heap_get_socket
> +	rte_malloc_heap_memory_add
> +	rte_malloc_heap_memory_attach
> +	rte_malloc_heap_memory_detach
> +	rte_malloc_heap_memory_remove
> +	rte_malloc_heap_socket_is_external
> +	rte_mem_check_dma_mask_thread_unsafe
> +	rte_mem_set_dma_mask
> +	rte_memseg_get_fd
> +	rte_memseg_get_fd_offset
> +	rte_memseg_get_fd_offset_thread_unsafe
> +	rte_memseg_get_fd_thread_unsafe
> +
> +	rte_extmem_attach
> +	rte_extmem_detach
> +	rte_extmem_register
> +	rte_extmem_unregister
> +
> +	rte_fbarray_find_biggest_free
> +	rte_fbarray_find_biggest_used
> +	rte_fbarray_find_rev_biggest_free
> +	rte_fbarray_find_rev_biggest_used
> +
> +	rte_get_page_size
>  	rte_mem_lock
>  	rte_mem_map
>  	rte_mem_unmap
> -	rte_vlog
> diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
> index cf55b56da..38f17f09c 100644
> --- a/lib/librte_eal/windows/eal.c
> +++ b/lib/librte_eal/windows/eal.c
> @@ -93,6 +93,24 @@ eal_proc_type_detect(void)
>  	return ptype;
>  }
>  
> +enum rte_proc_type_t
> +rte_eal_process_type(void)
> +{
> +	return rte_config.process_type;
> +}
> +
> +int
> +rte_eal_has_hugepages(void)
> +{
> +	return !internal_config.no_hugetlbfs;
> +}
> +
> +enum rte_iova_mode
> +rte_eal_iova_mode(void)
> +{
> +	return rte_config.iova_mode;
> +}
> +
>  /* display usage */
>  static void
>  eal_usage(const char *prgname)
> @@ -328,6 +346,13 @@ rte_eal_init(int argc, char **argv)
>  	if (fctret < 0)
>  		exit(1);
>  
> +	/* Prevent creation of shared memory files. */
> +	if (internal_config.no_shconf == 0) {
> +		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
> +			"but not available.\n");
> +		internal_config.no_shconf = 1;
> +	}
> +
>  	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
>  		rte_eal_init_alert("Cannot get hugepage information");
>  		rte_errno = EACCES;
> @@ -345,6 +370,36 @@ rte_eal_init(int argc, char **argv)
>  		return -1;
>  	}
>  
> +	if (eal_mem_virt2iova_init() < 0) {
> +		/* Non-fatal error if physical addresses are not required. */
> +		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
> +			"PA will not be available\n");
> +	}
> +
> +	if (rte_eal_memzone_init() < 0) {
> +		rte_eal_init_alert("Cannot init memzone");
> +		rte_errno = ENODEV;
> +		return -1;
> +	}
> +
> +	if (rte_eal_memory_init() < 0) {
> +		rte_eal_init_alert("Cannot init memory");
> +		rte_errno = ENOMEM;
> +		return -1;
> +	}
> +
> +	if (rte_eal_malloc_heap_init() < 0) {
> +		rte_eal_init_alert("Cannot init malloc heap");
> +		rte_errno = ENODEV;
> +		return -1;
> +	}
> +
> +	if (rte_eal_tailqs_init() < 0) {
> +		rte_eal_init_alert("Cannot init tail queues for objects");
> +		rte_errno = EFAULT;
> +		return -1;
> +	}
> +
>  	eal_thread_init_master(rte_config.master_lcore);
>  
>  	RTE_LCORE_FOREACH_SLAVE(i) {
> diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
> new file mode 100644
> index 000000000..c7c3cf8df
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_memalloc.c
> @@ -0,0 +1,423 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <rte_errno.h>
> +#include <rte_os.h>
> +#include <rte_windows.h>
> +
> +#include "eal_internal_cfg.h"
> +#include "eal_memalloc.h"
> +#include "eal_memcfg.h"
> +#include "eal_private.h"
> +#include "eal_windows.h"
> +
> +int
> +eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
> +{
> +	/* Hugepages have no associated files in Windows. */
> +	RTE_SET_USED(list_idx);
> +	RTE_SET_USED(seg_idx);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
> +{
> +	/* Hugepages have no associated files in Windows. */
> +	RTE_SET_USED(list_idx);
> +	RTE_SET_USED(seg_idx);
> +	RTE_SET_USED(offset);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +static int
> +alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
> +	struct hugepage_info *hi)
> +{
> +	HANDLE current_process;
> +	unsigned int numa_node;
> +	size_t alloc_sz;
> +	void *addr;
> +	rte_iova_t iova = RTE_BAD_IOVA;
> +	PSAPI_WORKING_SET_EX_INFORMATION info;
> +	PSAPI_WORKING_SET_EX_BLOCK *page;
> +
> +	if (ms->len > 0) {
> +		/* If a segment is already allocated as needed, return it. */
> +		if ((ms->addr == requested_addr) &&
> +			(ms->socket_id == socket_id) &&
> +			(ms->hugepage_sz == hi->hugepage_sz)) {
> +			return 0;
> +		}
> +
> +		/* Bugcheck, should not happen. */
> +		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
> +			"(size %zu) on socket %d\n", ms->addr,
> +			ms->len, ms->socket_id);
> +		return -1;
> +	}
> +
> +	current_process = GetCurrentProcess();
> +	numa_node = eal_socket_numa_node(socket_id);
> +	alloc_sz = hi->hugepage_sz;
> +
> +	if (requested_addr == NULL) {
> +		/* Request a new chunk of memory and enforce address hint. */

Does requested_addr being NULL mean that no hint was provided? It also looks
like eal_mem_alloc_socket ignores the address hint anyway and just calls
VirtualAllocExNuma with NULL for lpAddress. Maybe remove the second part of
the comment.
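
For context, the call described above would look roughly like this (a sketch
inferred from the comment, not the exact patch code; note that lpAddress is
NULL, so no hint ever reaches the OS):

	/* Hypothetical shape of eal_mem_alloc_socket(). */
	return VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
		MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
		PAGE_READWRITE, eal_socket_numa_node(socket_id));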

> +		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
> +		if (addr == NULL) {
> +			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
> +				"on socket %d\n", alloc_sz, socket_id);
> +			return -1;
> +		}
> +
> +		if (addr != requested_addr) {

requested_addr is NULL on this branch, and the previous 'if' confirmed that
addr is not NULL, so this condition is always true. Should this branch be
removed, since no hint was provided when requested_addr is NULL?
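
A stripped-down view of the observation (hypothetical simplification of the
code above):

	addr = eal_mem_alloc_socket(alloc_sz, socket_id);
	if (addr == NULL)
		return -1;
	/* Here requested_addr == NULL and addr != NULL, so the check
	 * below is always true and the error path is always taken. */
	if (addr != requested_addr)
		goto error;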

> +			RTE_LOG(DEBUG, EAL, "Address hint %p not respected, "
> +				"got %p\n", requested_addr, addr);
> +			goto error;
> +		}
> +	} else {
> +		/* Requested address is already reserved, commit memory. */
> +		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
> +		if (addr == NULL) {
> +			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
> +				"(size %zu)\n", requested_addr, alloc_sz);
> +			goto error;

Execution jumps to 'error' while addr is still NULL, so eal_mem_decommit()
will be called with NULL as its parameter. Instead of 'goto error', maybe we
should return here.
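
Something along these lines (a sketch of the suggested change):

	addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
	if (addr == NULL) {
		RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
			"(size %zu)\n", requested_addr, alloc_sz);
		return -1; /* addr is NULL, nothing to decommit */
	}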

> +		}
> +	}
> +
> +	/* Force OS to allocate a physical page and select a NUMA node.
> +	 * Hugepages are not pageable in Windows, so there's no race
> +	 * for physical address.
> +	 */
> +	*(volatile int *)addr = *(volatile int *)addr;
> +
> +	/* Only try to obtain IOVA if it's available, so that applications
> +	 * that do not need IOVA can use this allocator.
> +	 */
> +	if (rte_eal_using_phys_addrs()) {
> +		iova = rte_mem_virt2iova(addr);
> +		if (iova == RTE_BAD_IOVA) {
> +			RTE_LOG(DEBUG, EAL,
> +				"Cannot get IOVA of allocated segment\n");
> +			goto error;
> +		}
> +	}
> +
> +	/* Only the "Ex" function can handle hugepages. */
> +	info.VirtualAddress = addr;
> +	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
> +		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
> +		goto error;
> +	}
> +
> +	page = &info.VirtualAttributes;
> +	if (!page->Valid || !page->LargePage) {
> +		RTE_LOG(DEBUG, EAL, "Got regular page instead of hugepage\n");
> +		goto error;
> +	}
> +	if (page->Node != numa_node) {
> +		RTE_LOG(DEBUG, EAL,
> +			"NUMA node hint %u (socket %d) not respected, got %u\n",
> +			numa_node, socket_id, page->Node);
> +		goto error;
> +	}
> +
> +	ms->addr = addr;
> +	ms->hugepage_sz = hi->hugepage_sz;
> +	ms->len = alloc_sz;
> +	ms->nchannel = rte_memory_get_nchannel();
> +	ms->nrank = rte_memory_get_nrank();
> +	ms->iova = iova;
> +	ms->socket_id = socket_id;
> +
> +	return 0;
> +
> +error:
> +	/* Only jump here when `addr` and `alloc_sz` are valid. */
> +	eal_mem_decommit(addr, alloc_sz);
> +	return -1;
> +}
> +
> +static int
> +free_seg(struct rte_memseg *ms)
> +{
> +	if (eal_mem_decommit(ms->addr, ms->len))
> +		return -1;
> +
> +	/* Must clear the segment, because alloc_seg() inspects it. */
> +	memset(ms, 0, sizeof(*ms));
> +	return 0;
> +}
> +
> +struct alloc_walk_param {
> +	struct hugepage_info *hi;
> +	struct rte_memseg **ms;
> +	size_t page_sz;
> +	unsigned int segs_allocated;
> +	unsigned int n_segs;
> +	int socket;
> +	bool exact;
> +};
> +
> +static int
> +alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	struct alloc_walk_param *wa = arg;
> +	struct rte_memseg_list *cur_msl;
> +	size_t page_sz;
> +	int cur_idx, start_idx, j;
> +	unsigned int msl_idx, need, i;
> +
> +	if (msl->page_sz != wa->page_sz)
> +		return 0;
> +	if (msl->socket_id != wa->socket)
> +		return 0;
> +
> +	page_sz = (size_t)msl->page_sz;
> +
> +	msl_idx = msl - mcfg->memsegs;
> +	cur_msl = &mcfg->memsegs[msl_idx];
> +
> +	need = wa->n_segs;
> +
> +	/* try finding space in memseg list */
> +	if (wa->exact) {
> +		/* if we require exact number of pages in a list, find them */
> +		cur_idx = rte_fbarray_find_next_n_free(
> +			&cur_msl->memseg_arr, 0, need);
> +		if (cur_idx < 0)
> +			return 0;
> +		start_idx = cur_idx;
> +	} else {
> +		int cur_len;
> +
> +		/* we don't require exact number of pages, so we're going to go
> +		 * for best-effort allocation. that means finding the biggest
> +		 * unused block, and going with that.
> +		 */
> +		cur_idx = rte_fbarray_find_biggest_free(
> +			&cur_msl->memseg_arr, 0);
> +		if (cur_idx < 0)
> +			return 0;
> +		start_idx = cur_idx;
> +		/* adjust the size to possibly be smaller than original
> +		 * request, but do not allow it to be bigger.
> +		 */
> +		cur_len = rte_fbarray_find_contig_free(
> +			&cur_msl->memseg_arr, cur_idx);
> +		need = RTE_MIN(need, (unsigned int)cur_len);
> +	}
> +
> +	for (i = 0; i < need; i++, cur_idx++) {
> +		struct rte_memseg *cur;
> +		void *map_addr;
> +
> +		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
> +		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
> +
> +		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
> +			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
> +				"but only %i were allocated\n", need, i);
> +
> +			/* if exact number wasn't requested, stop */
> +			if (!wa->exact)
> +				goto out;
> +
> +			/* clean up */
> +			for (j = start_idx; j < cur_idx; j++) {
> +				struct rte_memseg *tmp;
> +				struct rte_fbarray *arr = &cur_msl->memseg_arr;
> +
> +				tmp = rte_fbarray_get(arr, j);
> +				rte_fbarray_set_free(arr, j);
> +
> +				if (free_seg(tmp))
> +					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
> +			}
> +			/* clear the list */
> +			if (wa->ms)
> +				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
> +
> +			return -1;
> +		}
> +		if (wa->ms)
> +			wa->ms[i] = cur;
> +
> +		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
> +	}
> +
> +out:
> +	wa->segs_allocated = i;
> +	if (i > 0)
> +		cur_msl->version++;
> +
> +	/* if we didn't allocate any segments, move on to the next list */
> +	return i > 0;
> +}
> +
> +struct free_walk_param {
> +	struct hugepage_info *hi;
> +	struct rte_memseg *ms;
> +};
> +static int
> +free_seg_walk(const struct rte_memseg_list *msl, void *arg)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	struct rte_memseg_list *found_msl;
> +	struct free_walk_param *wa = arg;
> +	uintptr_t start_addr, end_addr;
> +	int msl_idx, seg_idx, ret;
> +
> +	start_addr = (uintptr_t) msl->base_va;
> +	end_addr = start_addr + msl->len;
> +
> +	if ((uintptr_t)wa->ms->addr < start_addr ||
> +		(uintptr_t)wa->ms->addr >= end_addr)
> +		return 0;
> +
> +	msl_idx = msl - mcfg->memsegs;
> +	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
> +
> +	/* msl is const */
> +	found_msl = &mcfg->memsegs[msl_idx];
> +	found_msl->version++;
> +
> +	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
> +
> +	ret = free_seg(wa->ms);
> +
> +	return (ret < 0) ? (-1) : 1;
> +}
> +
> +int
> +eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
> +		size_t page_sz, int socket, bool exact)
> +{
> +	unsigned int i;
> +	int ret = -1;
> +	struct alloc_walk_param wa;
> +	struct hugepage_info *hi = NULL;
> +
> +	if (internal_config.legacy_mem) {
> +		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
> +		return -ENOTSUP;
> +	}
> +
> +	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
> +		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
> +		if (page_sz == hpi->hugepage_sz) {
> +			hi = hpi;
> +			break;
> +		}
> +	}
> +	if (!hi) {
> +		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
> +		return -1;
> +	}
> +
> +	memset(&wa, 0, sizeof(wa));
> +	wa.exact = exact;
> +	wa.hi = hi;
> +	wa.ms = ms;
> +	wa.n_segs = n_segs;
> +	wa.page_sz = page_sz;
> +	wa.socket = socket;
> +	wa.segs_allocated = 0;
> +
> +	/* memalloc is locked, so it's safe to use thread-unsafe version */
> +	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
> +	if (ret == 0) {
> +		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
> +		ret = -1;
> +	} else if (ret > 0) {
> +		ret = (int)wa.segs_allocated;
> +	}
> +
> +	return ret;
> +}
> +
> +struct rte_memseg *
> +eal_memalloc_alloc_seg(size_t page_sz, int socket)
> +{
> +	struct rte_memseg *ms = NULL;
> +	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
> +	return ms;
> +}
> +
> +int
> +eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
> +{
> +	int seg, ret = 0;
> +
> +	/* dynamic free not supported in legacy mode */
> +	if (internal_config.legacy_mem)
> +		return -1;
> +
> +	for (seg = 0; seg < n_segs; seg++) {
> +		struct rte_memseg *cur = ms[seg];
> +		struct hugepage_info *hi = NULL;
> +		struct free_walk_param wa;
> +		size_t i;
> +		int walk_res;
> +
> +		/* if this page is marked as unfreeable, fail */
> +		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
> +			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
> +			ret = -1;
> +			continue;
> +		}
> +
> +		memset(&wa, 0, sizeof(wa));
> +
> +		for (i = 0; i < RTE_DIM(internal_config.hugepage_info);
> +				i++) {
> +			hi = &internal_config.hugepage_info[i];
> +			if (cur->hugepage_sz == hi->hugepage_sz)
> +				break;
> +		}
> +		if (i == RTE_DIM(internal_config.hugepage_info)) {
> +			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
> +			ret = -1;
> +			continue;
> +		}
> +
> +		wa.ms = cur;
> +		wa.hi = hi;
> +
> +		/* memalloc is locked, so it's safe to use thread-unsafe version
> +		 */
> +		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
> +				&wa);
> +		if (walk_res == 1)
> +			continue;
> +		if (walk_res == 0)
> +			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
> +		ret = -1;
> +	}
> +	return ret;
> +}
> +
> +int
> +eal_memalloc_free_seg(struct rte_memseg *ms)
> +{
> +	return eal_memalloc_free_seg_bulk(&ms, 1);
> +}
> +
> +int
> +eal_memalloc_sync_with_primary(void)
> +{
> +	/* No multi-process support. */
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +eal_memalloc_init(void)
> +{
> +	/* No action required. */
> +	return 0;
> +}
> diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
> index 59606d84c..a9a35b7dc 100644
> --- a/lib/librte_eal/windows/eal_memory.c
> +++ b/lib/librte_eal/windows/eal_memory.c
> @@ -1,11 +1,23 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2010-2014 Intel Corporation (functions from Linux EAL)
> + * Copyright (c) 2020 Dmitry Kozlyuk (Windows specifics)
> + */
> +
> +#include <inttypes.h>
>  #include <io.h>
>  
>  #include <rte_errno.h>
>  #include <rte_memory.h>
>  
> +#include "eal_internal_cfg.h"
> +#include "eal_memalloc.h"
> +#include "eal_memcfg.h"
> +#include "eal_options.h"
>  #include "eal_private.h"
>  #include "eal_windows.h"
>  
> +#include <rte_virt2phys.h>
> +
>  /* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
>   * Provide a copy of definitions and code to load it dynamically.
>   * Note: definitions are copied verbatim from Microsoft documentation
> @@ -120,6 +132,119 @@ eal_mem_win32api_init(void)
>  
>  #endif /* no VirtualAlloc2() */
>  
> +static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
> +
> +int
> +eal_mem_virt2iova_init(void)
> +{
> +	HDEVINFO list = INVALID_HANDLE_VALUE;
> +	SP_DEVICE_INTERFACE_DATA ifdata;
> +	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
> +	DWORD detail_size;
> +	int ret = -1;
> +
> +	list = SetupDiGetClassDevs(
> +		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
> +		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
> +	if (list == INVALID_HANDLE_VALUE) {
> +		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
> +		goto exit;
> +	}
> +
> +	ifdata.cbSize = sizeof(ifdata);
> +	if (!SetupDiEnumDeviceInterfaces(
> +		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
> +		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
> +		goto exit;
> +	}
> +
> +	if (!SetupDiGetDeviceInterfaceDetail(
> +		list, &ifdata, NULL, 0, &detail_size, NULL)) {
> +		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
> +			RTE_LOG_WIN32_ERR(
> +				"SetupDiGetDeviceInterfaceDetail(probe)");
> +			goto exit;
> +		}
> +	}
> +
> +	detail = malloc(detail_size);
> +	if (detail == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
> +			"device interface detail data\n");
> +		goto exit;
> +	}
> +
> +	detail->cbSize = sizeof(*detail);
> +	if (!SetupDiGetDeviceInterfaceDetail(
> +		list, &ifdata, detail, detail_size, NULL, NULL)) {
> +		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
> +		goto exit;
> +	}
> +
> +	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
> +
> +	virt2phys_device = CreateFile(
> +		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
> +	if (virt2phys_device == INVALID_HANDLE_VALUE) {
> +		RTE_LOG_WIN32_ERR("CreateFile()");
> +		goto exit;
> +	}
> +
> +	/* Indicate success. */
> +	ret = 0;
> +
> +exit:
> +	if (detail != NULL)
> +		free(detail);
> +	if (list != INVALID_HANDLE_VALUE)
> +		SetupDiDestroyDeviceInfoList(list);
> +
> +	return ret;
> +}
> +
> +phys_addr_t
> +rte_mem_virt2phy(const void *virt)
> +{
> +	LARGE_INTEGER phys;
> +	DWORD bytes_returned;
> +
> +	if (virt2phys_device == INVALID_HANDLE_VALUE)
> +		return RTE_BAD_PHYS_ADDR;
> +
> +	if (!DeviceIoControl(
> +			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
> +			&virt, sizeof(virt), &phys, sizeof(phys),
> +			&bytes_returned, NULL)) {
> +		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
> +		return RTE_BAD_PHYS_ADDR;
> +	}
> +
> +	return phys.QuadPart;
> +}
> +
> +/* Windows currently only supports IOVA as PA. */
> +rte_iova_t
> +rte_mem_virt2iova(const void *virt)
> +{
> +	phys_addr_t phys;
> +
> +	if (virt2phys_device == INVALID_HANDLE_VALUE)
> +		return RTE_BAD_IOVA;
> +
> +	phys = rte_mem_virt2phy(virt);
> +	if (phys == RTE_BAD_PHYS_ADDR)
> +		return RTE_BAD_IOVA;
> +
> +	return (rte_iova_t)phys;
> +}
> +
> +/* Windows always uses physical addresses if they can be obtained. */
> +int
> +rte_eal_using_phys_addrs(void)
> +{
> +	return virt2phys_device != INVALID_HANDLE_VALUE;
> +}
> +
>  /* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
>  static int
>  win32_alloc_error_to_errno(DWORD code)
> @@ -360,7 +485,7 @@ rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
>  		return NULL;
>  	}
>  
> -	/* TODO: there is a race for the requested_addr between mem_free()
> +	/* There is a race for the requested_addr between mem_free()
>  	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
>  	 * region with a mapping in a single operation, but it does not support
>  	 * private mappings.
> @@ -410,6 +535,16 @@ rte_mem_unmap(void *virt, size_t size)
>  	return 0;
>  }
>  
> +uint64_t
> +eal_get_baseaddr(void)
> +{
> +	/* Windows strategy for memory allocation is undocumented.
> +	 * Returning 0 here effectively disables address guessing
> +	 * unless user provides an address hint.
> +	 */
> +	return 0;
> +}
> +
>  int
>  rte_get_page_size(void)
>  {
> @@ -431,3 +566,568 @@ rte_mem_lock(const void *virt, size_t size)
>  
>  	return 0;
>  }
> +
> +static int
> +memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
> +		int n_segs, int socket_id, int type_msl_idx)
> +{
> +	return eal_alloc_memseg_list(
> +		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
> +}
> +
> +static int
> +memseg_list_reserve(struct rte_memseg_list *msl)
> +{
> +	return eal_reserve_memseg_list(msl, 0);
> +}
> +
> +/*
> + * Remaining code in this file largely duplicates Linux EAL.
> + * Although Windows EAL supports only one hugepage size currently,
> + * code structure and comments are preserved so that changes may be
> + * easily ported until duplication is removed.
> + */
> +
> +static int
> +memseg_primary_init(void)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	struct memtype {
> +		uint64_t page_sz;
> +		int socket_id;
> +	} *memtypes = NULL;
> +	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
> +	struct rte_memseg_list *msl;
> +	uint64_t max_mem, max_mem_per_type;
> +	unsigned int max_seglists_per_type;
> +	unsigned int n_memtypes, cur_type;
> +
> +	/* no-huge does not need this at all */
> +	if (internal_config.no_hugetlbfs)
> +		return 0;
> +
> +	/*
> +	 * figuring out amount of memory we're going to have is a long and very
> +	 * involved process. the basic element we're operating with is a memory
> +	 * type, defined as a combination of NUMA node ID and page size (so that
> +	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
> +	 *
> +	 * deciding amount of memory going towards each memory type is a
> +	 * balancing act between maximum segments per type, maximum memory per
> +	 * type, and number of detected NUMA nodes. the goal is to make sure
> +	 * each memory type gets at least one memseg list.
> +	 *
> +	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
> +	 *
> +	 * the total amount of memory per type is limited by either
> +	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
> +	 * of detected NUMA nodes. additionally, maximum number of segments per
> +	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
> +	 * smaller page sizes, it can take hundreds of thousands of segments to
> +	 * reach the above specified per-type memory limits.
> +	 *
> +	 * additionally, each type may have multiple memseg lists associated
> +	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
> +	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
> +	 *
> +	 * the number of memseg lists per type is decided based on the above
> +	 * limits, and also taking number of detected NUMA nodes, to make sure
> +	 * that we don't run out of memseg lists before we populate all NUMA
> +	 * nodes with memory.
> +	 *
> +	 * we do this in three stages. first, we collect the number of types.
> +	 * then, we figure out memory constraints and populate the list of
> +	 * would-be memseg lists. then, we go ahead and allocate the memseg
> +	 * lists.
> +	 */
> +
> +	/* create space for mem types */
> +	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
> +	memtypes = calloc(n_memtypes, sizeof(*memtypes));
> +	if (memtypes == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
> +		return -1;
> +	}
> +
> +	/* populate mem types */
> +	cur_type = 0;
> +	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
> +			hpi_idx++) {
> +		struct hugepage_info *hpi;
> +		uint64_t hugepage_sz;
> +
> +		hpi = &internal_config.hugepage_info[hpi_idx];
> +		hugepage_sz = hpi->hugepage_sz;
> +
> +		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
> +			int socket_id = rte_socket_id_by_idx(i);
> +
> +			memtypes[cur_type].page_sz = hugepage_sz;
> +			memtypes[cur_type].socket_id = socket_id;
> +
> +			RTE_LOG(DEBUG, EAL, "Detected memory type: "
> +				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
> +				socket_id, hugepage_sz);
> +		}
> +	}
> +	/* number of memtypes could have been lower due to no NUMA support */
> +	n_memtypes = cur_type;
> +
> +	/* set up limits for types */
> +	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
> +	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
> +			max_mem / n_memtypes);
> +
> +	/*
> +	 * limit maximum number of segment lists per type to ensure there's
> +	 * space for memseg lists for all NUMA nodes with all page sizes
> +	 */
> +	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
> +
> +	if (max_seglists_per_type == 0) {
> +		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
> +			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> +		goto out;
> +	}
> +
> +	/* go through all mem types and create segment lists */
> +	msl_idx = 0;
> +	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
> +		unsigned int cur_seglist, n_seglists, n_segs;
> +		unsigned int max_segs_per_type, max_segs_per_list;
> +		struct memtype *type = &memtypes[cur_type];
> +		uint64_t max_mem_per_list, pagesz;
> +		int socket_id;
> +
> +		pagesz = type->page_sz;
> +		socket_id = type->socket_id;
> +
> +		/*
> +		 * we need to create segment lists for this type. we must take
> +		 * into account the following things:
> +		 *
> +		 * 1. total amount of memory we can use for this memory type
> +		 * 2. total amount of memory per memseg list allowed
> +		 * 3. number of segments needed to fit the amount of memory
> +		 * 4. number of segments allowed per type
> +		 * 5. number of segments allowed per memseg list
> +		 * 6. number of memseg lists we are allowed to take up
> +		 */
> +
> +		/* calculate how many segments we will need in total */
> +		max_segs_per_type = max_mem_per_type / pagesz;
> +		/* limit number of segments to maximum allowed per type */
> +		max_segs_per_type = RTE_MIN(max_segs_per_type,
> +				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
> +		/* limit number of segments to maximum allowed per list */
> +		max_segs_per_list = RTE_MIN(max_segs_per_type,
> +				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
> +
> +		/* calculate how much memory we can have per segment list */
> +		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
> +				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
> +
> +		/* calculate how many segments each segment list will have */
> +		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
> +
> +		/* calculate how many segment lists we can have */
> +		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
> +				max_mem_per_type / max_mem_per_list);
> +
> +		/* limit number of segment lists according to our maximum */
> +		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
> +
> +		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
> +				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
> +			n_seglists, n_segs, socket_id, pagesz);
> +
> +		/* create all segment lists */
> +		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
> +			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
> +				RTE_LOG(ERR, EAL,
> +					"No more space in memseg lists, please increase %s\n",
> +					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> +				goto out;
> +			}
> +			msl = &mcfg->memsegs[msl_idx++];
> +
> +			if (memseg_list_alloc(msl, pagesz, n_segs,
> +					socket_id, cur_seglist))
> +				goto out;
> +
> +			if (memseg_list_reserve(msl)) {
> +				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
> +				goto out;
> +			}
> +		}
> +	}
> +	/* we're successful */
> +	ret = 0;
> +out:
> +	free(memtypes);
> +	return ret;
> +}
> +
> +static int
> +memseg_secondary_init(void)
> +{
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_eal_memseg_init(void)
> +{
> +	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
> +		return memseg_primary_init();
> +	return memseg_secondary_init();
> +}
> +
> +static inline uint64_t
> +get_socket_mem_size(int socket)
> +{
> +	uint64_t size = 0;
> +	unsigned int i;
> +
> +	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
> +		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
> +		size += hpi->hugepage_sz * hpi->num_pages[socket];
> +	}
> +
> +	return size;
> +}
> +
> +static int
> +calc_num_pages_per_socket(uint64_t *memory,
> +		struct hugepage_info *hp_info,
> +		struct hugepage_info *hp_used,
> +		unsigned int num_hp_info)
> +{
> +	unsigned int socket, j, i = 0;
> +	unsigned int requested, available;
> +	int total_num_pages = 0;
> +	uint64_t remaining_mem, cur_mem;
> +	uint64_t total_mem = internal_config.memory;
> +
> +	if (num_hp_info == 0)
> +		return -1;
> +
> +	/* if specific memory amounts per socket weren't requested */
> +	if (internal_config.force_sockets == 0) {
> +		size_t total_size;
> +		int cpu_per_socket[RTE_MAX_NUMA_NODES];
> +		size_t default_size;
> +		unsigned int lcore_id;
> +
> +		/* Compute number of cores per socket */
> +		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
> +		RTE_LCORE_FOREACH(lcore_id) {
> +			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
> +		}
> +
> +		/*
> +		 * Automatically spread requested memory amongst detected
> +		 * sockets according to number of cores from cpu mask present
> +		 * on each socket.
> +		 */
> +		total_size = internal_config.memory;
> +		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
> +				socket++) {
> +
> +			/* Set memory amount per socket */
> +			default_size = internal_config.memory *
> +				cpu_per_socket[socket] / rte_lcore_count();
> +
> +			/* Limit to maximum available memory on socket */
> +			default_size = RTE_MIN(
> +				default_size, get_socket_mem_size(socket));
> +
> +			/* Update sizes */
> +			memory[socket] = default_size;
> +			total_size -= default_size;
> +		}
> +
> +		/*
> +		 * If some memory is remaining, try to allocate it by getting
> +		 * all available memory from sockets, one after the other.
> +		 */
> +		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
> +				socket++) {
> +			/* take whatever is available */
> +			default_size = RTE_MIN(
> +				get_socket_mem_size(socket) - memory[socket],
> +				total_size);
> +
> +			/* Update sizes */
> +			memory[socket] += default_size;
> +			total_size -= default_size;
> +		}
> +	}
> +
> +	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
> +			socket++) {
> +		/* skips if the memory on specific socket wasn't requested */
> +		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
> +			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
> +				sizeof(hp_used[i].hugedir));
> +			hp_used[i].num_pages[socket] = RTE_MIN(
> +					memory[socket] / hp_info[i].hugepage_sz,
> +					hp_info[i].num_pages[socket]);
> +
> +			cur_mem = hp_used[i].num_pages[socket] *
> +					hp_used[i].hugepage_sz;
> +
> +			memory[socket] -= cur_mem;
> +			total_mem -= cur_mem;
> +
> +			total_num_pages += hp_used[i].num_pages[socket];
> +
> +			/* check if we have met all memory requests */
> +			if (memory[socket] == 0)
> +				break;
> +
> +			/* Check if we have any more pages left at this size,
> +			 * if so, move on to next size.
> +			 */
> +			if (hp_used[i].num_pages[socket] ==
> +					hp_info[i].num_pages[socket])
> +				continue;
> +
> +			/* At this point we know that there are more pages
> +			 * available that are bigger than the memory we want,
> +			 * so let's see if we can get enough from other page
> +			 * sizes.
> +			 */
> +			remaining_mem = 0;
> +			for (j = i+1; j < num_hp_info; j++)
> +				remaining_mem += hp_info[j].hugepage_sz *
> +				hp_info[j].num_pages[socket];
> +
> +			/* Is there enough other memory?
> +			 * If not, allocate another page and quit.
> +			 */
> +			if (remaining_mem < memory[socket]) {
> +				cur_mem = RTE_MIN(
> +					memory[socket], hp_info[i].hugepage_sz);
> +				memory[socket] -= cur_mem;
> +				total_mem -= cur_mem;
> +				hp_used[i].num_pages[socket]++;
> +				total_num_pages++;
> +				break; /* we are done with this socket*/
> +			}
> +		}
> +		/* if we didn't satisfy all memory requirements per socket */
> +		if (memory[socket] > 0 &&
> +				internal_config.socket_mem[socket] != 0) {
> +			/* to prevent icc errors */
> +			requested = (unsigned int)(
> +				internal_config.socket_mem[socket] / 0x100000);
> +			available = requested -
> +				((unsigned int)(memory[socket] / 0x100000));
> +			RTE_LOG(ERR, EAL, "Not enough memory available on "
> +				"socket %u! Requested: %uMB, available: %uMB\n",
> +				socket, requested, available);
> +			return -1;
> +		}
> +	}
> +
> +	/* if we didn't satisfy total memory requirements */
> +	if (total_mem > 0) {
> +		requested = (unsigned int) (internal_config.memory / 0x100000);
> +		available = requested - (unsigned int) (total_mem / 0x100000);
> +		RTE_LOG(ERR, EAL, "Not enough memory available! "
> +			"Requested: %uMB, available: %uMB\n",
> +			requested, available);
> +		return -1;
> +	}
> +	return total_num_pages;
> +}
> +
> +/* Limit is checked by validator itself, nothing left to analyze. */
> +static int
> +limits_callback(int socket_id, size_t cur_limit, size_t new_len)
> +{
> +	RTE_SET_USED(socket_id);
> +	RTE_SET_USED(cur_limit);
> +	RTE_SET_USED(new_len);
> +	return -1;
> +}
> +
> +static int
> +eal_hugepage_init(void)
> +{
> +	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
> +	uint64_t memory[RTE_MAX_NUMA_NODES];
> +	int hp_sz_idx, socket_id;
> +
> +	memset(used_hp, 0, sizeof(used_hp));
> +
> +	for (hp_sz_idx = 0;
> +			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
> +			hp_sz_idx++) {
> +		/* also initialize used_hp hugepage sizes in used_hp */
> +		struct hugepage_info *hpi;
> +		hpi = &internal_config.hugepage_info[hp_sz_idx];
> +		/* also initialize hugepage sizes in used_hp */
> +	}
> +
> +	/* make a copy of socket_mem, needed for balanced allocation. */
> +	for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; socket_id++)
> +		memory[socket_id] = internal_config.socket_mem[socket_id];
> +
> +	/* calculate final number of pages */
> +	if (calc_num_pages_per_socket(memory,
> +			internal_config.hugepage_info, used_hp,
> +			internal_config.num_hugepage_sizes) < 0)
> +		return -1;
> +
> +	for (hp_sz_idx = 0;
> +			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
> +			hp_sz_idx++) {
> +		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
> +				socket_id++) {
> +			struct rte_memseg **pages;
> +			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
> +			unsigned int num_pages = hpi->num_pages[socket_id];
> +			unsigned int num_pages_alloc;
> +
> +			if (num_pages == 0)
> +				continue;
> +
> +			RTE_LOG(DEBUG, EAL,
> +				"Allocating %u pages of size %" PRIu64 "M on socket %i\n",
> +				num_pages, hpi->hugepage_sz >> 20, socket_id);
> +
> +			/* we may not be able to allocate all pages in one go,
> +			 * because we break up our memory map into multiple
> +			 * memseg lists. therefore, try allocating multiple
> +			 * times and see if we can get the desired number of
> +			 * pages from multiple allocations.
> +			 */
> +
> +			num_pages_alloc = 0;
> +			do {
> +				int i, cur_pages, needed;
> +
> +				needed = num_pages - num_pages_alloc;
> +
> +				pages = malloc(sizeof(*pages) * needed);
> +
> +				/* do not request exact number of pages */
> +				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
> +						needed, hpi->hugepage_sz,
> +						socket_id, false);
> +				if (cur_pages <= 0) {
> +					free(pages);
> +					return -1;
> +				}
> +
> +				/* mark preallocated pages as unfreeable */
> +				for (i = 0; i < cur_pages; i++) {
> +					struct rte_memseg *ms = pages[i];
> +					ms->flags |=
> +						RTE_MEMSEG_FLAG_DO_NOT_FREE;
> +				}
> +				free(pages);
> +
> +				num_pages_alloc += cur_pages;
> +			} while (num_pages_alloc != num_pages);
> +		}
> +	}
> +	/* if socket limits were specified, set them */
> +	if (internal_config.force_socket_limits) {
> +		unsigned int i;
> +		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
> +			uint64_t limit = internal_config.socket_limit[i];
> +			if (limit == 0)
> +				continue;
> +			if (rte_mem_alloc_validator_register("socket-limit",
> +					limits_callback, i, limit))
> +				RTE_LOG(ERR, EAL, "Failed to register socket "
> +					"limits validator callback\n");
> +		}
> +	}
> +	return 0;
> +}
> +
> +static int
> +eal_nohuge_init(void)
> +{
> +	struct rte_mem_config *mcfg;
> +	struct rte_memseg_list *msl;
> +	int n_segs, cur_seg;
> +	uint64_t page_sz;
> +	void *addr;
> +	struct rte_fbarray *arr;
> +	struct rte_memseg *ms;
> +
> +	mcfg = rte_eal_get_configuration()->mem_config;
> +
> +	/* nohuge mode is legacy mode */
> +	internal_config.legacy_mem = 1;
> +
> +	/* create a memseg list */
> +	msl = &mcfg->memsegs[0];
> +
> +	page_sz = RTE_PGSIZE_4K;
> +	n_segs = internal_config.memory / page_sz;
> +
> +	if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
> +		sizeof(struct rte_memseg))) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
> +		return -1;
> +	}
> +
> +	addr = eal_mem_alloc(internal_config.memory, 0);
> +	if (addr == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate %zu bytes\n",
> +			internal_config.memory);
> +		return -1;
> +	}
> +
> +	msl->base_va = addr;
> +	msl->page_sz = page_sz;
> +	msl->socket_id = 0;
> +	msl->len = internal_config.memory;
> +	msl->heap = 1;
> +
> +	/* populate memsegs. each memseg is one page long */
> +	for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
> +		arr = &msl->memseg_arr;
> +
> +		ms = rte_fbarray_get(arr, cur_seg);
> +		ms->iova = RTE_BAD_IOVA;
> +		ms->addr = addr;
> +		ms->hugepage_sz = page_sz;
> +		ms->socket_id = 0;
> +		ms->len = page_sz;
> +
> +		rte_fbarray_set_used(arr, cur_seg);
> +
> +		addr = RTE_PTR_ADD(addr, (size_t)page_sz);
> +	}
> +
> +	if (mcfg->dma_maskbits &&
> +		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
> +		RTE_LOG(ERR, EAL,
> +			"%s(): couldn't allocate memory due to IOVA "
> +			"exceeding limits of current DMA mask.\n", __func__);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_eal_hugepage_init(void)
> +{
> +	return internal_config.no_hugetlbfs ?
> +		eal_nohuge_init() : eal_hugepage_init();
> +}
> +
> +int
> +rte_eal_hugepage_attach(void)
> +{
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
> new file mode 100644
> index 000000000..16a5e8ba0
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_mp.c
> @@ -0,0 +1,103 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +/**
> + * @file Multiprocess support stubs
> + *
> + * Stubs must log an error until implemented. If success is required
> + * for non-multiprocess operation, stub must log a warning and a comment
> + * must document what requires success emulation.
> + */
> +
> +#include <rte_eal.h>
> +#include <rte_errno.h>
> +
> +#include "eal_private.h"
> +#include "eal_windows.h"
> +#include "malloc_mp.h"
> +
> +void
> +rte_mp_channel_cleanup(void)
> +{
> +	EAL_LOG_NOT_IMPLEMENTED();
> +}
> +
> +int
> +rte_mp_action_register(const char *name, rte_mp_t action)
> +{
> +	RTE_SET_USED(name);
> +	RTE_SET_USED(action);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +void
> +rte_mp_action_unregister(const char *name)
> +{
> +	RTE_SET_USED(name);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +}
> +
> +int
> +rte_mp_sendmsg(struct rte_mp_msg *msg)
> +{
> +	RTE_SET_USED(msg);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
> +	const struct timespec *ts)
> +{
> +	RTE_SET_USED(req);
> +	RTE_SET_USED(reply);
> +	RTE_SET_USED(ts);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
> +		rte_mp_async_reply_t clb)
> +{
> +	RTE_SET_USED(req);
> +	RTE_SET_USED(ts);
> +	RTE_SET_USED(clb);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
> +{
> +	RTE_SET_USED(msg);
> +	RTE_SET_USED(peer);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +register_mp_requests(void)
> +{
> +	/* Non-stub function succeeds if multi-process is not supported. */
> +	EAL_LOG_STUB();
> +	return 0;
> +}
> +
> +int
> +request_to_primary(struct malloc_mp_req *req)
> +{
> +	RTE_SET_USED(req);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +request_sync(void)
> +{
> +	/* Common memory allocator depends on this function success. */
> +	EAL_LOG_STUB();
> +	return 0;
> +}
> diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
> index b202a1aa5..083ab8b93 100644
> --- a/lib/librte_eal/windows/eal_windows.h
> +++ b/lib/librte_eal/windows/eal_windows.h
> @@ -9,8 +9,24 @@
>   * @file Facilities private to Windows EAL
>   */
>  
> +#include <rte_errno.h>
>  #include <rte_windows.h>
>  
> +/**
> + * Log current function as not implemented and set rte_errno.
> + */
> +#define EAL_LOG_NOT_IMPLEMENTED() \
> +	do { \
> +		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
> +		rte_errno = ENOTSUP; \
> +	} while (0)
> +
> +/**
> + * Log current function as a stub.
> + */
> +#define EAL_LOG_STUB() \
> +	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
> +
>  /**
>   * Create a map of processors and cores on the system.
>   */
> @@ -36,6 +52,13 @@ int eal_thread_create(pthread_t *thread);
>   */
>  unsigned int eal_socket_numa_node(unsigned int socket_id);
>  
> +/**
> + * Open virt2phys driver interface device.
> + *
> + * @return 0 on success, (-1) on failure.
> + */
> +int eal_mem_virt2iova_init(void);
> +
>  /**
>   * Locate Win32 memory management routines in system libraries.
>   *
> diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
> index 5fb1962ac..b3534b025 100644
> --- a/lib/librte_eal/windows/include/meson.build
> +++ b/lib/librte_eal/windows/include/meson.build
> @@ -5,5 +5,6 @@ includes += include_directories('.')
>  
>  headers += files(
>          'rte_os.h',
> +        'rte_virt2phys.h',
>          'rte_windows.h',
>  )
> diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
> index 510e39e03..62805a307 100644
> --- a/lib/librte_eal/windows/include/rte_os.h
> +++ b/lib/librte_eal/windows/include/rte_os.h
> @@ -36,6 +36,10 @@ extern "C" {
>  
>  #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
>  
> +#define open _open
> +#define close _close
> +#define unlink _unlink
> +
>  /* cpu_set macros implementation */
>  #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
>  #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
> diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
> new file mode 100644
> index 000000000..4bb2b4aaf
> --- /dev/null
> +++ b/lib/librte_eal/windows/include/rte_virt2phys.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +/**
> + * @file virt2phys driver interface
> + */
> +
> +/**
> + * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
> + */
> +DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
> +	0x539c2135, 0x793a, 0x4926,
> +	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
> +
> +/**
> + * Driver device type for IO control codes.
> + */
> +#define VIRT2PHYS_DEVTYPE 0x8000
> +
> +/**
> + * Translate a valid non-paged virtual address to a physical address.
> + *
> + * Note: A physical address zero (0) is reported if input address
> + * is paged out or not mapped. However, if input is a valid mapping
> + * of I/O port 0x0000, output is also zero. There is no way
> + * to distinguish between these cases by return value only.
> + *
> + * Input: a non-paged virtual address (PVOID).
> + *
> + * Output: the corresponding physical address (LARGE_INTEGER).
> + */
> +#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
> +	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
> diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
> index ed6e4c148..899ed7d87 100644
> --- a/lib/librte_eal/windows/include/rte_windows.h
> +++ b/lib/librte_eal/windows/include/rte_windows.h
> @@ -23,6 +23,8 @@
>  
>  #include <basetsd.h>
>  #include <psapi.h>
> +#include <setupapi.h>
> +#include <winioctl.h>
>  
>  /* Have GUIDs defined. */
>  #ifndef INITGUID
> diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
> index 757b7f3c5..6b33005b2 100644
> --- a/lib/librte_eal/windows/include/unistd.h
> +++ b/lib/librte_eal/windows/include/unistd.h
> @@ -9,4 +9,7 @@
>   * as Microsoft libc does not contain unistd.h. This may be removed
>   * in future releases.
>   */
> +
> +#include <io.h>
> +
>  #endif /* _UNISTD_H_ */
> diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
> index 81d3ee095..2f4fa91a9 100644
> --- a/lib/librte_eal/windows/meson.build
> +++ b/lib/librte_eal/windows/meson.build
> @@ -8,7 +8,9 @@ sources += files(
>  	'eal_debug.c',
>  	'eal_hugepages.c',
>  	'eal_lcore.c',
> +	'eal_memalloc.c',
>  	'eal_memory.c',
> +	'eal_mp.c',
>  	'eal_thread.c',
>  	'getopt.c',
>  )
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management
  2020-04-10 22:04     ` Narcisa Ana Maria Vasile
@ 2020-04-11  1:16       ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-11  1:16 UTC (permalink / raw)
  To: Narcisa Ana Maria Vasile
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic,
	Anatoly Burakov, Bruce Richardson

[snip]
> > +static int
> > +alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
> > +	struct hugepage_info *hi)
> > +{
> > +	HANDLE current_process;
> > +	unsigned int numa_node;
> > +	size_t alloc_sz;
> > +	void *addr;
> > +	rte_iova_t iova = RTE_BAD_IOVA;
> > +	PSAPI_WORKING_SET_EX_INFORMATION info;
> > +	PSAPI_WORKING_SET_EX_BLOCK *page;
> > +
> > +	if (ms->len > 0) {
> > +		/* If a segment is already allocated as needed, return it. */
> > +		if ((ms->addr == requested_addr) &&
> > +			(ms->socket_id == socket_id) &&
> > +			(ms->hugepage_sz == hi->hugepage_sz)) {
> > +			return 0;
> > +		}
> > +
> > +		/* Bugcheck, should not happen. */
> > +		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
> > +			"(size %zu) on socket %d", ms->addr,
> > +			ms->len, ms->socket_id);
> > +		return -1;
> > +	}
> > +
> > +	current_process = GetCurrentProcess();
> > +	numa_node = eal_socket_numa_node(socket_id);
> > +	alloc_sz = hi->hugepage_sz;
> > +
> > +	if (requested_addr == NULL) {
> > +		/* Request a new chunk of memory and enforce address hint. */  
> 
> Does requested_addr being NULL mean that no hint was provided? It also looks like eal_mem_alloc_socket
> ignores the address hint anyway and just calls VirtualAllocExNuma with NULL for lpAddress. Maybe remove
> the second part of the comment.

Correct, also see below.

> 
> > +		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
> > +		if (addr == NULL) {
> > +			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
> > +				"on socket %d\n", alloc_sz, socket_id);
> > +			return -1;
> > +		}
> > +
> > +		if (addr != requested_addr) {  
> 
> requested_addr is NULL on this branch, and the previous 'if' confirmed that addr is not NULL.
> Should this branch be removed, since requested_addr is NULL and thus no hint was provided?

Good catch, this check doesn't belong here. In fact, apart from an allocation
failure, the address hint can only go unrespected if it is not aligned to a
hugepage boundary, which would be an allocator bug (use of an incorrect MSL).

> > +			RTE_LOG(DEBUG, EAL, "Address hint %p not respected, "
> > +				"got %p\n", requested_addr, addr);
> > +			goto error;
> > +		}
> > +	} else {
> > +		/* Requested address is already reserved, commit memory. */
> > +		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
> > +		if (addr == NULL) {
> > +			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
> > +				"(size %zu)\n", requested_addr, alloc_sz);
> > +			goto error;  
> 
> Execution jumps to 'error' with an invalid addr, so it will try to call eal_mem_decommit with NULL as a parameter.
> Instead of 'goto error', maybe we should return here.

Will fix in v3.
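
The fix will be roughly as follows (a sketch, not the final v3 code):

	addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
	if (addr == NULL) {
		RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
			"(size %zu)\n", requested_addr, alloc_sz);
		/* Nothing has been committed yet, so there is nothing
		 * to decommit; fail directly instead of "goto error".
		 */
		return -1;
	}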

I'd like to also clarify the following. On Windows, eal_mem_commit() is not
atomic: it may first split a reserved region, then commit it. If splitting
succeeds and commit fails, the reserved region remains split into parts.
eal_mem_decommit() does not coalesce these parts intentionally: they're
hugepage-sized, so the next eal_mem_commit() on the same region just doesn't
have to split it.
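
In Win32 terms the commit path is roughly the following (a sketch; it
assumes the placeholder API available since Windows 10 1803, namely
MEM_PRESERVE_PLACEHOLDER and MEM_REPLACE_PLACEHOLDER):

	/* Split the enclosing placeholder so that [addr, addr + size)
	 * becomes a standalone placeholder region.
	 */
	if (!VirtualFree(addr, size,
			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER))
		return NULL;

	/* Commit memory into the split-off placeholder. If this fails,
	 * the region stays split; the next commit of the same
	 * hugepage-sized range then skips the split step above.
	 */
	return VirtualAlloc2(GetCurrentProcess(), addr, size,
		MEM_RESERVE | MEM_COMMIT | MEM_REPLACE_PLACEHOLDER,
		PAGE_READWRITE, NULL, 0);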

> 
> > +		}
> > +	}
> > +
> > +	/* Force OS to allocate a physical page and select a NUMA node.
> > +	 * Hugepages are not pageable in Windows, so there's no race
> > +	 * for physical address.
> > +	 */
> > +	*(volatile int *)addr = *(volatile int *)addr;
> > +
> > +	/* Only try to obtain IOVA if it's available, so that applications
> > +	 * that do not need IOVA can use this allocator.
> > +	 */
> > +	if (rte_eal_using_phys_addrs()) {
> > +		iova = rte_mem_virt2iova(addr);
> > +		if (iova == RTE_BAD_IOVA) {
> > +			RTE_LOG(DEBUG, EAL,
> > +				"Cannot get IOVA of allocated segment\n");
> > +			goto error;
> > +		}
> > +	}
> > +
> > +	/* Only "Ex" function can handle hugepages. */
> > +	info.VirtualAddress = addr;
> > +	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
> > +		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
> > +		goto error;
> > +	}
> > +
> > +	page = &info.VirtualAttributes;
> > +	if (!page->Valid || !page->LargePage) {
> > +		RTE_LOG(DEBUG, EAL, "Got regular page instead of hugepage\n");
> > +		goto error;
> > +	}
> > +	if (page->Node != numa_node) {
> > +		RTE_LOG(DEBUG, EAL,
> > +			"NUMA node hint %u (socket %d) not respected, got %u\n",
> > +			numa_node, socket_id, page->Node);
> > +		goto error;
> > +	}
> > +
> > +	ms->addr = addr;
> > +	ms->hugepage_sz = hi->hugepage_sz;
> > +	ms->len = alloc_sz;
> > +	ms->nchannel = rte_memory_get_nchannel();
> > +	ms->nrank = rte_memory_get_nrank();
> > +	ms->iova = iova;
> > +	ms->socket_id = socket_id;
> > +
> > +	return 0;
> > +
> > +error:
> > +	/* Only jump here when `addr` and `alloc_sz` are valid. */
> > +	eal_mem_decommit(addr, alloc_sz);
> > +	return -1;
> > +}
[snip]

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
@ 2020-04-13  5:32     ` Ranjit Menon
  0 siblings, 0 replies; 218+ messages in thread
From: Ranjit Menon @ 2020-04-13  5:32 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman

On 4/10/2020 9:43 AM, Dmitry Kozlyuk wrote:
> This driver supports Windows EAL memory management by translating
> current process virtual addresses to physical addresses (IOVA).
> Standalone virt2phys allows using DPDK without PMD and provides a
> reference implementation.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>

Reviewed-by: Ranjit Menon <ranjit.menon@intel.com>
Acked-by: Ranjit Menon <ranjit.menon@intel.com>

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/10] eal: introduce memory management wrappers
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-04-13  7:50     ` Tal Shnaiderman
  0 siblings, 0 replies; 218+ messages in thread
From: Tal Shnaiderman @ 2020-04-13  7:50 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Thomas Monjalon,
	Anatoly Burakov, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

> Subject: [PATCH v2 06/10] eal: introduce memory management wrappers
> 
> System memory management is implemented differently for POSIX and
> Windows. Introduce wrapper functions for operations used across DPDK:
> 
> * rte_mem_map()
>   Create memory mapping for a regular file or a page file (swap).
>   This supports mapping to a reserved memory region even on Windows.
> 
> * rte_mem_unmap()
>   Remove mapping created with rte_mem_map().
> 
> * rte_get_page_size()
>   Obtain default system page size.
> 
> * rte_mem_lock()
>   Make arbitrary-sized memory region non-swappable.
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their signatures
> deliberately differ from POSIX ones to be safer and more expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<Snip!>

> +void *
> +rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot
> prot,
> +	enum rte_map_flags flags, int fd, size_t offset) {
> +	int sys_prot = 0;
> +	int sys_flags = 0;
> +
> +	sys_prot = mem_rte_to_sys_prot(prot);
> +
> +	if (flags & RTE_MAP_SHARED)
> +		sys_flags |= MAP_SHARED;
> +	if (flags & RTE_MAP_ANONYMOUS)
> +		sys_flags |= MAP_ANONYMOUS;
> +	if (flags & RTE_MAP_PRIVATE)
> +		sys_flags |= MAP_PRIVATE;
> +	if (flags & RTE_MAP_FIXED)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, sys_prot, sys_flags, fd,
> +			offset);
> +}

It looks like the Windows and Unix implementations of rte_mem_map() behave
differently on memory mapping failure: on Windows the function returns NULL,
while on Unix the return value is the MAP_FAILED value of mmap(), which is
((void *)-1).
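
A portable caller currently has to check both values, e.g. (hypothetical
application code):

	void *va = rte_mem_map(NULL, size, prot, flags, fd, 0);
	if (va == NULL || va == MAP_FAILED)
		return -1;

Making the Unix implementation return NULL on failure as well would remove
this discrepancy.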

^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 00/10] Windows basic memory management
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (9 preceding siblings ...)
  2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-04-14 19:44   ` Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
                       ` (10 more replies)
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
  11 siblings, 11 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk

Note: no changes in cover letter since v2.

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing with IOVA unavailable.


The first commit introduces a new kernel-mode driver, virt2phys.
It translates user-mode virtual addresses into physical addresses.
On Windows community call 2020-04-01 it was decided this driver can be
used for now, later netUIO may pick up its code/interface or not.


New EAL public functions for memory mapping are introduced
to mitigate OS differences in DPDK libraries and applications:

* rte_mem_map
* rte_mem_unmap
* rte_mem_lock
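
Typical usage is expected to look like this (an illustrative sketch;
the RTE_PROT_* names here are assumptions of this example, exact
prototypes are in the "memory management wrappers" patch):

	void *va = rte_mem_map(NULL, len,
			RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
	if (va != NULL && rte_mem_lock(va, len) < 0)
		rte_mem_unmap(va, len);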

To support common MM routines, internal wrappers for low-level
memory reservation and file management are introduced. These changes
affect Linux and FreeBSD EAL. Shared code is placed under the /unix/
subdirectory (suggested by Thomas).

Also, the entire <sys/queue.h> is imported from FreeBSD, replacing the existing
partial import. There is already a license exception for this file.


Windows MM duplicates quite a lot of code from Linux EAL:

* eal_memalloc_alloc_seg_bulk
* eal_memalloc_free_seg_bulk
* calc_num_pages_per_socket
* rte_eal_hugepage_init

Perhaps this should be left as-is until Windows MM evolves into having
some specific requirements for these parts.


Notes on checkpatch warnings:

* No space after comma / no space before closing parent in macros---
  definitely a false-positive, unclear how to suppress this.

* Issues from imported BSD code---probably should be ignored?

* Checkpatch is not run against dpdk-kmods (Windows drivers).

---

v3:

    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:

    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.

Dmitry Kozlyuk (9):
  eal/windows: do not expose private EAL facilities
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal: extract common code for memseg list initialization
  eal/windows: fix rte_page_sizes with Clang on Windows
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: implement basic memory management

 config/meson.build                            |   12 +-
 doc/guides/windows_gsg/build_dpdk.rst         |   20 -
 doc/guides/windows_gsg/index.rst              |    1 +
 doc/guides/windows_gsg/run_apps.rst           |   84 ++
 lib/librte_eal/common/eal_common_fbarray.c    |   57 +-
 lib/librte_eal/common/eal_common_memory.c     |  104 +-
 lib/librte_eal/common/eal_private.h           |  134 +-
 lib/librte_eal/common/malloc_heap.c           |    1 +
 lib/librte_eal/common/meson.build             |    9 +
 lib/librte_eal/freebsd/eal_memory.c           |   55 +-
 lib/librte_eal/include/rte_memory.h           |   74 ++
 lib/librte_eal/linux/eal_memory.c             |   68 +-
 lib/librte_eal/meson.build                    |    4 +
 lib/librte_eal/rte_eal_exports.def            |  119 ++
 lib/librte_eal/rte_eal_version.map            |    4 +
 lib/librte_eal/unix/eal.c                     |   47 +
 lib/librte_eal/unix/eal_memory.c              |  113 ++
 lib/librte_eal/unix/meson.build               |    7 +
 lib/librte_eal/windows/eal.c                  |  160 +++
 lib/librte_eal/windows/eal_hugepages.c        |  108 ++
 lib/librte_eal/windows/eal_lcore.c            |  187 ++-
 lib/librte_eal/windows/eal_memalloc.c         |  418 ++++++
 lib/librte_eal/windows/eal_memory.c           | 1141 +++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               |  103 ++
 lib/librte_eal/windows/eal_thread.c           |    1 +
 lib/librte_eal/windows/eal_windows.h          |  129 ++
 lib/librte_eal/windows/include/meson.build    |    2 +
 lib/librte_eal/windows/include/pthread.h      |    2 +
 lib/librte_eal/windows/include/rte_os.h       |   48 +-
 .../windows/include/rte_virt2phys.h           |   34 +
 lib/librte_eal/windows/include/rte_windows.h  |   43 +
 lib/librte_eal/windows/include/sys/queue.h    |  663 +++++++++-
 lib/librte_eal/windows/include/unistd.h       |    3 +
 lib/librte_eal/windows/meson.build            |    6 +
 34 files changed, 3611 insertions(+), 350 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/unix/eal.c
 create mode 100644 lib/librte_eal/unix/eal_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/eal_windows.h
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h
 create mode 100644 lib/librte_eal/windows/include/rte_windows.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-14 23:35       ` Ranjit Menon
  2020-04-21  6:23       ` Ophir Munk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
                       ` (9 subsequent siblings)
  10 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Ranjit Menon

This driver supports Windows EAL memory management by translating
current process virtual addresses to physical addresses (IOVA).
Standalone virt2phys allows using DPDK without PMD and provides a
reference implementation.

Suggested-by: Ranjit Menon <ranjit.menon@intel.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 windows/README.rst                          | 103 +++++++++
 windows/virt2phys/virt2phys.c               | 129 +++++++++++
 windows/virt2phys/virt2phys.h               |  34 +++
 windows/virt2phys/virt2phys.inf             |  64 ++++++
 windows/virt2phys/virt2phys.sln             |  27 +++
 windows/virt2phys/virt2phys.vcxproj         | 228 ++++++++++++++++++++
 windows/virt2phys/virt2phys.vcxproj.filters |  36 ++++
 7 files changed, 621 insertions(+)
 create mode 100644 windows/README.rst
 create mode 100644 windows/virt2phys/virt2phys.c
 create mode 100644 windows/virt2phys/virt2phys.h
 create mode 100644 windows/virt2phys/virt2phys.inf
 create mode 100644 windows/virt2phys/virt2phys.sln
 create mode 100644 windows/virt2phys/virt2phys.vcxproj
 create mode 100644 windows/virt2phys/virt2phys.vcxproj.filters

diff --git a/windows/README.rst b/windows/README.rst
new file mode 100644
index 0000000..45a1d80
--- /dev/null
+++ b/windows/README.rst
@@ -0,0 +1,103 @@
+Developing Windows Drivers
+==========================
+
+Prerequisites
+-------------
+
+Building Windows Drivers is only possible on Windows.
+
+1. Visual Studio 2019 Community or Professional Edition
+2. Windows Driver Kit (WDK) for Windows 10, version 1903
+
+Follow the official instructions to obtain all of the above:
+https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
+
+
+Build the Drivers
+-----------------
+
+Build from Visual Studio
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Open a solution (``*.sln``) with Visual Studio and build it (Ctrl+Shift+B).
+
+
+Build from Command-Line
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Run *Developer Command Prompt for VS 2019* from the Start menu.
+
+Navigate to the solution directory (with ``*.sln``), then run:
+
+.. code-block:: console
+
+    msbuild
+
+To build a particular combination of configuration and platform:
+
+.. code-block:: console
+
+    msbuild -p:Configuration=Debug;Platform=x64
+
+
+Install the Drivers
+-------------------
+
+Disable Driver Signature Enforcement
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, Windows prohibits installing and loading drivers without a `digital
+signature`_ obtained from Microsoft. For development, signature enforcement may
+be disabled as follows.
+
+In Elevated Command Prompt (from this point, sufficient privileges are
+assumed):
+
+.. code-block:: console
+
+    bcdedit -set loadoptions DISABLE_INTEGRITY_CHECKS
+    bcdedit -set TESTSIGNING ON
+    shutdown -r -t 0
+
+Upon reboot, an overlay message should appear on the desktop informing
+that Windows is in test mode, which means it allows loading unsigned drivers.
+
+.. _digital signature: https://docs.microsoft.com/en-us/windows-hardware/drivers/install/driver-signing
+
+Install, List, and Remove Drivers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The driver package is by default located in a subdirectory of its source tree,
+e.g. ``x64\Debug\virt2phys\virt2phys`` (note two levels of ``virt2phys``).
+
+To install the driver and bind associated devices to it:
+
+.. code-block:: console
+
+    pnputil /add-driver x64\Debug\virt2phys\virt2phys\virt2phys.inf /install
+
+A graphical confirmation to load an unsigned driver will still appear.
+
+On Windows Server additional steps are required if the driver uses a custom
+setup class:
+
+1. From "Device Manager", "Action" menu, select "Add legacy hardware".
+2. It will launch the "Add Hardware Wizard". Click "Next".
+3. Select second option "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, locate the driver device class.
+5. Select it, and click "Next".
+6. The previously installed drivers will now be installed for
+   the appropriate devices (software devices will be created).
+
+To list installed drivers:
+
+.. code-block:: console
+
+    pnputil /enum-drivers
+
+To remove the driver package and to uninstall its devices:
+
+.. code-block:: console
+
+    pnputil /delete-driver oem2.inf /uninstall
diff --git a/windows/virt2phys/virt2phys.c b/windows/virt2phys/virt2phys.c
new file mode 100644
index 0000000..e157e9c
--- /dev/null
+++ b/windows/virt2phys/virt2phys.c
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <ntddk.h>
+#include <wdf.h>
+#include <wdmsec.h>
+#include <initguid.h>
+
+#include "virt2phys.h"
+
+DRIVER_INITIALIZE DriverEntry;
+EVT_WDF_DRIVER_DEVICE_ADD virt2phys_driver_EvtDeviceAdd;
+EVT_WDF_IO_IN_CALLER_CONTEXT virt2phys_device_EvtIoInCallerContext;
+
+NTSTATUS
+DriverEntry(
+	IN PDRIVER_OBJECT driver_object, IN PUNICODE_STRING registry_path)
+{
+	WDF_DRIVER_CONFIG config;
+	WDF_OBJECT_ATTRIBUTES attributes;
+	NTSTATUS status;
+
+	PAGED_CODE();
+
+	WDF_DRIVER_CONFIG_INIT(&config, virt2phys_driver_EvtDeviceAdd);
+	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
+	status = WdfDriverCreate(
+			driver_object, registry_path,
+			&attributes, &config, WDF_NO_HANDLE);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDriverCreate() failed, status=%08x\n", status));
+	}
+
+	return status;
+}
+
+_Use_decl_annotations_
+NTSTATUS
+virt2phys_driver_EvtDeviceAdd(
+	WDFDRIVER driver, PWDFDEVICE_INIT init)
+{
+	WDF_OBJECT_ATTRIBUTES attributes;
+	WDFDEVICE device;
+	NTSTATUS status;
+
+	UNREFERENCED_PARAMETER(driver);
+
+	PAGED_CODE();
+
+	WdfDeviceInitSetIoType(
+		init, WdfDeviceIoNeither);
+	WdfDeviceInitSetIoInCallerContextCallback(
+		init, virt2phys_device_EvtIoInCallerContext);
+
+	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
+
+	status = WdfDeviceCreate(&init, &attributes, &device);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDeviceCreate() failed, status=%08x\n", status));
+		return status;
+	}
+
+	status = WdfDeviceCreateDeviceInterface(
+			device, &GUID_DEVINTERFACE_VIRT2PHYS, NULL);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfDeviceCreateDeviceInterface() failed, "
+			"status=%08x\n", status));
+		return status;
+	}
+
+	return STATUS_SUCCESS;
+}
+
+_Use_decl_annotations_
+VOID
+virt2phys_device_EvtIoInCallerContext(
+	IN WDFDEVICE device, IN WDFREQUEST request)
+{
+	WDF_REQUEST_PARAMETERS params;
+	ULONG code;
+	PVOID *virt;
+	PHYSICAL_ADDRESS *phys;
+	size_t size;
+	NTSTATUS status;
+
+	UNREFERENCED_PARAMETER(device);
+
+	PAGED_CODE();
+
+	WDF_REQUEST_PARAMETERS_INIT(&params);
+	WdfRequestGetParameters(request, &params);
+
+	if (params.Type != WdfRequestTypeDeviceControl) {
+		KdPrint(("bogus request type=%u\n", params.Type));
+		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
+		return;
+	}
+
+	code = params.Parameters.DeviceIoControl.IoControlCode;
+	if (code != IOCTL_VIRT2PHYS_TRANSLATE) {
+		KdPrint(("bogus IO control code=%lu\n", code));
+		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
+		return;
+	}
+
+	status = WdfRequestRetrieveInputBuffer(
+			request, sizeof(*virt), (PVOID *)&virt, &size);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfRequestRetrieveInputBuffer() failed, "
+			"status=%08x\n", status));
+		WdfRequestComplete(request, status);
+		return;
+	}
+
+	status = WdfRequestRetrieveOutputBuffer(
+		request, sizeof(*phys), (PVOID *)&phys, &size);
+	if (!NT_SUCCESS(status)) {
+		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
+			"status=%08x\n", status));
+		WdfRequestComplete(request, status);
+		return;
+	}
+
+	*phys = MmGetPhysicalAddress(*virt);
+
+	WdfRequestCompleteWithInformation(
+		request, STATUS_SUCCESS, sizeof(*phys));
+}
diff --git a/windows/virt2phys/virt2phys.h b/windows/virt2phys/virt2phys.h
new file mode 100644
index 0000000..4bb2b4a
--- /dev/null
+++ b/windows/virt2phys/virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
diff --git a/windows/virt2phys/virt2phys.inf b/windows/virt2phys/virt2phys.inf
new file mode 100644
index 0000000..e35765e
--- /dev/null
+++ b/windows/virt2phys/virt2phys.inf
@@ -0,0 +1,64 @@
+; SPDX-License-Identifier: BSD-3-Clause
+; Copyright (c) 2020 Dmitry Kozlyuk
+
+[Version]
+Signature = "$WINDOWS NT$"
+Class = %ClassName%
+ClassGuid = {78A1C341-4539-11d3-B88D-00C04FAD5171}
+Provider = %ManufacturerName%
+CatalogFile = virt2phys.cat
+DriverVer =
+
+[DestinationDirs]
+DefaultDestDir = 12
+
+; ================= Class section =====================
+
+[ClassInstall32]
+Addreg = virt2phys_ClassReg
+
+[virt2phys_ClassReg]
+HKR,,,0,%ClassName%
+HKR,,Icon,,-5
+
+[SourceDisksNames]
+1 = %DiskName%,,,""
+
+[SourceDisksFiles]
+virt2phys.sys  = 1,,
+
+;*****************************************
+; Install Section
+;*****************************************
+
+[Manufacturer]
+%ManufacturerName%=Standard,NT$ARCH$
+
+[Standard.NT$ARCH$]
+%virt2phys.DeviceDesc%=virt2phys_Device, Root\virt2phys
+
+[virt2phys_Device.NT]
+CopyFiles = Drivers_Dir
+
+[Drivers_Dir]
+virt2phys.sys
+
+;-------------- Service installation
+[virt2phys_Device.NT.Services]
+AddService = virt2phys,%SPSVCINST_ASSOCSERVICE%, virt2phys_Service_Inst
+
+; -------------- virt2phys driver install sections
+[virt2phys_Service_Inst]
+DisplayName    = %virt2phys.SVCDESC%
+ServiceType    = 1 ; SERVICE_KERNEL_DRIVER
+StartType      = 3 ; SERVICE_DEMAND_START
+ErrorControl   = 1 ; SERVICE_ERROR_NORMAL
+ServiceBinary  = %12%\virt2phys.sys
+
+[Strings]
+SPSVCINST_ASSOCSERVICE = 0x00000002
+ManufacturerName = "Dmitry Kozlyuk"
+ClassName = "Kernel bypass"
+DiskName = "virt2phys Installation Disk"
+virt2phys.DeviceDesc = "Virtual to physical address translator"
+virt2phys.SVCDESC = "virt2phys Service"
diff --git a/windows/virt2phys/virt2phys.sln b/windows/virt2phys/virt2phys.sln
new file mode 100644
index 0000000..0f5ecdc
--- /dev/null
+++ b/windows/virt2phys/virt2phys.sln
@@ -0,0 +1,27 @@
+
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio Version 16
+VisualStudioVersion = 16.0.29613.14
+MinimumVisualStudioVersion = 10.0.40219.1
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "virt2phys", "virt2phys.vcxproj", "{0EEF826B-9391-43A8-A722-BDD6F6115137}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|x64 = Debug|x64
+		Release|x64 = Release|x64
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.ActiveCfg = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Build.0 = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Deploy.0 = Debug|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.ActiveCfg = Release|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Build.0 = Release|x64
+		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Deploy.0 = Release|x64
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+	GlobalSection(ExtensibilityGlobals) = postSolution
+		SolutionGuid = {845012FB-4471-4A12-A1C4-FF7E05C40E8E}
+	EndGlobalSection
+EndGlobal
diff --git a/windows/virt2phys/virt2phys.vcxproj b/windows/virt2phys/virt2phys.vcxproj
new file mode 100644
index 0000000..fa51916
--- /dev/null
+++ b/windows/virt2phys/virt2phys.vcxproj
@@ -0,0 +1,228 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup Label="ProjectConfigurations">
+    <ProjectConfiguration Include="Debug|Win32">
+      <Configuration>Debug</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|Win32">
+      <Configuration>Release</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|ARM">
+      <Configuration>Debug</Configuration>
+      <Platform>ARM</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|ARM">
+      <Configuration>Release</Configuration>
+      <Platform>ARM</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|ARM64">
+      <Configuration>Debug</Configuration>
+      <Platform>ARM64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|ARM64">
+      <Configuration>Release</Configuration>
+      <Platform>ARM64</Platform>
+    </ProjectConfiguration>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="virt2phys.c" />
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="virt2phys.h" />
+  </ItemGroup>
+  <ItemGroup>
+    <Inf Include="virt2phys.inf" />
+  </ItemGroup>
+  <PropertyGroup Label="Globals">
+    <ProjectGuid>{0EEF826B-9391-43A8-A722-BDD6F6115137}</ProjectGuid>
+    <TemplateGuid>{497e31cb-056b-4f31-abb8-447fd55ee5a5}</TemplateGuid>
+    <TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
+    <MinimumVisualStudioVersion>12.0</MinimumVisualStudioVersion>
+    <Configuration>Debug</Configuration>
+    <Platform Condition="'$(Platform)' == ''">Win32</Platform>
+    <RootNamespace>virt2phys</RootNamespace>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'" Label="Configuration">
+    <TargetVersion>Windows10</TargetVersion>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
+    <ConfigurationType>Driver</ConfigurationType>
+    <DriverType>KMDF</DriverType>
+    <DriverTargetPlatform>Universal</DriverTargetPlatform>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
+  <ImportGroup Label="ExtensionSettings">
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <PropertyGroup Label="UserMacros" />
+  <PropertyGroup />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
+    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
+  </PropertyGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <ClCompile>
+      <WppEnabled>false</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+    <Link>
+      <AdditionalDependencies>$(DDK_LIB_PATH)wdmsec.lib;%(AdditionalDependencies)</AdditionalDependencies>
+    </Link>
+    <Inf>
+      <TimeStamp>0.1</TimeStamp>
+    </Inf>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
+    <ClCompile>
+      <WppEnabled>true</WppEnabled>
+      <WppRecorderEnabled>true</WppRecorderEnabled>
+      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
+      <WppKernelMode>true</WppKernelMode>
+    </ClCompile>
+  </ItemDefinitionGroup>
+  <ItemGroup>
+    <FilesToPackage Include="$(TargetPath)" />
+  </ItemGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
+  <ImportGroup Label="ExtensionTargets">
+  </ImportGroup>
+</Project>
\ No newline at end of file
diff --git a/windows/virt2phys/virt2phys.vcxproj.filters b/windows/virt2phys/virt2phys.vcxproj.filters
new file mode 100644
index 0000000..0fe65fc
--- /dev/null
+++ b/windows/virt2phys/virt2phys.vcxproj.filters
@@ -0,0 +1,36 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup>
+    <Filter Include="Source Files">
+      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
+      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
+    </Filter>
+    <Filter Include="Header Files">
+      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>
+      <Extensions>h;hpp;hxx;hm;inl;inc;xsd</Extensions>
+    </Filter>
+    <Filter Include="Resource Files">
+      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>
+      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms</Extensions>
+    </Filter>
+    <Filter Include="Driver Files">
+      <UniqueIdentifier>{8E41214B-6785-4CFE-B992-037D68949A14}</UniqueIdentifier>
+      <Extensions>inf;inv;inx;mof;mc;</Extensions>
+    </Filter>
+  </ItemGroup>
+  <ItemGroup>
+    <Inf Include="virt2phys.inf">
+      <Filter>Driver Files</Filter>
+    </Inf>
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="virt2phys.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="virt2phys.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+  </ItemGroup>
+</Project>
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 02/10] eal/windows: do not expose private EAL facilities
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-21 22:40       ` Thomas Monjalon
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 03/10] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                       ` (8 subsequent siblings)
  10 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Thomas Monjalon, Anand Rawat

The goal of rte_os.h is to mitigate OS differences for EAL users.
In Windows EAL, rte_os.h did excessive things:

1. It included platform SDK headers (windows.h, etc). Those files are
   huge, require specific inclusion order, and are generally unused by
   the code including rte_os.h. Declarations from platform SDK may
   break otherwise platform-independent code, e.g. min, max, ERROR.

2. It included pthread.h, which is clearly not always required.

3. It defined functions private to Windows EAL.

Reorganize Windows EAL includes in the following way:

1. Create rte_windows.h to properly import Windows-specific facilities.
   Primary users are bus drivers, tests, and external applications.

2. Remove platform SDK includes from rte_os.h to prevent breaking
   otherwise portable code by including rte_os.h on Windows.
   Copy necessary definitions to avoid including those headers.

3. Remove pthread.h include from rte_os.h.

4. Move declarations private to Windows EAL into eal_windows.h.
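
As an illustration (not part of the patch), a Windows-only source that
needs the platform SDK would then do:

	#include <rte_windows.h> /* windows.h and friends, correct order */

	static int
	probe_region(void *addr)
	{
		MEMORY_BASIC_INFORMATION info;

		if (VirtualQuery(addr, &info, sizeof(info)) == 0) {
			RTE_LOG_WIN32_ERR("VirtualQuery(%p)", addr);
			return -1;
		}
		return 0;
	}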

Fixes: 428eb983f5f7 ("eal: add OS specific header file")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal.c                 |  2 +
 lib/librte_eal/windows/eal_lcore.c           |  2 +
 lib/librte_eal/windows/eal_thread.c          |  1 +
 lib/librte_eal/windows/eal_windows.h         | 29 +++++++++++++
 lib/librte_eal/windows/include/meson.build   |  1 +
 lib/librte_eal/windows/include/pthread.h     |  2 +
 lib/librte_eal/windows/include/rte_os.h      | 44 ++++++--------------
 lib/librte_eal/windows/include/rte_windows.h | 41 ++++++++++++++++++
 8 files changed, 91 insertions(+), 31 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_windows.h
 create mode 100644 lib/librte_eal/windows/include/rte_windows.h

diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e4b50df3b..2cf7a04ef 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -18,6 +18,8 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_windows.h"
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index b3a6c63af..82ee45413 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -2,12 +2,14 @@
  * Copyright(c) 2019 Intel Corporation
  */
 
+#include <pthread.h>
 #include <stdint.h>
 
 #include <rte_common.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_windows.h"
 
 /* global data structure that contains the CPU map */
 static struct _wcpu_map {
diff --git a/lib/librte_eal/windows/eal_thread.c b/lib/librte_eal/windows/eal_thread.c
index 9e4bbaa08..e149199a6 100644
--- a/lib/librte_eal/windows/eal_thread.c
+++ b/lib/librte_eal/windows/eal_thread.c
@@ -14,6 +14,7 @@
 #include <eal_thread.h>
 
 #include "eal_private.h"
+#include "eal_windows.h"
 
 RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
 RTE_DEFINE_PER_LCORE(unsigned int, _socket_id) = (unsigned int)SOCKET_ID_ANY;
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
new file mode 100644
index 000000000..fadd676b2
--- /dev/null
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#ifndef _EAL_WINDOWS_H_
+#define _EAL_WINDOWS_H_
+
+/**
+ * @file Facilities private to Windows EAL
+ */
+
+#include <rte_windows.h>
+
+/**
+ * Create a map of processors and cores on the system.
+ */
+void eal_create_cpu_map(void);
+
+/**
+ * Create a thread.
+ *
+ * @param thread
+ *   The location to store the thread id if successful.
+ * @return
+ *   0 for success, -1 if the thread is not created.
+ */
+int eal_thread_create(pthread_t *thread);
+
+#endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 7d18dd52f..5fb1962ac 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,4 +5,5 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/pthread.h b/lib/librte_eal/windows/include/pthread.h
index b9dd18e56..cfd53f0b8 100644
--- a/lib/librte_eal/windows/include/pthread.h
+++ b/lib/librte_eal/windows/include/pthread.h
@@ -5,6 +5,8 @@
 #ifndef _PTHREAD_H_
 #define _PTHREAD_H_
 
+#include <stdint.h>
+
 /**
  * This file is required to support the common code in eal_common_proc.c,
  * eal_common_thread.c and common\include\rte_per_lcore.h as Microsoft libc
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index e1e0378e6..510e39e03 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -8,20 +8,18 @@
 /**
  * This is header should contain any function/macro definition
  * which are not supported natively or named differently in the
- * Windows OS. Functions will be added in future releases.
+ * Windows OS. It must not include Windows-specific headers.
  */
 
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-#include <windows.h>
-#include <basetsd.h>
-#include <pthread.h>
-#include <stdio.h>
-
-/* limits.h replacement */
-#include <stdlib.h>
+/* limits.h replacement, value as in <windows.h> */
 #ifndef PATH_MAX
 #define PATH_MAX _MAX_PATH
 #endif
@@ -31,8 +29,6 @@ extern "C" {
 /* strdup is deprecated in Microsoft libc and _strdup is preferred */
 #define strdup(str) _strdup(str)
 
-typedef SSIZE_T ssize_t;
-
 #define strtok_r(str, delim, saveptr) strtok_s(str, delim, saveptr)
 
 #define index(a, b)     strchr(a, b)
@@ -40,22 +36,14 @@ typedef SSIZE_T ssize_t;
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
-/**
- * Create a thread.
- * This function is private to EAL.
- *
- * @param thread
- *   The location to store the thread id if successful.
- * @return
- *   0 for success, -1 if the thread is not created.
- */
-int eal_thread_create(pthread_t *thread);
+/* cpu_set macros implementation */
+#define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
+#define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
+#define RTE_CPU_FILL(set) CPU_FILL(set)
+#define RTE_CPU_NOT(dst, src) CPU_NOT(dst, src)
 
-/**
- * Create a map of processors and cores on the system.
- * This function is private to EAL.
- */
-void eal_create_cpu_map(void);
+/* as in <windows.h> */
+typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
 static inline int
@@ -86,12 +74,6 @@ asprintf(char **buffer, const char *format, ...)
 }
 #endif /* RTE_TOOLCHAIN_GCC */
 
-/* cpu_set macros implementation */
-#define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
-#define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
-#define RTE_CPU_FILL(set) CPU_FILL(set)
-#define RTE_CPU_NOT(dst, src) CPU_NOT(dst, src)
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
new file mode 100644
index 000000000..ed6e4c148
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#ifndef _RTE_WINDOWS_H_
+#define _RTE_WINDOWS_H_
+
+/**
+ * @file Windows-specific facilities
+ *
+ * This file should be included by DPDK libraries and applications
+ * that need access to Windows API. It includes platform SDK headers
+ * in compatible order with proper options and defines error-handling macros.
+ */
+
+/* Disable excessive libraries. */
+#ifndef WIN32_LEAN_AND_MEAN
+#define WIN32_LEAN_AND_MEAN
+#endif
+
+/* Must come first. */
+#include <windows.h>
+
+#include <basetsd.h>
+#include <psapi.h>
+
+/* Have GUIDs defined. */
+#ifndef INITGUID
+#define INITGUID
+#endif
+#include <initguid.h>
+
+/**
+ * Log GetLastError() with context, usually a Win32 API function and arguments.
+ */
+#define RTE_LOG_WIN32_ERR(...) \
+	RTE_LOG(DEBUG, EAL, RTE_FMT("GetLastError()=%lu: " \
+		RTE_FMT_HEAD(__VA_ARGS__,) "\n", GetLastError(), \
+		RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#endif /* _RTE_WINDOWS_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 03/10] eal/windows: improve CPU and NUMA node detection
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 04/10] eal/windows: initialize hugepage info Dmitry Kozlyuk
                       ` (7 subsequent siblings)
  10 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Anand Rawat, Jeff Shaw

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add EAL private function to map DPDK socket ID to NUMA node number.
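
As an aside for readers, a minimal sketch (not part of the patch) of how
a group-relative core index maps to a global core number; the patch uses
the same formula via EAL_PROCESSOR_GROUP_SIZE:

    #include <limits.h>
    #include <windows.h>

    /* Sketch: a processor group holds at most
     * sizeof(KAFFINITY) * CHAR_BIT logical processors (64 on 64-bit
     * Windows, 32 on 32-bit), so a global core number is the group
     * number times the group size plus the bit index in the mask.
     */
    static unsigned int
    global_core_id(WORD group, unsigned int bit)
    {
        return group * (sizeof(KAFFINITY) * CHAR_BIT) + bit;
    }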

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				GetLastError());
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e.g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited to 32 or 64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 04/10] eal/windows: initialize hugepage info
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (2 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 03/10] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic

Add hugepages discovery ("large pages" in Windows terminology)
and update documentation for required privilege setup.
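
A minimal sketch (illustration, not patch code) of how large-page
support can be probed on Windows; GetLargePageMinimum() is the Win32
call the patch relies on:

    #include <windows.h>

    /* Sketch: returns the large-page ("hugepage") size in bytes,
     * or 0 if the CPU or OS does not support large pages. Allocation
     * additionally requires SeLockMemoryPrivilege.
     */
    static size_t
    probe_hugepage_size(void)
    {
        return (size_t)GetLargePageMinimum();
    }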

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 7 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/config/meson.build b/config/meson.build
index 58421342b..4607655d9 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -263,6 +263,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminology) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open the *Local Security Policy* snap-in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. The privilege is applied upon next logon. In particular, if it has been
+   granted to the current user, a logoff is required before it takes effect.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run ``dpdk-helloworld.exe``.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 2cf7a04ef..63461f51a 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -18,8 +18,11 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -242,6 +245,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..b099d13f9
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available in Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node is available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem in Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 09dd4ab2f..5f118bfe2 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_thread.c',
 	'getopt.c',
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (3 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 04/10] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-15 21:48       ` Thomas Monjalon
  2020-04-17 12:24       ` Burakov, Anatoly
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
                       ` (5 subsequent siblings)
  10 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon

EAL common code uses file locking and truncation. Introduce
OS-independent wrappers in order to support both Linux/FreeBSD
and Windows:

* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

The wrappers follow POSIX semantics, but the interface is deliberately
not POSIX, so that it can be made cleaner, e.g. by not mixing the
locking operation with the behavior on conflict.
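
A hypothetical usage sketch (wrapper names are from this patch; the
file path and surrounding error handling are illustrative assumptions):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #include <rte_errno.h>

    #include "eal_private.h"

    /* Sketch: take an exclusive lock without waiting for release. */
    static int
    try_lock_config(const char *path)
    {
        int fd = open(path, O_RDWR);

        if (fd < 0)
            return -1;
        if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0) {
            if (rte_errno == EWOULDBLOCK)
                fprintf(stderr, "file is locked by another process\n");
            close(fd);
            return -1;
        }
        return fd; /* caller holds the lock until unlock or close */
    }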

The implementation for Linux and FreeBSD is placed in the "unix"
subdirectory, which is intended for code common to both OSes. Files
should be named after the OS-subdirectory files from which the code
is factored.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_private.h | 45 ++++++++++++++++
 lib/librte_eal/meson.build          |  4 ++
 lib/librte_eal/unix/eal.c           | 47 ++++++++++++++++
 lib/librte_eal/unix/meson.build     |  6 +++
 lib/librte_eal/windows/eal.c        | 83 +++++++++++++++++++++++++++++
 5 files changed, 185 insertions(+)
 create mode 100644 lib/librte_eal/unix/eal.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index ddcfbe2e4..65d61ff13 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -443,4 +443,49 @@ rte_option_usage(void);
 uint64_t
 eal_get_baseaddr(void);
 
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 9d219a0e6..1f89efb88 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal.c b/lib/librte_eal/unix/eal.c
new file mode 100644
index 000000000..a337b59b1
--- /dev/null
+++ b/lib/librte_eal/unix/eal.c
@@ -0,0 +1,47 @@
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..13564838e
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal.c',
+)
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 63461f51a..9dba895e7 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -224,6 +224,89 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
+
  /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (4 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-15 22:17       ` Thomas Monjalon
                         ` (4 more replies)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
                       ` (4 subsequent siblings)
  10 siblings, 5 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Anatoly Burakov,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

System memory management is implemented differently for POSIX and
Windows. Introduce wrapper functions for operations used across DPDK:

* rte_mem_map()
  Create memory mapping for a regular file or a page file (swap).
  This supports mapping to a reserved memory region even on Windows.

* rte_mem_unmap()
  Remove mapping created with rte_mem_map().

* rte_get_page_size()
  Obtain default system page size.

* rte_mem_lock()
  Make arbitrary-sized memory region non-swappable.

The wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from the POSIX ones to be safer and
more expressive.
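
A hypothetical usage sketch (all function and flag names are from this
patch; the two-page size is an illustrative assumption):

    #include <stddef.h>

    #include <rte_memory.h>

    /* Sketch: map two pages of anonymous memory, pin them in RAM,
     * then release the mapping.
     */
    static void
    map_and_lock_demo(void)
    {
        size_t len = 2 * (size_t)rte_get_page_size();
        void *va = rte_mem_map(NULL, len, RTE_PROT_READ | RTE_PROT_WRITE,
                RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);

        if (va == NULL)
            return; /* rte_errno is set by rte_mem_map() */
        if (rte_mem_lock(va, len) == 0) {
            /* va is now resident and cannot be swapped out */
        }
        rte_mem_unmap(va, len);
    }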

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                   |  10 +-
 lib/librte_eal/common/eal_private.h  |  51 +++-
 lib/librte_eal/include/rte_memory.h  |  68 +++++
 lib/librte_eal/rte_eal_exports.def   |   4 +
 lib/librte_eal/rte_eal_version.map   |   4 +
 lib/librte_eal/unix/eal_memory.c     | 113 +++++++
 lib/librte_eal/unix/meson.build      |   1 +
 lib/librte_eal/windows/eal.c         |   6 +
 lib/librte_eal/windows/eal_memory.c  | 437 +++++++++++++++++++++++++++
 lib/librte_eal/windows/eal_windows.h |  67 ++++
 lib/librte_eal/windows/meson.build   |   1 +
 11 files changed, 758 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c

diff --git a/config/meson.build b/config/meson.build
index 4607655d9..bceb5ef7b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -256,14 +256,20 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it from advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
 	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 65d61ff13..1e89338f2 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,16 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/** Reserve hugepages (support may be limited or missing). */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/** Fail if requested address is not available. */
+	EAL_RESERVE_EXACT_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -232,8 +243,8 @@ int rte_eal_check_module(const char *module_name);
 #define EAL_VIRTUAL_AREA_UNMAP (1 << 2)
 /**< immediately unmap reserved virtual area. */
 void *
-eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
+	int flags, int mmap_flags);
 
 /**
  * Get cpu core_id.
@@ -488,4 +499,40 @@ int eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
  */
 int eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address. The system may not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options.
+ * @returns
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *eal_mem_reserve(void *requested_addr, size_t size,
+	enum eal_mem_reserve_flags flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If @code virt @endcode and @code size @endcode describe a part of the
+ * reserved region, only this part of the region is freed (accurately
+ * up to the system page size). If @code virt @endcode points to allocated
+ * memory, @code size @endcode must match the one specified on allocation.
+ * The behavior is undefined if the memory pointed by @code virt @endcode
+ * is obtained from another source than listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void eal_mem_free(void *virt, size_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..1b7c3e5df 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -85,6 +85,74 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * Memory protection flags.
+ */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,   /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/**
+ * Memory mapping additional flags.
+ *
+ * On Linux and FreeBSD, each flag is semantically equivalent
+ * to the OS-specific mmap(3) flag with the same or similar name.
+ * On Windows, POSIX and MAP_ANONYMOUS semantics are followed.
+ */
+enum rte_map_flags {
+	/** Changes of mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/** Fail if requested address cannot be taken. */
+	RTE_MAP_FIXED = 1 << 3
+};
+
+/**
+ * OS-independent implementation of POSIX mmap(3)
+ * with MAP_ANONYMOUS Linux/FreeBSD extension.
+ */
+__rte_experimental
+void *rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_experimental
+int rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Positive page size in bytes.
+ */
+__rte_experimental
+int rte_get_page_size(void);
+
+/**
+ * Lock region in physical memory and prevent it from swapping.
+ *
+ * @param virt
+ *   The virtual address.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @note Implementations may require @p virt and @p size to be multiples
+ *       of system page size.
+ * @see rte_get_page_size()
+ * @see rte_mem_lock_page()
+ */
+__rte_experimental
+int rte_mem_lock(const void *virt, size_t size);
+
 /**
  * Lock page in physical memory and prevent from swapping.
  *
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..bacf9a107 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -5,5 +5,9 @@ EXPORTS
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
 	rte_eal_remote_launch
+	rte_get_page_size
 	rte_log
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_unmap
 	rte_vlog
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index f9ede5b41..07128898f 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -337,5 +337,9 @@ EXPERIMENTAL {
 	rte_thread_is_intr;
 
 	# added in 20.05
+	rte_get_page_size;
 	rte_log_can_log;
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_unmap;
 };
diff --git a/lib/librte_eal/unix/eal_memory.c b/lib/librte_eal/unix/eal_memory.c
new file mode 100644
index 000000000..6bd087d94
--- /dev/null
+++ b/lib/librte_eal/unix/eal_memory.c
@@ -0,0 +1,113 @@
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(ERR, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum eal_mem_reserve_flags flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+#ifdef MAP_HUGETLB
+	if (flags & EAL_RESERVE_HUGEPAGES)
+		sys_flags |= MAP_HUGETLB;
+#endif
+	if (flags & EAL_RESERVE_EXACT_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+static int
+mem_rte_to_sys_prot(enum rte_mem_prot prot)
+{
+	int sys_prot = 0;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	int sys_prot = 0;
+	int sys_flags = 0;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FIXED)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+int
+rte_get_page_size(void)
+{
+	return getpagesize();
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	return mlock(virt, size);
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index 13564838e..50c019a56 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal.c',
+	'eal_memory.c',
 )
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 9dba895e7..cf55b56da 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -339,6 +339,12 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..5697187ce
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,437 @@
+#include <io.h>
+
+#include <rte_errno.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterMax,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly as the function, so that user code does not depend
+ * on it being found at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	OSVERSIONINFO info;
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	/* IsWindows10OrGreater() may also be unavailable. */
+	memset(&info, 0, sizeof(info));
+	info.dwOSVersionInfoSize = sizeof(info);
+	GetVersionEx(&info);
+
+	/* Checking for Windows 10+ will also detect Windows Server 2016+.
+	 * Do not abort, because Windows may report a false version depending
+	 * on executable manifest, compatibility mode, etc.
+	 */
+	if (info.dwMajorVersion < 10)
+		RTE_LOG(DEBUG, EAL, "Windows 10+ or Windows Server 2016+ "
+			"is required for advanced memory features\n");
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* no VirtualAlloc2() */
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size,
+	enum eal_mem_reserve_flags flags)
+{
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		RTE_LOG(ERR, EAL, "Hugepage reservation is not supported\n");
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+	if ((flags & EAL_RESERVE_EXACT_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFree(virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFree()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc(size_t size, enum rte_page_sizes page_size)
+{
+	if (page_size != 0)
+		return eal_mem_alloc_socket(size, SOCKET_ID_ANY);
+
+	return VirtualAlloc(
+		NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags;
+	void *addr;
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void*
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+		if (VirtualQuery(requested_addr, &info, sizeof(info)) == 0) {
+			RTE_LOG_WIN32_ERR("VirtualQuery()");
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) &&
+			!VirtualFree(requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+				"<split placeholder>)", requested_addr, size);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	if (requested_addr != NULL)
+		flags |= MEM_REPLACE_PLACEHOLDER;
+
+	addr = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, "
+			"<replace placeholder>)", addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, ...)", addr, size);
+		return -1;
+	}
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if the region must be in reserved state but is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	if (VirtualQuery(addr, &info, sizeof(info)) == 0) {
+		RTE_LOG_WIN32_ERR("VirtualQuery()");
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFree(addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+			"MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)", addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFree(addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
+	enum rte_map_flags flags, int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* TODO: there is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not support
+	 * private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (!virt) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FIXED) && (virt != requested_addr)) {
+		BOOL ret = UnmapViewOfFile(virt);
+		virt = NULL;
+		if (!ret)
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		rte_errno = GetLastError();
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		return -1;
+	}
+	return 0;
+}
+
+int
+rte_get_page_size(void)
+{
+	SYSTEM_INFO info;
+	GetSystemInfo(&info);
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes a non-const pointer, cast away const. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock()");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..b202a1aa5 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -36,4 +36,71 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate a contiguous chunk of virtual memory.
+ *
+ * Use eal_mem_free() to free allocated memory.
+ *
+ * @param size
+ *  Number of bytes to allocate.
+ * @param page_size
+ *  If non-zero, memory must be allocated in hugepages
+ *  of the specified size. The @code size @endcode parameter
+ *  must then be a multiple of the requested hugepage size.
+ * @return
+ *  Address of allocated memory or NULL on failure (rte_errno is set).
+ */
+void *eal_mem_alloc(size_t size, enum rte_page_sizes page_size);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with @ref eal_mem_reserve()
+ * or decommitted from hugepages by @ref eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and @code rte_errno @endcode is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit.
+ *
+ * The @code addr @endcode and @code size @endcode must match
+ * the location and size of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 5f118bfe2..81d3ee095 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -8,6 +8,7 @@ sources += files(
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memory.c',
 	'eal_thread.c',
 	'getopt.c',
 )
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (5 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-15 22:19       ` Thomas Monjalon
  2020-04-17 13:04       ` Burakov, Anatoly
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
                       ` (3 subsequent siblings)
  10 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov, Bruce Richardson

All supported OSes create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
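
A sketch of the resulting call sequence (the helper below and its
parameter values are assumptions for illustration; the two eal_*
functions are the ones this patch introduces):

    #include <stdbool.h>

    #include <rte_memory.h>

    #include "eal_private.h"

    /* Sketch: back an MSL with an fbarray for 128 segments of 2 MB
     * pages on socket 0 (first MSL of this type, marked as heap),
     * then reserve contiguous VA space for all of them.
     */
    static int
    memseg_list_demo(struct rte_memseg_list *msl)
    {
        if (eal_alloc_memseg_list(msl, RTE_PGSIZE_2M, 128, 0, 0, true) < 0)
            return -1;
        return eal_reserve_memseg_list(msl, 0);
    }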

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c | 54 ++++++++++++++++++
 lib/librte_eal/common/eal_private.h       | 34 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       | 54 +++---------------
 lib/librte_eal/linux/eal_memory.c         | 68 +++++------------------
 4 files changed, 110 insertions(+), 100 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index cc7d54e0c..d9764681a 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -25,6 +25,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -182,6 +183,59 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_reserve_memseg_list(struct rte_memseg_list *msl,
+		enum eal_mem_reserve_flags flags)
+{
+	uint64_t page_sz;
+	size_t mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
+	if (addr == NULL) {
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	return 0;
+}
+
+int
+eal_alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
+			(size_t)page_sz >> 10, socket_id);
+
+	return 0;
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 1e89338f2..76938e379 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -246,6 +246,40 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
 	int flags, int mmap_flags);
 
+/**
+ * Reserve VA space for a memory segment list.
+ *
+ * @param msl
+ *  Memory segment list with page size defined.
+ * @param flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_reserve_memseg_list(struct rte_memseg_list *msl,
+	enum eal_mem_reserve_flags flags);
+
+/**
+ * Initialize a memory segment list with its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ */
+int
+eal_alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index a97d8f0f0..5174f9cd0 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -336,61 +336,23 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_alloc_memseg_list(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_reserve(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
+	enum eal_mem_reserve_flags flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_reserve_memseg_list(msl, flags);
 }
 
 
@@ -479,7 +441,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_alloc(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..a01a7ce76 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_alloc_memseg_list(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_reserve(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_reserve_memseg_list(msl, 0);
 }
 
 /*
@@ -1009,12 +969,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_alloc(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_reserve(msl) < 0)
 				return -1;
 		}
 	}
@@ -2191,7 +2151,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_alloc(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2160,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_reserve(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2355,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_alloc(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_reserve(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2393,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_reserve(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (6 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-15  9:34       ` Jerin Jacob
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                       ` (2 subsequent siblings)
  10 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov

Clang on Windows follows the MS ABI, where enum values are limited to
2^31-1. Enum rte_page_sizes has members valued above this limit, which
get wrapped to zero, resulting in a compilation error (duplicate values
in the enum). Using the MS ABI is mandatory for Windows EAL to call
Win32 APIs.

Define these values outside of the enum for Clang on Windows only.
This does not affect runtime, because Windows doesn't run on machines
with 4GiB and 16GiB hugepages.
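
For illustration only (not part of this patch), a standalone C sketch of
the same workaround pattern; all names in it are hypothetical, not DPDK
API:

    /* Keep out-of-range constants as macros under the MS ABI. */
    #include <stdio.h>

    enum page_sizes {
        PGSIZE_2M = 1ULL << 21,
        PGSIZE_1G = 1ULL << 30,
    /* MS ABI caps enumerators at 2^31 - 1; larger values become macros. */
    #if !defined(__clang__) || !defined(_WIN32)
        PGSIZE_4G = 1ULL << 32,
    #else
    #define PGSIZE_4G (1ULL << 32)
    #endif
    };

    int main(void)
    {
        /* Call sites look the same whether PGSIZE_4G is an
         * enumerator or a macro.
         */
        printf("%llu\n", (unsigned long long)PGSIZE_4G);
        return 0;
    }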

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/include/rte_memory.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 1b7c3e5df..3ec673f51 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -34,8 +34,14 @@ enum rte_page_sizes {
 	RTE_PGSIZE_256M  = 1ULL << 28,
 	RTE_PGSIZE_512M  = 1ULL << 29,
 	RTE_PGSIZE_1G    = 1ULL << 30,
+/* Work around Clang on Windows being limited to 32-bit underlying type. */
+#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)
 	RTE_PGSIZE_4G    = 1ULL << 32,
 	RTE_PGSIZE_16G   = 1ULL << 34,
+#else
+#define RTE_PGSIZE_4G  (1ULL << 32)
+#define RTE_PGSIZE_16G (1ULL << 34)
+#endif
 };
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (7 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-04-14 23:37     ` [dpdk-dev] [PATCH v3 00/10] Windows " Kadam, Pallavi
  10 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon

The limited version imported previously lacks at least the SLIST macros.
Import the complete file from FreeBSD, since its license exception is
already approved by the Technical Board.
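
For illustration only (not part of this patch), a minimal standalone
sketch of the SLIST macros that this import makes available:

    #include <sys/queue.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct entry {
        int value;
        SLIST_ENTRY(entry) next;    /* embedded forward link */
    };

    SLIST_HEAD(entry_list, entry);

    int main(void)
    {
        struct entry_list head = SLIST_HEAD_INITIALIZER(head);
        struct entry *e;
        int i;

        for (i = 0; i < 3; i++) {
            e = malloc(sizeof(*e));
            if (e == NULL)
                return 1;
            e->value = i;
            SLIST_INSERT_HEAD(&head, e, next);
        }

        SLIST_FOREACH(e, &head, next)
            printf("%d\n", e->value);    /* prints 2 1 0 */

        /* Remove from the head, as the header comment recommends. */
        while (!SLIST_EMPTY(&head)) {
            e = SLIST_FIRST(&head);
            SLIST_REMOVE_HEAD(&head, next);
            free(e);
        }
        return 0;
    }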

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (8 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-04-14 19:44     ` Dmitry Kozlyuk
  2020-04-15  9:42       ` Jerin Jacob
  2020-04-16 18:34       ` Ranjit Menon
  2020-04-14 23:37     ` [dpdk-dev] [PATCH v3 00/10] Windows " Kadam, Pallavi
  10 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-14 19:44 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic, Anatoly Burakov, Bruce Richardson

Basic memory management supports core libraries and PMDs operating in
IOVA-as-PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated in user mode.
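
For illustration only (not part of this patch), a minimal sketch of how
an application might consume this, assuming rte_eal_init() has already
succeeded; it uses only symbols exported below (rte_malloc,
rte_malloc_virt2iova, rte_free):

    #include <inttypes.h>
    #include <stdio.h>

    #include <rte_malloc.h>
    #include <rte_memory.h>

    /* Query the IOVA (a physical address in IOVA-as-PA mode) of a
     * buffer from a DPDK heap. Without virt2phys loaded, the IOVA
     * is expected to be reported as RTE_BAD_IOVA.
     */
    static int
    show_buffer_iova(void)
    {
        void *buf = rte_malloc(NULL, 4096, 0);
        rte_iova_t iova;

        if (buf == NULL)
            return -1;

        iova = rte_malloc_virt2iova(buf);
        if (iova == RTE_BAD_IOVA) {
            rte_free(buf);
            return -1;    /* physical addresses unavailable */
        }

        printf("virt=%p iova=0x%" PRIx64 "\n", buf, iova);
        rte_free(buf);
        return 0;
    }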

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                            |   2 +-
 doc/guides/windows_gsg/run_apps.rst           |  43 +-
 lib/librte_eal/common/eal_common_fbarray.c    |  57 +-
 lib/librte_eal/common/eal_common_memory.c     |  50 +-
 lib/librte_eal/common/eal_private.h           |   6 +-
 lib/librte_eal/common/malloc_heap.c           |   1 +
 lib/librte_eal/common/meson.build             |   9 +
 lib/librte_eal/freebsd/eal_memory.c           |   1 -
 lib/librte_eal/rte_eal_exports.def            | 119 ++-
 lib/librte_eal/windows/eal.c                  |  55 ++
 lib/librte_eal/windows/eal_memalloc.c         | 418 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 706 +++++++++++++++++-
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  23 +
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |   4 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   4 +
 20 files changed, 1571 insertions(+), 70 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/config/meson.build b/config/meson.build
index bceb5ef7b..7b8baa788 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -270,7 +270,7 @@ if is_windows
 		add_project_link_arguments('-lmincore', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..323d050dc 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,44 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to documentation in ``dpdk-kmods`` repository for details on system
+setup, driver build and installation. This driver is not signed, so signature
+checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+The compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from an elevated Command Prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+When loaded successfully, the driver is shown in *Device Manager* as the
+*Virtual to physical address translator* device under the *Kernel bypass*
+category. The installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 1312f936b..236db9cb7 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,15 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -85,19 +85,16 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
 		/* pass errno up the chain */
 		rte_errno = errno;
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FIXED, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -735,7 +732,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -756,9 +753,12 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		void *new_data = rte_mem_map(
+			data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_FIXED | RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS,
+			fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
 					__func__, strerror(errno));
 			goto fail;
@@ -778,7 +778,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 					__func__, path, strerror(errno));
 			rte_errno = errno;
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
 					__func__, path, strerror(errno));
 			rte_errno = EBUSY;
@@ -789,10 +790,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -824,7 +823,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -862,7 +861,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -895,10 +894,8 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -916,7 +913,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -944,8 +941,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -964,7 +960,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -999,8 +995,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1025,7 +1020,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,14 +1037,14 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index d9764681a..5c3cf1f75 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,7 +11,6 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
@@ -44,7 +43,7 @@ static uint64_t system_page_sz;
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, enum eal_mem_reserve_flags reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -52,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_get_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -98,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
-			*size -= page_sz;
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
+			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -125,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -154,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -172,12 +166,12 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	return aligned_addr;
@@ -586,10 +580,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	int page_size = rte_get_page_size();
+	uintptr_t aligned = (virtual & ~(page_size - 1));
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 76938e379..59ac41916 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -226,8 +226,8 @@ enum eal_mem_reserve_flags {
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to rte_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -244,7 +244,7 @@ enum eal_mem_reserve_flags {
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
-	int flags, int mmap_flags);
+	int flags, enum eal_mem_reserve_flags reserve_flags);
 
 /**
  * Reserve VA space for a memory segment list.
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index 842eb9de7..6534c895c 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -729,6 +729,7 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg,
 		if (ret != NULL)
 			return ret;
 	}
+
 	return NULL;
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 02d9280cc..6dcdcc890 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -9,11 +9,20 @@ if is_windows
 		'eal_common_class.c',
 		'eal_common_devargs.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 		'rte_option.c',
 	)
 	subdir_done()
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5174f9cd0..99bf6ec9e 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -355,7 +355,6 @@ memseg_list_reserve(struct rte_memseg_list *msl)
 	return eal_reserve_memseg_list(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index bacf9a107..854b83bcd 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,13 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
-	rte_get_page_size
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
 	rte_log
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
+	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_get_page_size
 	rte_mem_lock
 	rte_mem_map
 	rte_mem_unmap
-	rte_vlog
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index cf55b56da..38f17f09c 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -93,6 +93,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -328,6 +346,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.no_shconf == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.no_shconf = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -345,6 +370,36 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..e72e785b8
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,418 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+#include <rte_windows.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files on Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files on Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* This is a bug and must not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d\n", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force the OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable on Windows, so there is no race
+	 * for the physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	eal_mem_decommit(addr, alloc_sz);
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len))
+		return -1;
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info);
+				i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
index 5697187ce..82f3dafe6 100644
--- a/lib/librte_eal/windows/eal_memory.c
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -1,11 +1,23 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2010-2014 Intel Corporation (functions from Linux EAL)
+ * Copyright (c) 2020 Dmitry Kozlyuk (Windows specifics)
+ */
+
+#include <inttypes.h>
 #include <io.h>
 
 #include <rte_errno.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
 #include "eal_private.h"
 #include "eal_windows.h"
 
+#include <rte_virt2phys.h>
+
 /* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
  * Provide a copy of definitions and code to load it dynamically.
  * Note: definitions are copied verbatim from Microsoft documentation
@@ -120,6 +132,119 @@ eal_mem_win32api_init(void)
 
 #endif /* no VirtualAlloc2() */
 
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Windows always uses physical addresses if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
 /* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
 static void
 set_errno_from_win32_alloc_error(DWORD code)
@@ -253,6 +378,10 @@ eal_mem_commit(void *requested_addr, size_t size, int socket_id)
 int
 eal_mem_decommit(void *addr, size_t size)
 {
+	/* Decommit memory, which might be a part of a larger reserved region.
+	 * The allocator commits hugepage-sized placeholders, so there is no
+	 * need to coalesce placeholders back into a region; they can be
+	 * reused as is.
+	 */
 	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
 		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, ...)", addr, size);
 		return -1;
@@ -364,7 +493,7 @@ rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
 		return NULL;
 	}
 
-	/* TODO: there is a race for the requested_addr between mem_free()
+	/* There is a race for the requested_addr between mem_free()
 	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
 	 * region with a mapping in a single operation, but it does not support
 	 * private mappings.
@@ -414,6 +543,16 @@ rte_mem_unmap(void *virt, size_t size)
 	return 0;
 }
 
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* The Windows memory allocation strategy is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless the user provides an address hint.
+	 */
+	return 0;
+}
+
 int
 rte_get_page_size(void)
 {
@@ -435,3 +574,568 @@ rte_mem_lock(const void *virt, size_t size)
 
 	return 0;
 }
+
+static int
+memseg_list_alloc(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx)
+{
+	return eal_alloc_memseg_list(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
+}
+
+static int
+memseg_list_reserve(struct rte_memseg_list *msl)
+{
+	return eal_reserve_memseg_list(msl, 0);
+}
+
+/*
+ * The remaining code in this file largely duplicates the Linux EAL.
+ * Although the Windows EAL currently supports only one hugepage size,
+ * code structure and comments are preserved so that changes can be
+ * ported easily until the duplication is removed.
+ */
+
+static int
+memseg_primary_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
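+	/*
+	 * An illustrative example (default config values assumed): with
+	 * RTE_MAX_MEM_MB = 524288 (512 GiB), 2 NUMA nodes and a single
+	 * 2 MiB hugepage size, there are 2 memory types. Each type is
+	 * capped at min(RTE_MAX_MEM_MB_PER_TYPE, 524288 / 2) MiB and at
+	 * RTE_MAX_MEMSEG_PER_TYPE segments, and may use at most
+	 * RTE_MAX_MEMSEG_LISTS / 2 memseg lists.
+	 */
+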
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how many segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (memseg_list_alloc(msl, pagesz, n_segs,
+					socket_id, cur_seglist))
+				goto out;
+
+			if (memseg_list_reserve(msl)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int
+memseg_secondary_init(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+		return memseg_primary_init();
+	return memseg_secondary_init();
+}
+
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+static int
+calc_num_pages_per_socket(uint64_t *memory,
+		struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used,
+		unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from cpu mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skip if memory on this socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so let's see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int) (internal_config.memory / 0x100000);
+		available = requested - (unsigned int) (total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
+
+/* Limit is checked by the validator itself; nothing left to analyze. */
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+static int
+eal_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		/* also initialize hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; socket_id++)
+		memory[socket_id] = internal_config.socket_mem[socket_id];
+
+	/* calculate final number of pages */
+	if (calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+				if (pages == NULL)
+					return -1;
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket "
+					"limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs, cur_seg;
+	uint64_t page_sz;
+	void *addr;
+	struct rte_fbarray *arr;
+	struct rte_memseg *ms;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	/* create a memseg list */
+	msl = &mcfg->memsegs[0];
+
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = internal_config.memory / page_sz;
+
+	if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
+		sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		return -1;
+	}
+
+	addr = eal_mem_alloc(internal_config.memory, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate %zu bytes",
+		internal_config.memory);
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->page_sz = page_sz;
+	msl->socket_id = 0;
+	msl->len = internal_config.memory;
+	msl->heap = 1;
+
+	/* populate memsegs. each memseg is one page long */
+	for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
+		arr = &msl->memseg_arr;
+
+		ms = rte_fbarray_get(arr, cur_seg);
+		ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, cur_seg);
+
+		addr = RTE_PTR_ADD(addr, (size_t)page_sz);
+	}
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, the stub must log a warning instead,
+ * and a comment must document which code depends on the emulated success.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Success is correct here: no multi-process requests to register. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* The common memory allocator depends on this function succeeding. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index b202a1aa5..083ab8b93 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,6 +52,13 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
 /**
  * Locate Win32 memory management routines in system libraries.
  *
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..62805a307 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define open _open
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: a physical address of zero (0) is reported if the input address
+ * is paged out or not mapped. However, if the input is a valid mapping
+ * of I/O port 0x0000, the output is also zero. There is no way
+ * to distinguish between these cases by the return value alone.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 81d3ee095..0bd56cd8f 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -8,7 +8,11 @@ sources += files(
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memalloc.c',
 	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
@ 2020-04-14 23:35       ` Ranjit Menon
  2020-04-15 15:19         ` Thomas Monjalon
  2020-04-21  6:23       ` Ophir Munk
  1 sibling, 1 reply; 218+ messages in thread
From: Ranjit Menon @ 2020-04-14 23:35 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman

On 4/14/2020 12:44 PM, Dmitry Kozlyuk wrote:
> This driver supports Windows EAL memory management by translating
> current process virtual addresses to physical addresses (IOVA).
> Standalone virt2phys allows using DPDK without PMD and provides a
> reference implementation.
> 
> Suggested-by: Ranjit Menon <ranjit.menon@intel.com>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>   windows/README.rst                          | 103 +++++++++
>   windows/virt2phys/virt2phys.c               | 129 +++++++++++
>   windows/virt2phys/virt2phys.h               |  34 +++
>   windows/virt2phys/virt2phys.inf             |  64 ++++++
>   windows/virt2phys/virt2phys.sln             |  27 +++
>   windows/virt2phys/virt2phys.vcxproj         | 228 ++++++++++++++++++++
>   windows/virt2phys/virt2phys.vcxproj.filters |  36 ++++
>   7 files changed, 621 insertions(+)
>   create mode 100644 windows/README.rst
>   create mode 100644 windows/virt2phys/virt2phys.c
>   create mode 100644 windows/virt2phys/virt2phys.h
>   create mode 100644 windows/virt2phys/virt2phys.inf
>   create mode 100644 windows/virt2phys/virt2phys.sln
>   create mode 100644 windows/virt2phys/virt2phys.vcxproj
>   create mode 100644 windows/virt2phys/virt2phys.vcxproj.filters
> 

Reviewed-by: Ranjit Menon <ranjit.menon@intel.com>
Acked-by: Ranjit Menon <ranjit.menon@intel.com>


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/10] Windows basic memory management
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
                       ` (9 preceding siblings ...)
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-04-14 23:37     ` Kadam, Pallavi
  10 siblings, 0 replies; 218+ messages in thread
From: Kadam, Pallavi @ 2020-04-14 23:37 UTC (permalink / raw)
  To: dev, Dmitry Kozlyuk



On 4/14/2020 12:44 PM, Dmitry Kozlyuk wrote:
> Note: no changes in cover letter since v2.
> 
> This patchset implements basic MM with the following features:
> 
> * Hugepages are dynamically allocated in user-mode.
> * Only 2MB hugepages are supported.
> * IOVA is always PA, obtained through kernel-mode driver.
> * No 32-bit support (presumably not demanded).
> * No multi-process support (it is forcefully disabled).
> * No-huge mode for testing with IOVA unavailable.
> 
> 
> The first commit introduces a new kernel-mode driver, virt2phys.
> It translates user-mode virtual addresses into physical addresses.
> On Windows community call 2020-04-01 it was decided this driver can be
> used for now, later netUIO may pick up its code/interface or not.
> 
> 
> New EAL public functions for memory mapping are introduced
> to mitigate OS differences in DPDK libraries and applications:
> 
> * rte_mem_map
> * rte_mem_unmap
> * rte_mem_lock
> 
> To support common MM routines, internal wrappers for low-level
> memory reservation and file management are introduced. These changes
> affect Linux and FreeBSD EAL. Shared code is placed under the /unix/
> subdirectory (suggested by Thomas).
> 
> Also, entire <sys/queue.h> is imported from FreeBSD, replacing existing
> partial import. There is already a license exception for this file.
> 
> 
> Windows MM duplicates quite a lot of code from Linux EAL:
> 
> * eal_memalloc_alloc_seg_bulk
> * eal_memalloc_free_seg_bulk
> * calc_num_pages_per_socket
> * rte_eal_hugepage_init
> 
> Perhaps this should be left as-is until Windows MM evolves into having
> some specific requirements for these parts.
> 
> 
> Notes on checkpatch warnings:
> 
> * No space after comma / no space before closing parent in macros---
>    definitely a false-positive, unclear how to suppress this.
> 
> * Issues from imported BSD code---probably should be ignored?
> 
> * Checkpatch is not run against dpdk-kmods (Windows drivers).
> 
> ---
> 
> v3:
> 
>      * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
>      * Fix logic and error handling while allocating segments.
>      * Fix Unix rte_mem_map(): return NULL on failure.
>      * Fix some checkpatch.sh issues:
>          * Do not return positive errno, use DWORD for GetLastError().
>          * Make dpdk-kmods source files non-executable.
>      * Improve GSG for Windows Server (suggested by Ranjit Menon).
> 
> v2:
> 
>      * Rebase on ToT. Move all new code shared between Linux and FreeBSD
>        to /unix/ subdirectory, also factor out some existing code there.
>      * Improve description of Clang issue with rte_page_sizes on Windows.
>        Restore -fstrict-enum for EAL. Check running, not target compiler.
>      * Use EAL prefix for private facilities instead of RTE.
>      * Improve documentation comments for new functions.
>      * Remove co-installer for virt2phys. Add a typecast for clarity.
>      * Document virt2phys in user guide, improve its own README.
>      * Explicitly and forcefully disable multi-process.
> 
> Dmitry Kozlyuk (9):
>    eal/windows: do not expose private EAL facilities
>    eal/windows: improve CPU and NUMA node detection
>    eal/windows: initialize hugepage info
>    eal: introduce internal wrappers for file operations
>    eal: introduce memory management wrappers
>    eal: extract common code for memseg list initialization
>    eal/windows: fix rte_page_sizes with Clang on Windows
>    eal/windows: replace sys/queue.h with a complete one from FreeBSD
>    eal/windows: implement basic memory management
> 
>   config/meson.build                            |   12 +-
>   doc/guides/windows_gsg/build_dpdk.rst         |   20 -
>   doc/guides/windows_gsg/index.rst              |    1 +
>   doc/guides/windows_gsg/run_apps.rst           |   84 ++
>   lib/librte_eal/common/eal_common_fbarray.c    |   57 +-
>   lib/librte_eal/common/eal_common_memory.c     |  104 +-
>   lib/librte_eal/common/eal_private.h           |  134 +-
>   lib/librte_eal/common/malloc_heap.c           |    1 +
>   lib/librte_eal/common/meson.build             |    9 +
>   lib/librte_eal/freebsd/eal_memory.c           |   55 +-
>   lib/librte_eal/include/rte_memory.h           |   74 ++
>   lib/librte_eal/linux/eal_memory.c             |   68 +-
>   lib/librte_eal/meson.build                    |    4 +
>   lib/librte_eal/rte_eal_exports.def            |  119 ++
>   lib/librte_eal/rte_eal_version.map            |    4 +
>   lib/librte_eal/unix/eal.c                     |   47 +
>   lib/librte_eal/unix/eal_memory.c              |  113 ++
>   lib/librte_eal/unix/meson.build               |    7 +
>   lib/librte_eal/windows/eal.c                  |  160 +++
>   lib/librte_eal/windows/eal_hugepages.c        |  108 ++
>   lib/librte_eal/windows/eal_lcore.c            |  187 ++-
>   lib/librte_eal/windows/eal_memalloc.c         |  418 ++++++
>   lib/librte_eal/windows/eal_memory.c           | 1141 +++++++++++++++++
>   lib/librte_eal/windows/eal_mp.c               |  103 ++
>   lib/librte_eal/windows/eal_thread.c           |    1 +
>   lib/librte_eal/windows/eal_windows.h          |  129 ++
>   lib/librte_eal/windows/include/meson.build    |    2 +
>   lib/librte_eal/windows/include/pthread.h      |    2 +
>   lib/librte_eal/windows/include/rte_os.h       |   48 +-
>   .../windows/include/rte_virt2phys.h           |   34 +
>   lib/librte_eal/windows/include/rte_windows.h  |   43 +
>   lib/librte_eal/windows/include/sys/queue.h    |  663 +++++++++-
>   lib/librte_eal/windows/include/unistd.h       |    3 +
>   lib/librte_eal/windows/meson.build            |    6 +
>   34 files changed, 3611 insertions(+), 350 deletions(-)
>   create mode 100644 doc/guides/windows_gsg/run_apps.rst
>   create mode 100644 lib/librte_eal/unix/eal.c
>   create mode 100644 lib/librte_eal/unix/eal_memory.c
>   create mode 100644 lib/librte_eal/unix/meson.build
>   create mode 100644 lib/librte_eal/windows/eal_hugepages.c
>   create mode 100644 lib/librte_eal/windows/eal_memalloc.c
>   create mode 100644 lib/librte_eal/windows/eal_memory.c
>   create mode 100644 lib/librte_eal/windows/eal_mp.c
>   create mode 100644 lib/librte_eal/windows/eal_windows.h
>   create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h
>   create mode 100644 lib/librte_eal/windows/include/rte_windows.h
> 
Tested-by: Pallavi Kadam <pallavi.kadam@intel.com>

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
@ 2020-04-15  9:34       ` Jerin Jacob
  2020-04-15 10:32         ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Jerin Jacob @ 2020-04-15  9:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov

On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
> Enum rte_page_size has members valued above this limit, which get
> wrapped to zero, resulting in compilation error (duplicate values in
> enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
>
> Define these values outside of the enum for Clang on Windows only.
> This does not affect runtime, because Windows doesn't run on machines
> with 4GiB and 16GiB hugepages.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  lib/librte_eal/include/rte_memory.h | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
> index 1b7c3e5df..3ec673f51 100644
> --- a/lib/librte_eal/include/rte_memory.h
> +++ b/lib/librte_eal/include/rte_memory.h
> @@ -34,8 +34,14 @@ enum rte_page_sizes {
>         RTE_PGSIZE_256M  = 1ULL << 28,
>         RTE_PGSIZE_512M  = 1ULL << 29,
>         RTE_PGSIZE_1G    = 1ULL << 30,
> +/* Work around Clang on Windows being limited to 32-bit underlying type. */

It does look like "enum rte_page_sizes" is NOT used as an enum anywhere.

[master][dpdk.org] $ grep -ri "enum rte_page_sizes" lib/
lib/librte_eal/include/rte_memory.h:enum rte_page_sizes {

Why not remove this workaround and define all items as #define to
avoid the ifdef clutter below?
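
For illustration, a #define-only variant might look like this (only a
sketch of the suggestion, values abridged, not tested against the tree):

    /* Hypothetical replacement: plain macros avoid the enum
     * underlying-type limit; every value then has type
     * unsigned long long.
     */
    #define RTE_PGSIZE_4K   (1ULL << 12)
    #define RTE_PGSIZE_2M   (1ULL << 21)
    #define RTE_PGSIZE_1G   (1ULL << 30)
    #define RTE_PGSIZE_4G   (1ULL << 32)
    #define RTE_PGSIZE_16G  (1ULL << 34)
    /* ...and likewise for 64K, 256K, 16M, 256M, 512M. */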

> +#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)

See above.

>         RTE_PGSIZE_4G    = 1ULL << 32,
>         RTE_PGSIZE_16G   = 1ULL << 34,
> +#else
> +#define RTE_PGSIZE_4G  (1ULL << 32)
> +#define RTE_PGSIZE_16G (1ULL << 34)
> +#endif
>  };
>
>  #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-04-15  9:42       ` Jerin Jacob
  2020-04-16 18:34       ` Ranjit Menon
  1 sibling, 0 replies; 218+ messages in thread
From: Jerin Jacob @ 2020-04-15  9:42 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic,
	Anatoly Burakov, Bruce Richardson

On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> Basic memory management supports core libraries and PMDs operating in
> IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
> IOVAs of hugepages allocated from user-mode.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  config/meson.build                            |   2 +-
>  doc/guides/windows_gsg/run_apps.rst           |  43 +-
>  lib/librte_eal/common/eal_common_fbarray.c    |  57 +-
>  lib/librte_eal/common/eal_common_memory.c     |  50 +-
>  lib/librte_eal/common/eal_private.h           |   6 +-
>  lib/librte_eal/common/malloc_heap.c           |   1 +
>  lib/librte_eal/common/meson.build             |   9 +

IMO, we should split the lib/librte_eal/common/ and
lib/librte_eal/windows/ changes into separate
patches for better review and align the patch titles accordingly.



>  lib/librte_eal/freebsd/eal_memory.c           |   1 -
>  lib/librte_eal/rte_eal_exports.def            | 119 ++-
>  lib/librte_eal/windows/eal.c                  |  55 ++
>  lib/librte_eal/windows/eal_memalloc.c         | 418 +++++++++++
>  lib/librte_eal/windows/eal_memory.c           | 706 +++++++++++++++++-
>  lib/librte_eal/windows/eal_mp.c               | 103 +++
>  lib/librte_eal/windows/eal_windows.h          |  23 +
>  lib/librte_eal/windows/include/meson.build    |   1 +
>  lib/librte_eal/windows/include/rte_os.h       |   4 +
>  .../windows/include/rte_virt2phys.h           |  34 +
>  lib/librte_eal/windows/include/rte_windows.h  |   2 +
>  lib/librte_eal/windows/include/unistd.h       |   3 +
>  lib/librte_eal/windows/meson.build            |   4 +
>  20 files changed, 1571 insertions(+), 70 deletions(-)
>  create mode 100644 lib/librte_eal/windows/eal_memalloc.c
>  create mode 100644 lib/librte_eal/windows/eal_mp.c
>  create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-15  9:34       ` Jerin Jacob
@ 2020-04-15 10:32         ` Dmitry Kozlyuk
  2020-04-15 10:57           ` Jerin Jacob
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-15 10:32 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov

> On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
> >
> > Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
> > Enum rte_page_size has members valued above this limit, which get
> > wrapped to zero, resulting in compilation error (duplicate values in
> > enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
> >
> > Define these values outside of the enum for Clang on Windows only.
> > This does not affect runtime, because Windows doesn't run on machines
> > with 4GiB and 16GiB hugepages.
> >
> > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > ---
> >  lib/librte_eal/include/rte_memory.h | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
> > index 1b7c3e5df..3ec673f51 100644
> > --- a/lib/librte_eal/include/rte_memory.h
> > +++ b/lib/librte_eal/include/rte_memory.h
> > @@ -34,8 +34,14 @@ enum rte_page_sizes {
> >         RTE_PGSIZE_256M  = 1ULL << 28,
> >         RTE_PGSIZE_512M  = 1ULL << 29,
> >         RTE_PGSIZE_1G    = 1ULL << 30,
> > +/* Work around Clang on Windows being limited to 32-bit underlying type. */  
> 
> It does look like "enum rte_page_sizes" is NOT used as an enum anywhere.
> 
> [master][dpdk.org] $ grep -ri "enum rte_page_sizes" lib/
> lib/librte_eal/include/rte_memory.h:enum rte_page_sizes {
> 
> Why not remove this workaround and define all items as #define to
> avoid the ifdef clutter below?
> 
> > +#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)  
> 
> See above.
> 
> >         RTE_PGSIZE_4G    = 1ULL << 32,
> >         RTE_PGSIZE_16G   = 1ULL << 34,
> > +#else
> > +#define RTE_PGSIZE_4G  (1ULL << 32)
> > +#define RTE_PGSIZE_16G (1ULL << 34)
> > +#endif
> >  };
> >
> >  #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
> > --
> > 2.25.1
> >  

This is a public header and removing enum rte_page_sizes will break API.
Moving members out of enum while keeping enum itself might break compilation
because of integer constants being converted to enum (with -Werror).
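
For illustration, the kind of breakage in question (a hypothetical
sketch; whether it actually triggers depends on compiler and flags):

    enum rte_page_sizes sz;

    sz = RTE_PGSIZE_1G; /* OK: value is an enum member */
    /* If RTE_PGSIZE_4G became a macro, the next line would assign a
     * plain integer constant to an enum, which some compilers warn
     * about -- and -Werror turns that warning into an error.
     */
    sz = RTE_PGSIZE_4G;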

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-15 10:32         ` Dmitry Kozlyuk
@ 2020-04-15 10:57           ` Jerin Jacob
  2020-04-15 11:09             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Jerin Jacob @ 2020-04-15 10:57 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov, Ray Kinsella

On Wed, Apr 15, 2020 at 4:02 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> > On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
> > >
> > > Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
> > > Enum rte_page_size has members valued above this limit, which get
> > > wrapped to zero, resulting in compilation error (duplicate values in
> > > enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
> > >
> > > Define these values outside of the enum for Clang on Windows only.
> > > This does not affect runtime, because Windows doesn't run on machines
> > > with 4GiB and 16GiB hugepages.
> > >
> > > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > > ---
> > >  lib/librte_eal/include/rte_memory.h | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
> > > index 1b7c3e5df..3ec673f51 100644
> > > --- a/lib/librte_eal/include/rte_memory.h
> > > +++ b/lib/librte_eal/include/rte_memory.h
> > > @@ -34,8 +34,14 @@ enum rte_page_sizes {
> > >         RTE_PGSIZE_256M  = 1ULL << 28,
> > >         RTE_PGSIZE_512M  = 1ULL << 29,
> > >         RTE_PGSIZE_1G    = 1ULL << 30,
> > > +/* Work around Clang on Windows being limited to 32-bit underlying type. */
> >
> > It does look like "enum rte_page_sizes" is NOT used as an enum anywhere.
> >
> > [master][dpdk.org] $ grep -ri "enum rte_page_sizes" lib/
> > lib/librte_eal/include/rte_memory.h:enum rte_page_sizes {
> >
> > Why not remove this workaround and define all items as #define
> > to avoid the ifdef clutter below.
> >
> > > +#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)
> >
> > See above.
> >
> > >         RTE_PGSIZE_4G    = 1ULL << 32,
> > >         RTE_PGSIZE_16G   = 1ULL << 34,
> > > +#else
> > > +#define RTE_PGSIZE_4G  (1ULL << 32)
> > > +#define RTE_PGSIZE_16G (1ULL << 34)
> > > +#endif
> > >  };
> > >
> > >  #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
> > > --
> > > 2.25.1
> > >
>
> This is a public header and removing enum rte_page_sizes will break API.
> Moving members out of enum while keeping enum itself might break compilation
> because of integer constants being converted to enum (with -Werror).

If none of the public API is using this enum, then I think we may not
need to keep this enum public.
Since it has ULL, I believe in both cases (enum or define) it will be
treated as unsigned long long, i.e. no ABI breakage.
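
A quick compile-time check of the width claim (illustrative only, not
something to commit):

	#include <rte_memory.h>

	/* The 1ULL literal keeps the constant 64-bit wide either way. */
	_Static_assert(sizeof(RTE_PGSIZE_16G) == sizeof(unsigned long long),
		"page size constants must stay 64-bit wide");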


>
> --
> Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-15 10:57           ` Jerin Jacob
@ 2020-04-15 11:09             ` Dmitry Kozlyuk
  2020-04-15 11:17               ` Jerin Jacob
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-15 11:09 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov, Ray Kinsella



> On Wed, Apr 15, 2020 at 4:02 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
> >  
> > > On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:  
> > > >
> > > > Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
> > > > Enum rte_page_size has members valued above this limit, which get
> > > > wrapped to zero, resulting in compilation error (duplicate values in
> > > > enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
> > > >
> > > > Define these values outside of the enum for Clang on Windows only.
> > > > This does not affect runtime, because Windows doesn't run on machines
> > > > with 4GiB and 16GiB hugepages.
> > > >
> > > > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > > > ---
> > > >  lib/librte_eal/include/rte_memory.h | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > >
> > > > diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
> > > > index 1b7c3e5df..3ec673f51 100644
> > > > --- a/lib/librte_eal/include/rte_memory.h
> > > > +++ b/lib/librte_eal/include/rte_memory.h
> > > > @@ -34,8 +34,14 @@ enum rte_page_sizes {
> > > >         RTE_PGSIZE_256M  = 1ULL << 28,
> > > >         RTE_PGSIZE_512M  = 1ULL << 29,
> > > >         RTE_PGSIZE_1G    = 1ULL << 30,
> > > > +/* Work around Clang on Windows being limited to 32-bit underlying type. */  
> > >
> > > It does look like "enum rte_page_sizes" is NOT used as an enum anywhere.
> > >
> > > [master][dpdk.org] $ grep -ri "enum rte_page_sizes" lib/
> > > lib/librte_eal/include/rte_memory.h:enum rte_page_sizes {
> > >
> > > Why not remove this workaround and define all items as #define
> > > to avoid the ifdef clutter below.
> > >  
> > > > +#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)  
> > >
> > > See above.
> > >  
> > > >         RTE_PGSIZE_4G    = 1ULL << 32,
> > > >         RTE_PGSIZE_16G   = 1ULL << 34,
> > > > +#else
> > > > +#define RTE_PGSIZE_4G  (1ULL << 32)
> > > > +#define RTE_PGSIZE_16G (1ULL << 34)
> > > > +#endif
> > > >  };
> > > >
> > > >  #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
> > > > --
> > > > 2.25.1
> > > >  
> >
> > This is a public header and removing enum rte_page_sizes will break API.
> > Moving members out of enum while keeping enum itself might break compilation
> > because of integer constants being converted to enum (with -Werror).  
> 
> If none of the public API is using this enum, then I think we may not
> need to keep this enum public.

Agreed.

> Since it has ULL, I believe in both cases (enum or define) it will be
> treated as unsigned long long, i.e. no ABI breakage.

I was talking about the API only (compile-time compatibility). Getting rid
of the #ifdef and workarounds sounds right; we'll just need a notice in the
release notes.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-15 11:09             ` Dmitry Kozlyuk
@ 2020-04-15 11:17               ` Jerin Jacob
  2020-05-06  5:41                 ` Ray Kinsella
  0 siblings, 1 reply; 218+ messages in thread
From: Jerin Jacob @ 2020-04-15 11:17 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov, Ray Kinsella

On Wed, Apr 15, 2020 at 4:39 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
>
>
> > On Wed, Apr 15, 2020 at 4:02 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
> > >
> > > > On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
> > > > >
> > > > > Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
> > > > > Enum rte_page_size has members valued above this limit, which get
> > > > > wrapped to zero, resulting in compilation error (duplicate values in
> > > > > enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
> > > > >
> > > > > Define these values outside of the enum for Clang on Windows only.
> > > > > This does not affect runtime, because Windows doesn't run on machines
> > > > > with 4GiB and 16GiB hugepages.
> > > > >
> > > > > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > > > > ---
> > > > >  lib/librte_eal/include/rte_memory.h | 6 ++++++
> > > > >  1 file changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
> > > > > index 1b7c3e5df..3ec673f51 100644
> > > > > --- a/lib/librte_eal/include/rte_memory.h
> > > > > +++ b/lib/librte_eal/include/rte_memory.h
> > > > > @@ -34,8 +34,14 @@ enum rte_page_sizes {
> > > > >         RTE_PGSIZE_256M  = 1ULL << 28,
> > > > >         RTE_PGSIZE_512M  = 1ULL << 29,
> > > > >         RTE_PGSIZE_1G    = 1ULL << 30,
> > > > > +/* Work around Clang on Windows being limited to 32-bit underlying type. */
> > > >
> > > > It does look like "enum rte_page_sizes" is NOT used as an enum anywhere.
> > > >
> > > > [master][dpdk.org] $ grep -ri "enum rte_page_sizes" lib/
> > > > lib/librte_eal/include/rte_memory.h:enum rte_page_sizes {
> > > >
> > > > Why not remove this workaround and define all items as #define
> > > > to avoid the ifdef clutter below.
> > > >
> > > > > +#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)
> > > >
> > > > See above.
> > > >
> > > > >         RTE_PGSIZE_4G    = 1ULL << 32,
> > > > >         RTE_PGSIZE_16G   = 1ULL << 34,
> > > > > +#else
> > > > > +#define RTE_PGSIZE_4G  (1ULL << 32)
> > > > > +#define RTE_PGSIZE_16G (1ULL << 34)
> > > > > +#endif
> > > > >  };
> > > > >
> > > > >  #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
> > > > > --
> > > > > 2.25.1
> > > > >
> > >
> > > This is a public header and removing enum rte_page_sizes will break API.
> > > Moving members out of enum while keeping enum itself might break compilation
> > > because of integer constants being converted to enum (with -Werror).
> >
> > If none of the public API is using this enum, then I think we may not
> > need to keep this enum public.
>
> Agreed.
>
> > Since it has ULL, I believe in both cases (enum or define) it will be
> > treated as unsigned long long, i.e. no ABI breakage.
>
> I was talking about the API only (compile-time compatibility). Getting rid
> of the #ifdef and workarounds sounds right; we'll just need a notice in the
> release notes.

Good to check ./devtools/check-abi.sh for any ABI breakage.

>
> --
> Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-14 23:35       ` Ranjit Menon
@ 2020-04-15 15:19         ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-04-15 15:19 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Ranjit Menon

15/04/2020 01:35, Ranjit Menon:
> On 4/14/2020 12:44 PM, Dmitry Kozlyuk wrote:
> > This driver supports Windows EAL memory management by translating
> > current process virtual addresses to physical addresses (IOVA).
> > Standalone virt2phys allows using DPDK without PMD and provides a
> > reference implementation.
> > 
> > Suggested-by: Ranjit Menon <ranjit.menon@intel.com>
> > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> 
> Reviewed-by: Ranjit Menon <ranjit.menon@intel.com>
> Acked-by: Ranjit Menon <ranjit.menon@intel.com>

Applied in dpdk-kmods, thanks.

This is the very first kernel module in this repository:
	http://git.dpdk.org/dpdk-kmods/



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-04-15 21:48       ` Thomas Monjalon
  2020-04-17 12:24       ` Burakov, Anatoly
  1 sibling, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-04-15 21:48 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

Thanks for starting the new directory for the common Unix implementation.

14/04/2020 21:44, Dmitry Kozlyuk:
> --- /dev/null
> +++ b/lib/librte_eal/unix/eal.c

Please take care not to create a new file without adding
an SPDX tag and a copyright owner.

> @@ -0,0 +1,47 @@
> +#include <sys/file.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <rte_errno.h>
> +
> +#include "eal_private.h"
> +
> +int
> +eal_file_truncate(int fd, ssize_t size)
[...]
> +int
> +eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)

+1
Adding new abstractions is the way to go in my opinion.
Thanks



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-04-15 22:17       ` Thomas Monjalon
  2020-04-15 23:32         ` Dmitry Kozlyuk
  2020-04-17 12:43       ` Burakov, Anatoly
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-04-15 22:17 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, bruce.richardson, david.marchand

14/04/2020 21:44, Dmitry Kozlyuk:
> System meory management is implemented differently for POSIX and

meory -> memory

> Windows. Introduce wrapper functions for operations used across DPDK:
> 
> * rte_mem_map()
>   Create memory mapping for a regular file or a page file (swap).
>   This supports mapping to a reserved memory region even on Windows.
> 
> * rte_mem_unmap()
>   Remove mapping created with rte_mem_map().
> 
> * rte_get_page_size()
>   Obtain default system page size.
> 
> * rte_mem_lock()
>   Make arbitrary-sized memory region non-swappable.
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be more safe and
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
[...]
> +/**
> + * Memory reservation flags.
> + */
> +enum eal_mem_reserve_flags {
> +	/**< Reserve hugepages (support may be limited or missing). */
> +	EAL_RESERVE_HUGEPAGES = 1 << 0,
> +	/**< Fail if requested address is not available. */
> +	EAL_RESERVE_EXACT_ADDRESS = 1 << 1
> +};

Maybe more context is needed to understand the meaning of these flags.
[...]
> -eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags);
> +eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
> +	int flags, int mmap_flags);

Is there any change here?

[...]
> + * If @code virt @endcode and @code size @endcode describe a part of the

I am not sure about using @code.
It makes reading from source harder.
Is there a real benefit?

[...]
> +/**
> + * Memory protection flags.
> + */
> +enum rte_mem_prot {
> +	RTE_PROT_READ = 1 << 0,   /**< Read access. */
> +	RTE_PROT_WRITE = 1 << 1,   /**< Write access. */
> +	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
> +};

Alignment of comments would look better :-)

> +
> +/**
> + * Memory mapping additional flags.
> + *
> + * In Linux and FreeBSD, each flag is semantically equivalent
> + * to OS-specific mmap(3) flag with the same or similar name.
> + * In Windows, POSIX and MAP_ANONYMOUS semantics are followed.
> + */

I don't understand this comment.
The flags and meanings are the same no matter the OS, right?

> +enum rte_map_flags {
> +	/** Changes of mapped memory are visible to other processes. */
> +	RTE_MAP_SHARED = 1 << 0,
> +	/** Mapping is not backed by a regular file. */
> +	RTE_MAP_ANONYMOUS = 1 << 1,
> +	/** Copy-on-write mapping, changes are invisible to other processes. */
> +	RTE_MAP_PRIVATE = 1 << 2,
> +	/** Fail if requested address cannot be taken. */
> +	RTE_MAP_FIXED = 1 << 3
> +};
> +
> +/**
> + * OS-independent implementation of POSIX mmap(3)
> + * with MAP_ANONYMOUS Linux/FreeBSD extension.
> + */
> +__rte_experimental
> +void *rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
> +	enum rte_map_flags flags, int fd, size_t offset);
> +
> +/**
> + * OS-independent implementation of POSIX munmap(3).
> + */
> +__rte_experimental
> +int rte_mem_unmap(void *virt, size_t size);
> +
> +/**
> + * Get system page size. This function never failes.

failes -> fails

> + *
> + * @return
> + *   Positive page size in bytes.
> + */
> +__rte_experimental
> +int rte_get_page_size(void);
> +
> +/**
> + * Lock region in physical memory and prevent it from swapping.
> + *
> + * @param virt
> + *   The virtual address.
> + * @param size
> + *   Size of the region.
> + * @return
> + *   0 on success, negative on error.
> + *
> + * @note Implementations may require @p virt and @p size to be multiples
> + *       of system page size.
> + * @see rte_get_page_size()
> + * @see rte_mem_lock_page()
> + */
> +__rte_experimental
> +int rte_mem_lock(const void *virt, size_t size);

[...]
> --- /dev/null
> +++ b/lib/librte_eal/unix/eal_memory.c

License and copyright missing.

> @@ -0,0 +1,113 @@
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <rte_errno.h>
> +#include <rte_log.h>
> +#include <rte_memory.h>
> +
> +#include "eal_private.h"
> +
> +static void *
> +mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
> +	if (virt == MAP_FAILED) {
> +		RTE_LOG(ERR, EAL,

Not sure the log level should be so high.
We could imagine checking a memory map.
What about INFO level?
The real error log will be made by the caller.

> +			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
> +			requested_addr, size, prot, flags, fd, offset,
> +			strerror(errno));
> +		rte_errno = errno;
> +		return NULL;
[...]
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size,
> +	enum eal_mem_reserve_flags flags)
> +{
> +	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
> +
> +#ifdef MAP_HUGETLB
> +	if (flags & EAL_RESERVE_HUGEPAGES)
> +		sys_flags |= MAP_HUGETLB;
> +#endif

If MAP_HUGETLB is not defined and the flags contain EAL_RESERVE_HUGEPAGES,
I think an error should be returned.

> +	if (flags & EAL_RESERVE_EXACT_ADDRESS)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
> +}
[...]
> +int
> +rte_get_page_size(void)
> +{
> +	return getpagesize();
> +}
> +
> +int
> +rte_mem_lock(const void *virt, size_t size)
> +{
> +	return mlock(virt, size);
> +}

Why don't you replace existing code with these new functions?



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-04-15 22:19       ` Thomas Monjalon
  2020-04-17 13:04       ` Burakov, Anatoly
  1 sibling, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-04-15 22:19 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov, Bruce Richardson

14/04/2020 21:44, Dmitry Kozlyuk:
> All supported OS create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  lib/librte_eal/common/eal_common_memory.c | 54 ++++++++++++++++++
>  lib/librte_eal/common/eal_private.h       | 34 ++++++++++++
>  lib/librte_eal/freebsd/eal_memory.c       | 54 +++---------------
>  lib/librte_eal/linux/eal_memory.c         | 68 +++++------------------
>  4 files changed, 110 insertions(+), 100 deletions(-)

Didn't review this change, but thanks for doing such cleanup.




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-15 22:17       ` Thomas Monjalon
@ 2020-04-15 23:32         ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-15 23:32 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, bruce.richardson, david.marchand

Answering the questions. All snipped comments will be fixed in v4.

> [...]
> > +/**
> > + * Memory reservation flags.
> > + */
> > +enum eal_mem_reserve_flags {
> > +	/**< Reserve hugepages (support may be limited or missing). */
> > +	EAL_RESERVE_HUGEPAGES = 1 << 0,
> > +	/**< Fail if requested address is not available. */
> > +	EAL_RESERVE_EXACT_ADDRESS = 1 << 1
> > +};  
> 
> Maybe more context is needed to understand the meaning of these flags.

Will extend the comment in v4. These are basically MAP_HUGETLB and MAP_FIXED.

> [...]
> > -eal_get_virtual_area(void *requested_addr, size_t *size,
> > -		size_t page_sz, int flags, int mmap_flags);
> > +eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
> > +	int flags, int mmap_flags);  
> 
> Is there any change here?

No, will fix this artifact.

> [...]
> > + * If @code virt @endcode and @code size @endcode describe a part of the  
> 
> I am not sure about using @code.
> It makes reading from source harder.
> Is there a real benefit?

It should be either @p or no markup (as in the rest of the comments); @code is
indeed inappropriate.

> > +
> > +/**
> > + * Memory mapping additional flags.
> > + *
> > + * In Linux and FreeBSD, each flag is semantically equivalent
> > + * to OS-specific mmap(3) flag with the same or similar name.
> > + * In Windows, POSIX and MAP_ANONYMOUS semantics are followed.
> > + */  
> 
> I don't understand this comment.
> The flags and meanings are the same no matter the OS, right?

Correct. MAP_ANONYMOUS is not POSIX, so I mentioned it explicitly. I'll try
to come up with better wording.

> > +static void *
> > +mem_map(void *requested_addr, size_t size, int prot, int flags,
> > +	int fd, size_t offset)
> > +{
> > +	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
> > +	if (virt == MAP_FAILED) {
> > +		RTE_LOG(ERR, EAL,  
> 
> Not sure it should be a log level so high.
> We could imagine checking a memory map.
> What about INFO level?
> The real error log will be made by the caller.
> 
> > +			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
> > +			requested_addr, size, prot, flags, fd, offset,
> > +			strerror(errno));
> > +		rte_errno = errno;
> > +		return NULL;  

The same level is used now in the places from which this code is extracted:
lib/librte_eal/common/{eal_common_fbarray.c:97,eal_common_memory:131}, see
also lib/librte_pci/rte_pci.c:144. To my understanding, DEBUG is used to log
implementation-specific details like these OS API calls, so I'll change the
level to that.

> [...]
> > +int
> > +rte_get_page_size(void)
> > +{
> > +	return getpagesize();
> > +}
> > +
> > +int
> > +rte_mem_lock(const void *virt, size_t size)
> > +{
> > +	return mlock(virt, size);
> > +}  
> 
> Why don't you replace existing code with these new functions?

In this patchset I tried to touch existing code as little as possible, at
least I'd like to limit the scope to EAL. Libraries and drivers using Unix
functions directly will fail to compile when enabled on Windows, but patches
will be trivial. I propose replacing calls in EAL in v4.
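
For example, a direct call like

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

would become (a sketch; note the wrapper reports failure with NULL and
rte_errno instead of MAP_FAILED):

	addr = rte_mem_map(NULL, size, RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
	if (addr == NULL)
		return -1;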

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-04-15  9:42       ` Jerin Jacob
@ 2020-04-16 18:34       ` Ranjit Menon
  2020-04-23  1:00         ` Dmitry Kozlyuk
  1 sibling, 1 reply; 218+ messages in thread
From: Ranjit Menon @ 2020-04-16 18:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona, Kadam,
	Pallavi, Mcnamara, John, Kovacevic, Marko, Burakov, Anatoly,
	Richardson, Bruce

On 4/14/2020 12:44 PM, Dmitry Kozlyuk wrote:
> Basic memory management supports core libraries and PMDs operating in
> IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
> IOVAs of hugepages allocated from user-mode.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<Snip!>

> diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
> new file mode 100644
> index 000000000..4bb2b4aaf
> --- /dev/null
> +++ b/lib/librte_eal/windows/include/rte_virt2phys.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +/**
> + * @file virt2phys driver interface
> + */
> +
> +/**
> + * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
> + */
> +DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
> +	0x539c2135, 0x793a, 0x4926,
> +	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
> +
> +/**
> + * Driver device type for IO control codes.
> + */
> +#define VIRT2PHYS_DEVTYPE 0x8000
> +
> +/**
> + * Translate a valid non-paged virtual address to a physical address.
> + *
> + * Note: A physical address zero (0) is reported if input address
> + * is paged out or not mapped. However, if input is a valid mapping
> + * of I/O port 0x0000, output is also zero. There is no way
> + * to distinguish between these cases by return value only.
> + *
> + * Input: a non-paged virtual address (PVOID).
> + *
> + * Output: the corresponding physical address (LARGE_INTEGER).
> + */
> +#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
> +	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

This file is a duplicate of: <kmods>: windows/virt2phys/virt2phys.h
This is by design, since it documents the driver interface.

So, to prevent the two files going out-of-sync, should we:
1. Make the two filenames the same?
2. Add comments to both files referencing each other and the need to 
keep them both in sync?
3. Do both (1) and (2)?

This will also be an issue for the upcoming Windows netuio kernel driver 
and I reckon this could be an issue for Linux kernel modules too.

Thoughts?

ranjit m.

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
  2020-04-15 21:48       ` Thomas Monjalon
@ 2020-04-17 12:24       ` Burakov, Anatoly
  2020-04-28 23:50         ` Dmitry Kozlyuk
  1 sibling, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-04-17 12:24 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

On 14-Apr-20 8:44 PM, Dmitry Kozlyuk wrote:
> EAL common code uses file locking and truncation. Introduce
> OS-independent wrappers in order to support both Linux/FreeBSD
> and Windows:
> 
> * eal_file_lock: lock or unlock an open file.
> * eal_file_truncate: enforce a given size for an open file.
> 
> Wrappers follow POSIX semantics, but interface is not POSIX,
> so that it can be made more clean, e.g. by not mixing locking
> operation and behaviour on conflict.
> 
> Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> which is intended for common code between the two. Files should be named
> after the ones from which the code is factored in OS subdirectory.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>   lib/librte_eal/common/eal_private.h | 45 ++++++++++++++++
>   lib/librte_eal/meson.build          |  4 ++
>   lib/librte_eal/unix/eal.c           | 47 ++++++++++++++++
>   lib/librte_eal/unix/meson.build     |  6 +++
>   lib/librte_eal/windows/eal.c        | 83 +++++++++++++++++++++++++++++
>   5 files changed, 185 insertions(+)
>   create mode 100644 lib/librte_eal/unix/eal.c
>   create mode 100644 lib/librte_eal/unix/meson.build
> 
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index ddcfbe2e4..65d61ff13 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -443,4 +443,49 @@ rte_option_usage(void);
>   uint64_t
>   eal_get_baseaddr(void);
>   
> +/** File locking operation. */
> +enum eal_flock_op {
> +	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
> +	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
> +	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
> +};
> +
> +/** Behavior on file locking conflict. */
> +enum eal_flock_mode {
> +	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
> +	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
> +};

Nitpicking, but why not blocking/unblocking? The terminology seems 
pretty standard.


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
  2020-04-15 22:17       ` Thomas Monjalon
@ 2020-04-17 12:43       ` Burakov, Anatoly
  2020-04-20  5:59       ` Tal Shnaiderman
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-04-17 12:43 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

On 14-Apr-20 8:44 PM, Dmitry Kozlyuk wrote:
> System meory management is implemented differently for POSIX and
> Windows. Introduce wrapper functions for operations used across DPDK:
> 
> * rte_mem_map()
>    Create memory mapping for a regular file or a page file (swap).
>    This supports mapping to a reserved memory region even on Windows.
> 
> * rte_mem_unmap()
>    Remove mapping created with rte_mem_map().
> 
> * rte_get_page_size()
>    Obtain default system page size.
> 
> * rte_mem_lock()
>    Make arbitrary-sized memory region non-swappable.
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be more safe and
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

> +/**
> + * Memory reservation flags.
> + */
> +enum eal_mem_reserve_flags {
> +	/**< Reserve hugepages (support may be limited or missing). */
> +	EAL_RESERVE_HUGEPAGES = 1 << 0,
> +	/**< Fail if requested address is not available. */
> +	EAL_RESERVE_EXACT_ADDRESS = 1 << 1

I *really* don't like this terminology.

In Linux et al., MAP_FIXED is not just "reserve at this exact address". 
MAP_FIXED is actually fairly dangerous if you don't know what you're 
doing, because it will unconditionally unmap any previously mapped 
memory. Also, to my knowledge, a call to MAP_FIXED cannot fail unless 
something went very wrong - it will *not* "fail if requested address is 
not available". We basically use MAP_FIXED because we have already 
mapped that area with MAP_ANONYMOUS previously, so we can guarantee that 
it's safe to call MAP_FIXED.

I would greatly prefer if this was named to better reflect the above. 
EAL_FORCE_RESERVE perhaps? The comment also needs to be adjusted.
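
To spell out the reserve-then-commit pattern I mean (a minimal sketch of
how we use it, not the actual EAL code):

	#include <sys/mman.h>

	static void *
	reserve_then_commit(size_t len)
	{
		/* Reserve address space; this call unmaps nothing. */
		void *base = mmap(NULL, len, PROT_NONE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return NULL;
		/* MAP_FIXED is safe only because [base, base + len) is our
		 * own anonymous reservation; on an arbitrary address it
		 * would silently replace whatever mapping was there. */
		return mmap(base, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
				-1, 0);
	}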

> +};
> +
>   /**
>    * Get virtual area of specified size from the OS.
>    *
> @@ -232,8 +243,8 @@ int rte_eal_check_module(const char *module_name);
>   #define EAL_VIRTUAL_AREA_UNMAP (1 << 2)
>   /**< immediately unmap reserved virtual area. */
>   void *
> -eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags);
> +eal_get_virtual_area(void *requested_addr, size_t *size, size_t page_sz,
> +	int flags, int mmap_flags);
>   
>   /**

<snip>

>   
> +/**
> + * Reserve a region of virtual memory.
> + *
> + * Use eal_mem_free() to free reserved memory.
> + *
> + * @param requested_addr
> + *  A desired reservation address. The system may not respect it.
> + *  NULL means the address will be chosen by the system.
> + * @param size
> + *  Reservation size. Must be a multiple of system page size.
> + * @param flags
> + *  Reservation options.
> + * @returns
> + *  Starting address of the reserved area on success, NULL on failure.
> + *  Callers must not access this memory until remapping it.
> + */
> +void *eal_mem_reserve(void *requested_addr, size_t size,
> +	enum eal_mem_reserve_flags flags);

This seems fairly suspect to me. I know that technically an enum is an int,
but semantically, IIRC, an enum object should hold exactly one of its
enumerators - you can't use an enum type like a set of flags.
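
To illustrate with a hypothetical caller:

	/* The combined value 3 is not any enumerator of the type. */
	enum eal_mem_reserve_flags f =
		EAL_RESERVE_HUGEPAGES | EAL_RESERVE_EXACT_ADDRESS;

C happens to accept this, but C++ rejects the implicit int-to-enum
conversion and static analyzers flag it; a plain int (or a fixed-width
unsigned type) is the usual parameter type for flag sets.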

> +
> +/**
> + * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
> + *
> + * If @code virt @endcode and @code size @endcode describe a part of the
> + * reserved region, only this part of the region is freed (accurately
> + * up to the system page size). If @code virt @endcode points to allocated
> + * memory, @code size @endcode must match the one specified on allocation.
> + * The behavior is undefined if the memory pointed by @code virt @endcode
> + * is obtained from another source than listed above.
> + *
> + * @param virt

<snip>

> +/**
> + * Memory mapping additional flags.
> + *
> + * In Linux and FreeBSD, each flag is semantically equivalent
> + * to OS-specific mmap(3) flag with the same or similar name.
> + * In Windows, POSIX and MAP_ANONYMOUS semantics are followed.
> + */
> +enum rte_map_flags {
> +	/** Changes of mapped memory are visible to other processes. */
> +	RTE_MAP_SHARED = 1 << 0,
> +	/** Mapping is not backed by a regular file. */
> +	RTE_MAP_ANONYMOUS = 1 << 1,
> +	/** Copy-on-write mapping, changes are invisible to other processes. */
> +	RTE_MAP_PRIVATE = 1 << 2,
> +	/** Fail if requested address cannot be taken. */
> +	RTE_MAP_FIXED = 1 << 3

Again, MAP_FIXED does not behave the way you describe. See above comments.

> +};
> +
> +/**
> + * OS-independent implementation of POSIX mmap(3)
> + * with MAP_ANONYMOUS Linux/FreeBSD extension.
> + */
> +__rte_experimental
> +void *rte_mem_map(void *requested_addr, size_t size, enum rte_mem_prot prot,
> +	enum rte_map_flags flags, int fd, size_t offset);
> +
> +/**
> + * OS-independent implementation of POSIX munmap(3).
> + */
> +__rte_experimental
> +int rte_mem_unmap(void *virt, size_t size);
> +
> +/**
> + * Get system page size. This function never failes.
> + *
> + * @return
> + *   Positive page size in bytes.
> + */
> +__rte_experimental
> +int rte_get_page_size(void);

uint32_t? or maybe uint64_t?

> +
> +/**
> + * Lock region in physical memory and prevent it from swapping.
> + *
> + * @param virt
> + *   The virtual address.
> + * @param size
> + *   Size of the region.
> + * @return
> + *   0 on success, negative on error.
> + *
> + * @note Implementations may require @p virt and @p size to be multiples
> + *       of system page size.
> + * @see rte_get_page_size()
> + * @see rte_mem_lock_page()
> + */
> +__rte_experimental
> +int rte_mem_lock(const void *virt, size_t size);
> +
>   /**
-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
  2020-04-15 22:19       ` Thomas Monjalon
@ 2020-04-17 13:04       ` Burakov, Anatoly
  1 sibling, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-04-17 13:04 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

On 14-Apr-20 8:44 PM, Dmitry Kozlyuk wrote:
> All supported OS create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>   lib/librte_eal/common/eal_common_memory.c | 54 ++++++++++++++++++
>   lib/librte_eal/common/eal_private.h       | 34 ++++++++++++
>   lib/librte_eal/freebsd/eal_memory.c       | 54 +++---------------
>   lib/librte_eal/linux/eal_memory.c         | 68 +++++------------------
>   4 files changed, 110 insertions(+), 100 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
> index cc7d54e0c..d9764681a 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -25,6 +25,7 @@
>   #include "eal_private.h"
>   #include "eal_internal_cfg.h"
>   #include "eal_memcfg.h"
> +#include "eal_options.h"
>   #include "malloc_heap.h"
>   
>   /*
> @@ -182,6 +183,59 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>   	return aligned_addr;
>   }
>   
> +int
> +eal_reserve_memseg_list(struct rte_memseg_list *msl,
> +		enum eal_mem_reserve_flags flags)

This and other similar places in this and other patches: I don't think 
using enums for this purpose (i.e. to hold multiple values at once) is 
good practice. I would suggest replacing them with int.

Also, I don't think "eal_reserve_memseg_list" is a particularly good or 
descriptive name (and neither is "eal_alloc_memseg_list" for that 
matter). Suggestion: "eal_memseg_list_create" (or "_init") and 
"eal_memseg_list_alloc"?

> +{
> +	uint64_t page_sz;
> +	size_t mem_sz;
> +	void *addr;
> +
> +	page_sz = msl->page_sz;
> +	mem_sz = page_sz * msl->memseg_arr.len;
> +
> +	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
> +	if (addr == NULL) {
> +		if (rte_errno == EADDRNOTAVAIL)
> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> +				(unsigned long long)mem_sz, msl->base_va);
> +		else
> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> +		return -1;
> +	}
> +	msl->base_va = addr;
> +	msl->len = mem_sz;
> +
> +	return 0;
> +}
> +
> +int
> +eal_alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
> +		int n_segs, int socket_id, int type_msl_idx, bool heap)
> +{
> +	char name[RTE_FBARRAY_NAME_LEN];
> +
> +	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
> +		 type_msl_idx);
> +	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
> +			sizeof(struct rte_memseg))) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
> +			rte_strerror(rte_errno));
> +		return -1;
> +	}
> +
> +	msl->page_sz = page_sz;
> +	msl->socket_id = socket_id;
> +	msl->base_va = NULL;

It probably needs to be documented that eal_alloc_memseg_list must be 
called before eal_reserve_memseg_list.
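
E.g. the expected call order, as I read the patch (a sketch only; the
msl lookup is mine):

	struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx];

	if (eal_alloc_memseg_list(msl, page_sz, n_segs, socket_id,
			type_msl_idx, heap) < 0)
		return -1;
	/* base_va stays NULL until the VA space is reserved: */
	if (eal_reserve_memseg_list(msl, 0) < 0)
		return -1;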

> +	msl->heap = heap;
> +
> +	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
> +			(size_t)page_sz >> 10, socket_id);
> +
> +	return 0;
> +}
> +
>   static struct rte_memseg *
>   virt2memseg(const void *addr, const struct rte_memseg_list *msl)
>   {

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
  2020-04-15 22:17       ` Thomas Monjalon
  2020-04-17 12:43       ` Burakov, Anatoly
@ 2020-04-20  5:59       ` Tal Shnaiderman
  2020-04-21 23:36         ` Dmitry Kozlyuk
  2020-04-22  0:55       ` Ranjit Menon
  2020-04-22  2:07       ` Ranjit Menon
  4 siblings, 1 reply; 218+ messages in thread
From: Tal Shnaiderman @ 2020-04-20  5:59 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Thomas Monjalon,
	Anatoly Burakov, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

> System meory management is implemented differently for POSIX and
> Windows. Introduce wrapper functions for operations used across DPDK:
> 
> * rte_mem_map()
>   Create memory mapping for a regular file or a page file (swap).
>   This supports mapping to a reserved memory region even on Windows.
> 
> * rte_mem_unmap()
>   Remove mapping created with rte_mem_map().
> 
> * rte_get_page_size()
>   Obtain default system page size.
> 
> * rte_mem_lock()
>   Make arbitrary-sized memory region non-swappable.
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their signatures
> deliberately differ from POSIX ones to be more safe and expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  config/meson.build                   |  10 +-
>  lib/librte_eal/common/eal_private.h  |  51 +++-
> lib/librte_eal/include/rte_memory.h  |  68 +++++
>  lib/librte_eal/rte_eal_exports.def   |   4 +
>  lib/librte_eal/rte_eal_version.map   |   4 +
>  lib/librte_eal/unix/eal_memory.c     | 113 +++++++
>  lib/librte_eal/unix/meson.build      |   1 +
>  lib/librte_eal/windows/eal.c         |   6 +
>  lib/librte_eal/windows/eal_memory.c  | 437 +++++++++++++++++++++++++++
>  lib/librte_eal/windows/eal_windows.h |

<Snip!>

> +eal_mem_win32api_init(void)
> +{
> +	static const char library_name[] = "kernelbase.dll";
> +	static const char function[] = "VirtualAlloc2";
> +
> +	OSVERSIONINFO info;
> +	HMODULE library = NULL;
> +	int ret = 0;
> +
> +	/* Already done. */
> +	if (VirtualAlloc2 != NULL)
> +		return 0;
> +
> +	/* IsWindows10OrGreater() may also be unavailable. */
> +	memset(&info, 0, sizeof(info));
> +	info.dwOSVersionInfoSize = sizeof(info);
> +	GetVersionEx(&info);

I'd remove the GetVersionEx check entirely and add the comments regarding
the OS dependency to the RTE_LOG of the LoadLibraryA failure below;
GetVersionEx returns the Windows 8 OS version on newer servers.

Also, it looks like not all Windows Server 2016 versions support
VirtualAlloc2: I'm using Microsoft Windows Server 2016 Datacenter, version
10.0.14393, and the lookup of VirtualAlloc2 after LoadLibraryA failed.
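
For reference, the kind of probe I have in mind (a sketch; the exact log
wording is just an example):

	library = LoadLibraryA(library_name);
	if (library == NULL) {
		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
		return -1;
	}
	/* Probe for the function instead of guessing from the version. */
	if (GetProcAddress(library, function) == NULL) {
		RTE_LOG(ERR, EAL, "%s is not available in %s;"
			" it is required for advanced memory features\n",
			function, library_name);
		ret = -1;
	}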

> +	/* Checking for Windows 10+ will also detect Windows Server 2016+.
> +	 * Do not abort, because Windows may report false version depending
> +	 * on executable manifest, compatibility mode, etc.
> +	 */
> +	if (info.dwMajorVersion < 10)
> +		RTE_LOG(DEBUG, EAL, "Windows 10+ or Windows Server 2016+ "
> +			"is required for advanced memory features\n");
> +
> +	library = LoadLibraryA(library_name);
> +	if (library == NULL) {
> +		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
> +		return -1;
> +	}
> +


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
  2020-04-14 23:35       ` Ranjit Menon
@ 2020-04-21  6:23       ` Ophir Munk
  1 sibling, 0 replies; 218+ messages in thread
From: Ophir Munk @ 2020-04-21  6:23 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Ranjit Menon, Thomas Monjalon

Hi Dmitry,
When trying to test the series under Linux (Red Hat 7.4,
x86_64-native-linuxapp-gcc) using make, I get compilation errors.
It seems that there are calls from the common code which are not compiled
under Linux.

x86_64-native-linuxapp-gcc/lib/librte_eal.a(eal_common_memory.o): In function `eal_get_virtual_area':
eal_common_memory.c:(.text+0x27e): undefined reference to `eal_mem_reserve'

[dpdk-branches]$  find lib -name '*.c' | xargs grep eal_mem_reserve
lib/librte_eal/common/eal_common_memory.c:              mapped_addr = eal_mem_reserve(
lib/librte_eal/windows/eal_memory.c:eal_mem_reserve(void *requested_addr, size_t size,
lib/librte_eal/unix/eal_memory.c:eal_mem_reserve(void *requested_addr, size_t size,

Does it succeed for you?
If not, can you please resend the series such that it compiles under Linux
with the make build?

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Dmitry Kozlyuk
> Sent: Tuesday, April 14, 2020 10:44 PM
> To: dev@dpdk.org
> Cc: Dmitry Malloy (MESHCHANINOV) <dmitrym@microsoft.com>; Narcisa
> Ana Maria Vasile <Narcisa.Vasile@microsoft.com>; Fady Bader
> <fady@mellanox.com>; Tal Shnaiderman <talshn@mellanox.com>; Dmitry
> Kozlyuk <dmitry.kozliuk@gmail.com>; Ranjit Menon
> <ranjit.menon@intel.com>
> Subject: [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address
> translator for Windows
> 
> This driver supports Windows EAL memory management by translating
> current process virtual addresses to physical addresses (IOVA).
> Standalone virt2phys allows using DPDK without PMD and provides a
> reference implementation.
> 
> Suggested-by: Ranjit Menon <ranjit.menon@intel.com>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  windows/README.rst                          | 103 +++++++++
>  windows/virt2phys/virt2phys.c               | 129 +++++++++++
>  windows/virt2phys/virt2phys.h               |  34 +++
>  windows/virt2phys/virt2phys.inf             |  64 ++++++
>  windows/virt2phys/virt2phys.sln             |  27 +++
>  windows/virt2phys/virt2phys.vcxproj         | 228 ++++++++++++++++++++
>  windows/virt2phys/virt2phys.vcxproj.filters |  36 ++++
>  7 files changed, 621 insertions(+)
>  create mode 100644 windows/README.rst
>  create mode 100644 windows/virt2phys/virt2phys.c  create mode 100644
> windows/virt2phys/virt2phys.h  create mode 100644
> windows/virt2phys/virt2phys.inf  create mode 100644
> windows/virt2phys/virt2phys.sln  create mode 100644
> windows/virt2phys/virt2phys.vcxproj
>  create mode 100644 windows/virt2phys/virt2phys.vcxproj.filters
> 
> diff --git a/windows/README.rst b/windows/README.rst
> new file mode 100644
> index 0000000..45a1d80
> --- /dev/null
> +++ b/windows/README.rst
> @@ -0,0 +1,103 @@
> +Developing Windows Drivers
> +==========================
> +
> +Prerequisites
> +-------------
> +
> +Building Windows Drivers is only possible on Windows.
> +
> +1. Visual Studio 2019 Community or Professional Edition
> +2. Windows Driver Kit (WDK) for Windows 10, version 1903
> +
> +Follow the official instructions to obtain all of the above:
> +https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
> +
> +
> +Build the Drivers
> +-----------------
> +
> +Build from Visual Studio
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Open a solution (``*.sln``) with Visual Studio and build it (Ctrl+Shift+B).
> +
> +
> +Build from Command-Line
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Run *Developer Command Prompt for VS 2019* from the Start menu.
> +
> +Navigate to the solution directory (with ``*.sln``), then run:
> +
> +.. code-block:: console
> +
> +    msbuild
> +
> +To build a particular combination of configuration and platform:
> +
> +.. code-block:: console
> +
> +    msbuild -p:Configuration=Debug;Platform=x64
> +
> +
> +Install the Drivers
> +-------------------
> +
> +Disable Driver Signature Enforcement
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +By default Windows prohibits installing and loading drivers without
> +`digital signature`_ obtained from Microsoft. For development signature
> +enforcement may be disabled as follows.
> +
> +In Elevated Command Prompt (from this point, sufficient privileges are
> +assumed):
> +
> +.. code-block:: console
> +
> +    bcdedit -set loadoptions DISABLE_INTEGRITY_CHECKS
> +    bcdedit -set TESTSIGNING ON
> +    shutdown -r -t 0
> +
> +Upon reboot, an overlay message should appear on the desktop informing
> +that Windows is in test mode, which means it allows loading unsigned
> +drivers.
> +
> +.. _digital signature:
> +https://docs.microsoft.com/en-us/windows-hardware/drivers/install/driver-signing
> +
> +Install, List, and Remove Drivers
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Driver package is by default located in a subdirectory of its source
> +tree, e.g. ``x64\Debug\virt2phys\virt2phys`` (note two levels of
> +``virt2phys``).
> +
> +To install the driver and bind associated devices to it:
> +
> +.. code-block:: console
> +
> +    pnputil /add-driver x64\Debug\virt2phys\virt2phys\virt2phys.inf /install
> +
> +A graphical confirmation to load an unsigned driver will still appear.
> +
> +On Windows Server additional steps are required if the driver uses a
> +custom setup class:
> +
> +1. From "Device Manager", "Action" menu, select "Add legacy hardware".
> +2. It will launch the "Add Hardware Wizard". Click "Next".
> +3. Select second option "Install the hardware that I manually select
> +   from a list (Advanced)".
> +4. On the next screen, locate the driver device class.
> +5. Select it, and click "Next".
> +6. The previously installed drivers will now be installed for
> +   the appropriate devices (software devices will be created).
> +
> +To list installed drivers:
> +
> +.. code-block:: console
> +
> +    pnputil /enum-drivers
> +
> +To remove the driver package and to uninstall its devices:
> +
> +.. code-block:: console
> +
> +    pnputil /delete-driver oem2.inf /uninstall
> diff --git a/windows/virt2phys/virt2phys.c b/windows/virt2phys/virt2phys.c
> new file mode 100644
> index 0000000..e157e9c
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.c
> @@ -0,0 +1,129 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <ntddk.h>
> +#include <wdf.h>
> +#include <wdmsec.h>
> +#include <initguid.h>
> +
> +#include "virt2phys.h"
> +
> +DRIVER_INITIALIZE DriverEntry;
> +EVT_WDF_DRIVER_DEVICE_ADD virt2phys_driver_EvtDeviceAdd;
> +EVT_WDF_IO_IN_CALLER_CONTEXT virt2phys_device_EvtIoInCallerContext;
> +
> +NTSTATUS
> +DriverEntry(
> +	IN PDRIVER_OBJECT driver_object, IN PUNICODE_STRING registry_path)
> +{
> +	WDF_DRIVER_CONFIG config;
> +	WDF_OBJECT_ATTRIBUTES attributes;
> +	NTSTATUS status;
> +
> +	PAGED_CODE();
> +
> +	WDF_DRIVER_CONFIG_INIT(&config, virt2phys_driver_EvtDeviceAdd);
> +	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
> +	status = WdfDriverCreate(
> +			driver_object, registry_path,
> +			&attributes, &config, WDF_NO_HANDLE);
> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfDriverCreate() failed, status=%08x\n", status));
> +	}
> +
> +	return status;
> +}
> +
> +_Use_decl_annotations_
> +NTSTATUS
> +virt2phys_driver_EvtDeviceAdd(
> +	WDFDRIVER driver, PWDFDEVICE_INIT init)
> +{
> +	WDF_OBJECT_ATTRIBUTES attributes;
> +	WDFDEVICE device;
> +	NTSTATUS status;
> +
> +	UNREFERENCED_PARAMETER(driver);
> +
> +	PAGED_CODE();
> +
> +	WdfDeviceInitSetIoType(
> +		init, WdfDeviceIoNeither);
> +	WdfDeviceInitSetIoInCallerContextCallback(
> +		init, virt2phys_device_EvtIoInCallerContext);
> +
> +	WDF_OBJECT_ATTRIBUTES_INIT(&attributes);
> +
> +	status = WdfDeviceCreate(&init, &attributes, &device);
> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfDeviceCreate() failed, status=%08x\n", status));
> +		return status;
> +	}
> +
> +	status = WdfDeviceCreateDeviceInterface(
> +			device, &GUID_DEVINTERFACE_VIRT2PHYS, NULL);
> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfDeviceCreateDeviceInterface() failed, "
> +			"status=%08x\n", status));
> +		return status;
> +	}
> +
> +	return STATUS_SUCCESS;
> +}
> +
> +_Use_decl_annotations_
> +VOID
> +virt2phys_device_EvtIoInCallerContext(
> +	IN WDFDEVICE device, IN WDFREQUEST request)
> +{
> +	WDF_REQUEST_PARAMETERS params;
> +	ULONG code;
> +	PVOID *virt;
> +	PHYSICAL_ADDRESS *phys;
> +	size_t size;
> +	NTSTATUS status;
> +
> +	UNREFERENCED_PARAMETER(device);
> +
> +	PAGED_CODE();
> +
> +	WDF_REQUEST_PARAMETERS_INIT(&params);
> +	WdfRequestGetParameters(request, &params);
> +
> +	if (params.Type != WdfRequestTypeDeviceControl) {
> +		KdPrint(("bogus request type=%u\n", params.Type));
> +		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
> +		return;
> +	}
> +
> +	code = params.Parameters.DeviceIoControl.IoControlCode;
> +	if (code != IOCTL_VIRT2PHYS_TRANSLATE) {
> +		KdPrint(("bogus IO control code=%lu\n", code));
> +		WdfRequestComplete(request, STATUS_NOT_SUPPORTED);
> +		return;
> +	}
> +
> +	status = WdfRequestRetrieveInputBuffer(
> +			request, sizeof(*virt), (PVOID *)&virt, &size);
> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfRequestRetrieveInputBuffer() failed, "
> +			"status=%08x\n", status));
> +		WdfRequestComplete(request, status);
> +		return;
> +	}
> +
> +	status = WdfRequestRetrieveOutputBuffer(
> +		request, sizeof(*phys), (PVOID *)&phys, &size);
> +	if (!NT_SUCCESS(status)) {
> +		KdPrint(("WdfRequestRetrieveOutputBuffer() failed, "
> +			"status=%08x\n", status));
> +		WdfRequestComplete(request, status);
> +		return;
> +	}
> +
> +	*phys = MmGetPhysicalAddress(*virt);
> +
> +	WdfRequestCompleteWithInformation(
> +		request, STATUS_SUCCESS, sizeof(*phys));
> +}
> diff --git a/windows/virt2phys/virt2phys.h b/windows/virt2phys/virt2phys.h
> new file mode 100644
> index 0000000..4bb2b4a
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +/**
> + * @file virt2phys driver interface
> + */
> +
> +/**
> + * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
> + */
> +DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
> +	0x539c2135, 0x793a, 0x4926,
> +	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
> +
> +/**
> + * Driver device type for IO control codes.
> + */
> +#define VIRT2PHYS_DEVTYPE 0x8000
> +
> +/**
> + * Translate a valid non-paged virtual address to a physical address.
> + *
> + * Note: A physical address zero (0) is reported if input address
> + * is paged out or not mapped. However, if input is a valid mapping
> + * of I/O port 0x0000, output is also zero. There is no way
> + * to distinguish between these cases by return value only.
> + *
> + * Input: a non-paged virtual address (PVOID).
> + *
> + * Output: the corresponding physical address (LARGE_INTEGER).
> + */
> +#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
> +	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
> diff --git a/windows/virt2phys/virt2phys.inf b/windows/virt2phys/virt2phys.inf
> new file mode 100644
> index 0000000..e35765e
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.inf
> @@ -0,0 +1,64 @@
> +; SPDX-License-Identifier: BSD-3-Clause
> +; Copyright (c) 2020 Dmitry Kozlyuk
> +
> +[Version]
> +Signature = "$WINDOWS NT$"
> +Class = %ClassName%
> +ClassGuid = {78A1C341-4539-11d3-B88D-00C04FAD5171}
> +Provider = %ManufacturerName%
> +CatalogFile = virt2phys.cat
> +DriverVer =
> +
> +[DestinationDirs]
> +DefaultDestDir = 12
> +
> +; ================= Class section =====================
> +
> +[ClassInstall32]
> +Addreg = virt2phys_ClassReg
> +
> +[virt2phys_ClassReg]
> +HKR,,,0,%ClassName%
> +HKR,,Icon,,-5
> +
> +[SourceDisksNames]
> +1 = %DiskName%,,,""
> +
> +[SourceDisksFiles]
> +virt2phys.sys  = 1,,
> +
> +;*****************************************
> +; Install Section
> +;*****************************************
> +
> +[Manufacturer]
> +%ManufacturerName%=Standard,NT$ARCH$
> +
> +[Standard.NT$ARCH$]
> +%virt2phys.DeviceDesc%=virt2phys_Device, Root\virt2phys
> +
> +[virt2phys_Device.NT]
> +CopyFiles = Drivers_Dir
> +
> +[Drivers_Dir]
> +virt2phys.sys
> +
> +;-------------- Service installation
> +[virt2phys_Device.NT.Services]
> +AddService = virt2phys,%SPSVCINST_ASSOCSERVICE%, virt2phys_Service_Inst
> +
> +; -------------- virt2phys driver install sections
> +[virt2phys_Service_Inst]
> +DisplayName    = %virt2phys.SVCDESC%
> +ServiceType    = 1 ; SERVICE_KERNEL_DRIVER
> +StartType      = 3 ; SERVICE_DEMAND_START
> +ErrorControl   = 1 ; SERVICE_ERROR_NORMAL
> +ServiceBinary  = %12%\virt2phys.sys
> +
> +[Strings]
> +SPSVCINST_ASSOCSERVICE = 0x00000002
> +ManufacturerName = "Dmitry Kozlyuk"
> +ClassName = "Kernel bypass"
> +DiskName = "virt2phys Installation Disk"
> +virt2phys.DeviceDesc = "Virtual to physical address translator"
> +virt2phys.SVCDESC = "virt2phys Service"
> diff --git a/windows/virt2phys/virt2phys.sln b/windows/virt2phys/virt2phys.sln
> new file mode 100644
> index 0000000..0f5ecdc
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.sln
> @@ -0,0 +1,27 @@
> +
> +Microsoft Visual Studio Solution File, Format Version 12.00
> +# Visual Studio Version 16
> +VisualStudioVersion = 16.0.29613.14
> +MinimumVisualStudioVersion = 10.0.40219.1
> +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "virt2phys", "virt2phys.vcxproj", "{0EEF826B-9391-43A8-A722-BDD6F6115137}"
> +EndProject
> +Global
> +	GlobalSection(SolutionConfigurationPlatforms) = preSolution
> +		Debug|x64 = Debug|x64
> +		Release|x64 = Release|x64
> +	EndGlobalSection
> +	GlobalSection(ProjectConfigurationPlatforms) = postSolution
> +		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.ActiveCfg = Debug|x64
> +		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Build.0 = Debug|x64
> +		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Debug|x64.Deploy.0 = Debug|x64
> +		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.ActiveCfg = Release|x64
> +		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Build.0 = Release|x64
> +		{0EEF826B-9391-43A8-A722-BDD6F6115137}.Release|x64.Deploy.0 = Release|x64
> +	EndGlobalSection
> +	GlobalSection(SolutionProperties) = preSolution
> +		HideSolutionNode = FALSE
> +	EndGlobalSection
> +	GlobalSection(ExtensibilityGlobals) = postSolution
> +		SolutionGuid = {845012FB-4471-4A12-A1C4-FF7E05C40E8E}
> +	EndGlobalSection
> +EndGlobal
> diff --git a/windows/virt2phys/virt2phys.vcxproj b/windows/virt2phys/virt2phys.vcxproj
> new file mode 100644
> index 0000000..fa51916
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.vcxproj
> @@ -0,0 +1,228 @@
> +<?xml version="1.0" encoding="utf-8"?>
> +<Project DefaultTargets="Build" ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
> +  <ItemGroup Label="ProjectConfigurations">
> +    <ProjectConfiguration Include="Debug|Win32">
> +      <Configuration>Debug</Configuration>
> +      <Platform>Win32</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Release|Win32">
> +      <Configuration>Release</Configuration>
> +      <Platform>Win32</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Debug|x64">
> +      <Configuration>Debug</Configuration>
> +      <Platform>x64</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Release|x64">
> +      <Configuration>Release</Configuration>
> +      <Platform>x64</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Debug|ARM">
> +      <Configuration>Debug</Configuration>
> +      <Platform>ARM</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Release|ARM">
> +      <Configuration>Release</Configuration>
> +      <Platform>ARM</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Debug|ARM64">
> +      <Configuration>Debug</Configuration>
> +      <Platform>ARM64</Platform>
> +    </ProjectConfiguration>
> +    <ProjectConfiguration Include="Release|ARM64">
> +      <Configuration>Release</Configuration>
> +      <Platform>ARM64</Platform>
> +    </ProjectConfiguration>
> +  </ItemGroup>
> +  <ItemGroup>
> +    <ClCompile Include="virt2phys.c" />
> +  </ItemGroup>
> +  <ItemGroup>
> +    <ClInclude Include="virt2phys.h" />
> +  </ItemGroup>
> +  <ItemGroup>
> +    <Inf Include="virt2phys.inf" />
> +  </ItemGroup>
> +  <PropertyGroup Label="Globals">
> +    <ProjectGuid>{0EEF826B-9391-43A8-A722-BDD6F6115137}</ProjectGuid>
> +    <TemplateGuid>{497e31cb-056b-4f31-abb8-447fd55ee5a5}</TemplateGuid>
> +    <TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
> +    <MinimumVisualStudioVersion>12.0</MinimumVisualStudioVersion>
> +    <Configuration>Debug</Configuration>
> +    <Platform Condition="'$(Platform)' == ''">Win32</Platform>
> +    <RootNamespace>virt2phys</RootNamespace>
> +  </PropertyGroup>
> +  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>true</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>false</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>true</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>false</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>true</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>false</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>true</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'" Label="Configuration">
> +    <TargetVersion>Windows10</TargetVersion>
> +    <UseDebugLibraries>false</UseDebugLibraries>
> +    <PlatformToolset>WindowsKernelModeDriver10.0</PlatformToolset>
> +    <ConfigurationType>Driver</ConfigurationType>
> +    <DriverType>KMDF</DriverType>
> +    <DriverTargetPlatform>Universal</DriverTargetPlatform>
> +  </PropertyGroup>
> +  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
> +  <ImportGroup Label="ExtensionSettings">
> +  </ImportGroup>
> +  <ImportGroup Label="PropertySheets">
> +    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
> +  </ImportGroup>
> +  <PropertyGroup Label="UserMacros" />
> +  <PropertyGroup />
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
> +    <DebuggerFlavor>DbgengKernelDebugger</DebuggerFlavor>
> +  </PropertyGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
> +    <ClCompile>
> +      <WppEnabled>false</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +    <Link>
> +      <AdditionalDependencies>$(DDK_LIB_PATH)wdmsec.lib;%(AdditionalDependencies)</AdditionalDependencies>
> +    </Link>
> +    <Inf>
> +      <TimeStamp>0.1</TimeStamp>
> +    </Inf>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|ARM64'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|ARM64'">
> +    <ClCompile>
> +      <WppEnabled>true</WppEnabled>
> +      <WppRecorderEnabled>true</WppRecorderEnabled>
> +      <WppScanConfigurationData Condition="'%(ClCompile.ScanConfigurationData)' == ''">trace.h</WppScanConfigurationData>
> +      <WppKernelMode>true</WppKernelMode>
> +    </ClCompile>
> +  </ItemDefinitionGroup>
> +  <ItemGroup>
> +    <FilesToPackage Include="$(TargetPath)" />
> +  </ItemGroup>
> +  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
> +  <ImportGroup Label="ExtensionTargets">
> +  </ImportGroup>
> +</Project>
> \ No newline at end of file
> diff --git a/windows/virt2phys/virt2phys.vcxproj.filters b/windows/virt2phys/virt2phys.vcxproj.filters
> new file mode 100644
> index 0000000..0fe65fc
> --- /dev/null
> +++ b/windows/virt2phys/virt2phys.vcxproj.filters
> @@ -0,0 +1,36 @@
> +<?xml version="1.0" encoding="utf-8"?>
> +<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
> +  <ItemGroup>
> +    <Filter Include="Source Files">
> +      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
> +      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
> +    </Filter>
> +    <Filter Include="Header Files">
> +      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>
> +      <Extensions>h;hpp;hxx;hm;inl;inc;xsd</Extensions>
> +    </Filter>
> +    <Filter Include="Resource Files">
> +      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>
> +      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms</Extensions>
> +    </Filter>
> +    <Filter Include="Driver Files">
> +      <UniqueIdentifier>{8E41214B-6785-4CFE-B992-037D68949A14}</UniqueIdentifier>
> +      <Extensions>inf;inv;inx;mof;mc;</Extensions>
> +    </Filter>
> +  </ItemGroup>
> +  <ItemGroup>
> +    <Inf Include="virt2phys.inf">
> +      <Filter>Driver Files</Filter>
> +    </Inf>
> +  </ItemGroup>
> +  <ItemGroup>
> +    <ClInclude Include="virt2phys.h">
> +      <Filter>Header Files</Filter>
> +    </ClInclude>
> +  </ItemGroup>
> +  <ItemGroup>
> +    <ClCompile Include="virt2phys.c">
> +      <Filter>Source Files</Filter>
> +    </ClCompile>
> +  </ItemGroup>
> +</Project>
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 02/10] eal/windows: do not expose private EAL facilities
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
@ 2020-04-21 22:40       ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-04-21 22:40 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

14/04/2020 21:44, Dmitry Kozlyuk:
> The goal of rte_os.h is to mitigate OS differences for EAL users.
> In Windows EAL, rte_os.h did excessive things:
> 
> 1. It included platform SDK headers (windows.h, etc). Those files are
>    huge, require specific inclusion order, and are generally unused by
>    the code including rte_os.h. Declarations from platform SDK may
>    break otherwise platform-independent code, e.g. min, max, ERROR.
> 
> 2. It included pthread.h, which is clearly not always required.
> 
> 3. It defined functions private to Windows EAL.
> 
> Reorganize Windows EAL includes in the following way:
> 
> 1. Create rte_windows.h to properly import Windows-specific facilities.
>    Primary users are bus drivers, tests, and external applications.
> 
> 2. Remove platform SDK includes from rte_os.h to prevent breaking
>    otherwise portable code by including rte_os.h on Windows.
>    Copy necessary definitions to avoid including those headers.
> 
> 3. Remove pthread.h include from rte_os.h.
> 
> 4. Move declarations private to Windows EAL into eal_windows.h.
> 
> Fixes: 428eb983f5f7 ("eal: add OS specific header file")
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>

Applied as a separate patch because it is needed for a patch
fixing compilation on Windows.
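
The resulting include discipline can be sketched as follows (illustration
only, not code from the patch):

	/* Portable code keeps using the generic OS shim; after this
	 * change it no longer drags in windows.h, pthread.h, or SDK
	 * macros such as min, max, and ERROR that break otherwise
	 * platform-independent sources.
	 */
	#include <rte_os.h>

	/* Windows-specific code (bus drivers, tests, external
	 * applications) imports the platform SDK through the dedicated
	 * header, which performs the inclusions in the required order.
	 */
	#include <rte_windows.h>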




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-20  5:59       ` Tal Shnaiderman
@ 2020-04-21 23:36         ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-21 23:36 UTC (permalink / raw)
  To: Tal Shnaiderman
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Thomas Monjalon,
	Anatoly Burakov, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

> I'd remove the GetVersionEx check entirely and add the comments regarding OS dependency to the RTE_LOG
> of the LoadLibraryA failure below. GetVersionEx returns the Windows 8 OS version on newer servers.

Agreed, will do in v4.

> Also, it looks like not all Win2016 servers versions support VirtualAlloc2, I'm using Microsoft Windows Server 2016 Datacenter Version 10.0.14393 and LoadLibraryA failed to load VirtualAlloc2.

I confirm this. Documentation states Windows Server 2016 is supported, but it
is at least partially incorrect, see comments in meson.build and GitHub issue:

	https://github.com/MicrosoftDocs/feedback/issues/1129

How would you estimate Server 2016 support importance? Server 2019 and
Windows 10 are known to work.
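
For reference, the direction agreed above replaces version checking with
feature probing; a minimal sketch mirroring the loading code in the patch
(logging shortened):

	/* The executable manifest determines what GetVersionEx()
	 * reports, so probe for the API itself instead of trusting
	 * the reported version.
	 */
	HMODULE lib = LoadLibraryA("kernelbase.dll");
	FARPROC fn = (lib != NULL) ?
		GetProcAddress(lib, "VirtualAlloc2") : NULL;
	if (fn == NULL)
		RTE_LOG(DEBUG, EAL, "VirtualAlloc2() unavailable\n");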

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
                         ` (2 preceding siblings ...)
  2020-04-20  5:59       ` Tal Shnaiderman
@ 2020-04-22  0:55       ` Ranjit Menon
  2020-04-22  2:07       ` Ranjit Menon
  4 siblings, 0 replies; 218+ messages in thread
From: Ranjit Menon @ 2020-04-22  0:55 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Burakov, Anatoly, Harini Ramakrishnan,
	Omar Cardona, Kadam, Pallavi

On 4/14/2020 12:44 PM, Dmitry Kozlyuk wrote:
> System memory management is implemented differently for POSIX and
> Windows. Introduce wrapper functions for operations used across DPDK:
> 
> * rte_mem_map()
>    Create memory mapping for a regular file or a page file (swap).
>    This supports mapping to a reserved memory region even on Windows.
> 
> * rte_mem_unmap()
>    Remove mapping created with rte_mem_map().
> 
> * rte_get_page_size()
>    Obtain default system page size.
> 
> * rte_mem_lock()
>    Make arbitrary-sized memory region non-swappable.
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be more safe and
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>

<Snip!>

> diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
> new file mode 100644
> index 000000000..5697187ce
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_memory.c
> @@ -0,0 +1,437 @@
> +#include <io.h>
> +
> +#include <rte_errno.h>
> +#include <rte_memory.h>
> +
> +#include "eal_private.h"
> +#include "eal_windows.h"
> +
> +/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
> + * Provide a copy of definitions and code to load it dynamically.
> + * Note: definitions are copied verbatim from Microsoft documentation
> + * and don't follow DPDK code style.
> + */
> +#ifndef MEM_PRESERVE_PLACEHOLDER
> +
> +/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
> +typedef enum MEM_EXTENDED_PARAMETER_TYPE {
> +	MemExtendedParameterInvalidType,
> +	MemExtendedParameterAddressRequirements,
> +	MemExtendedParameterNumaNode,
> +	MemExtendedParameterPartitionHandle,
> +	MemExtendedParameterMax,
> +	MemExtendedParameterUserPhysicalHandle,
> +	MemExtendedParameterAttributeFlags
> +} *PMEM_EXTENDED_PARAMETER_TYPE;
> +
> +#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
> +
> +/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
> +typedef struct MEM_EXTENDED_PARAMETER {
> +	struct {
> +		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
> +		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
> +	} DUMMYSTRUCTNAME;
> +	union {
> +		DWORD64 ULong64;
> +		PVOID   Pointer;
> +		SIZE_T  Size;
> +		HANDLE  Handle;
> +		DWORD   ULong;
> +	} DUMMYUNIONNAME;
> +} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
> +
> +/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
> +typedef PVOID (*VirtualAlloc2_type)(
> +	HANDLE                 Process,
> +	PVOID                  BaseAddress,
> +	SIZE_T                 Size,
> +	ULONG                  AllocationType,
> +	ULONG                  PageProtection,
> +	MEM_EXTENDED_PARAMETER *ExtendedParameters,
> +	ULONG                  ParameterCount
> +);
> +
> +/* VirtualAlloc2() flags. */
> +#define MEM_COALESCE_PLACEHOLDERS 0x00000001
> +#define MEM_PRESERVE_PLACEHOLDER  0x00000002
> +#define MEM_REPLACE_PLACEHOLDER   0x00004000
> +#define MEM_RESERVE_PLACEHOLDER   0x00040000
> +
> +/* Named exactly as the function, so that user code does not depend
> + * on it being found at compile time or dynamically.
> + */
> +static VirtualAlloc2_type VirtualAlloc2;
> +
> +int
> +eal_mem_win32api_init(void)
> +{
> +	static const char library_name[] = "kernelbase.dll";
> +	static const char function[] = "VirtualAlloc2";
> +
> +	OSVERSIONINFO info;
> +	HMODULE library = NULL;
> +	int ret = 0;
> +
> +	/* Already done. */
> +	if (VirtualAlloc2 != NULL)
> +		return 0;
> +
> +	/* IsWindows10OrGreater() may also be unavailable. */
> +	memset(&info, 0, sizeof(info));
> +	info.dwOSVersionInfoSize = sizeof(info);
> +	GetVersionEx(&info);
> +
> +	/* Checking for Windows 10+ will also detect Windows Server 2016+.
> +	 * Do not abort, because Windows may report false version depending
> +	 * on executable manifest, compatibility mode, etc.
> +	 */
> +	if (info.dwMajorVersion < 10)
> +		RTE_LOG(DEBUG, EAL, "Windows 10+ or Windows Server 2016+ "
> +			"is required for advanced memory features\n");
> +
> +	library = LoadLibraryA(library_name);
> +	if (library == NULL) {
> +		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
> +		return -1;
> +	}
> +
> +	VirtualAlloc2 = (VirtualAlloc2_type)(
> +		(void *)GetProcAddress(library, function));
> +	if (VirtualAlloc2 == NULL) {
> +		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
> +			library_name, function);
> +		ret = -1;
> +	}
> +
> +	FreeLibrary(library);
> +
> +	return ret;
> +}
> +
> +#else
> +
> +/* Stub in case VirtualAlloc2() is provided by the compiler. */
> +int
> +eal_mem_win32api_init(void)
> +{
> +	return 0;
> +}
> +
> +#endif /* no VirtualAlloc2() */

Can you fix this comment to match the #ifndef definition above?
BTW, why use MEM_PRESERVE_PLACEHOLDER (which is also defined within
the block)?

ranjit m.

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers
  2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
                         ` (3 preceding siblings ...)
  2020-04-22  0:55       ` Ranjit Menon
@ 2020-04-22  2:07       ` Ranjit Menon
  4 siblings, 0 replies; 218+ messages in thread
From: Ranjit Menon @ 2020-04-22  2:07 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Burakov, Anatoly, Harini Ramakrishnan,
	Omar Cardona, Kadam, Pallavi

<Snip!>

On 4/14/2020 12:44 PM, Dmitry Kozlyuk wrote:
> diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
> new file mode 100644
> index 000000000..5697187ce
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_memory.c
> @@ -0,0 +1,437 @@

<Snip!>

> +
> +	if ((flags & EAL_RESERVE_EXACT_ADDRESS) && (virt != requested_addr)) {
> +		if (!VirtualFree(virt, 0, MEM_RELEASE))

Microsoft documentation suggests that we use VirtualFreeEx() to free 
memory allocated by VirtualAlloc2(). VirtualFreeEx() would require the 
handle to the current process that was passed into VirtualAlloc2()

There are 6 other such occurrences in this file.
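
A sketch of the suggested replacement, assuming the memory was allocated
with VirtualAlloc2() on GetCurrentProcess() as elsewhere in this file:

	/* Microsoft documentation pairs VirtualAlloc2() with
	 * VirtualFreeEx() on the same process handle.
	 */
	if (!VirtualFreeEx(GetCurrentProcess(), virt, 0, MEM_RELEASE))
		RTE_LOG_WIN32_ERR("VirtualFreeEx()");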

ranjit m.


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management
  2020-04-16 18:34       ` Ranjit Menon
@ 2020-04-23  1:00         ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-23  1:00 UTC (permalink / raw)
  To: Ranjit Menon
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona, Kadam,
	Pallavi, Mcnamara, John, Kovacevic, Marko, Burakov, Anatoly,
	Richardson, Bruce

On 2020-04-16 11:34 GMT-0700 Ranjit Menon wrote:
[snip]
> > --- /dev/null
> > +++ b/lib/librte_eal/windows/include/rte_virt2phys.h
[snip]
> 
> This file is a duplicate of: <kmods>: windows/virt2phys/virt2phys.h
> This is by design, since it documents the driver interface.
> 
> So, to prevent the two files going out-of-sync, should we:
> 1. Make the two filenames the same?
> 2. Add comments to both files referencing each other and the need to 
> keep them both in sync?
> 3. Do both (1) and (2)?
> 
> This will also be an issue for the upcoming Windows netuio kernel driver 
> and I reckon this could be an issue for Linux kernel modules too.
> 
> Thoughts?

I guess it won't be much of an issue in practice. When you edit a driver
interface, you're already considering user-mode and kernel interaction, by
definition of that interface, so you can't forget about the counterpart.

I'd be more worried about ABI, because memory corruption in the driver will
crash the system and be hard to debug. Can we leverage abigail for these interfaces?

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 0/8] Windows basic memory management
  2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
                     ` (10 preceding siblings ...)
  2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
@ 2020-04-28 23:50   ` Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 1/8] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
                       ` (8 more replies)
  11 siblings, 9 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk

Note: symbols and release notes updated for v20.05, despite this patch
      now targeting v20.08, because proper sections don't yet exist.

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing with IOVA unavailable.

New EAL public functions for memory mapping are introduced
to mitigate OS differences in DPDK libraries and applications
(a usage sketch follows the list):

* rte_mem_map
* rte_mem_unmap
* rte_mem_lock
* rte_get_page_size
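
A minimal usage sketch of these wrappers (signatures as introduced in this
series; error handling shortened, and the anonymous mapping is assumed to
accept an fd of -1 as with POSIX mmap):

	size_t len = rte_get_page_size();
	void *addr = rte_mem_map(NULL, len,
			RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
	if (addr == NULL)
		rte_panic("cannot map: %s\n", rte_strerror(rte_errno));
	if (rte_mem_lock(addr, len) < 0) /* make non-swappable */
		rte_panic("cannot lock: %s\n", rte_strerror(rte_errno));
	rte_mem_unmap(addr, len);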

To support common MM routines, internal wrappers for low-level
memory reservation and file management are introduced. These changes
affect Linux and FreeBSD EAL. Shared code is placed under the /unix/
subdirectory (suggested by Thomas).

Also, entire <sys/queue.h> is imported from FreeBSD, replacing existing
partial import. There is already a license exception for this file.


Windows MM duplicates quite a lot of code from Linux EAL:

* eal_memalloc_alloc_seg_bulk
* eal_memalloc_free_seg_bulk
* calc_num_pages_per_socket
* rte_eal_hugepage_init

Perhaps this should be left as-is until Windows MM evolves into having
some specific requirements for these parts.


Notes on checkpatch warnings:

* No space after comma / no space before closing parenthesis in macros---
  definitely a false positive, unclear how to suppress this.
* Issues from imported BSD code---probably should be ignored?

---

v4:

    * Rebase on ToT, drop patches merged into master.
    * Rearrange patches to split Windows code (Jerin).
    * Fix Linux and FreeBSD build with make (Ophir).
    * Use int instead of enum to hold a set of flags (Anatoly).
    * Rename eal_mem_reserve items and fix their description (Anatoly).
    * Add eal_mem_set_dump() wrapper around madvise (Anatoly).
    * Don't claim Windows Server 2016 support due to lack of API (Tal).
    * Replace enum rte_page_sizes with a set of #defines (Jerin).
    * Fix documentation, SPDX tags, logging (Thomas).

v3:

    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:

    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.


Dmitry Kozlyuk (8):
  eal: replace rte_page_sizes with a set of constants
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal: extract common code for memseg list initialization
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal/windows: implement basic memory management

 config/meson.build                            |   12 +-
 doc/guides/rel_notes/release_20_05.rst        |    2 +
 doc/guides/windows_gsg/build_dpdk.rst         |   20 -
 doc/guides/windows_gsg/index.rst              |    1 +
 doc/guides/windows_gsg/run_apps.rst           |   95 ++
 lib/librte_eal/common/eal_common_fbarray.c    |   56 +-
 lib/librte_eal/common/eal_common_memory.c     |  116 +-
 lib/librte_eal/common/eal_common_memzone.c    |    7 +
 lib/librte_eal/common/eal_private.h           |  155 ++-
 lib/librte_eal/common/meson.build             |   10 +
 lib/librte_eal/common/rte_malloc.c            |    9 +
 lib/librte_eal/freebsd/Makefile               |    5 +
 lib/librte_eal/freebsd/eal_memory.c           |   57 +-
 lib/librte_eal/include/rte_memory.h           |  109 +-
 lib/librte_eal/linux/Makefile                 |    5 +
 lib/librte_eal/linux/eal_memalloc.c           |    5 +-
 lib/librte_eal/linux/eal_memory.c             |   68 +-
 lib/librte_eal/meson.build                    |    4 +
 lib/librte_eal/rte_eal_exports.def            |  119 ++
 lib/librte_eal/rte_eal_version.map            |    4 +
 lib/librte_eal/unix/eal_unix.c                |   51 +
 lib/librte_eal/unix/eal_unix_memory.c         |  144 ++
 lib/librte_eal/unix/meson.build               |    7 +
 lib/librte_eal/windows/eal.c                  |  158 +++
 lib/librte_eal/windows/eal_hugepages.c        |  108 ++
 lib/librte_eal/windows/eal_lcore.c            |  185 ++-
 lib/librte_eal/windows/eal_memalloc.c         |  418 ++++++
 lib/librte_eal/windows/eal_memory.c           | 1155 +++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               |  103 ++
 lib/librte_eal/windows/eal_windows.h          |  100 ++
 lib/librte_eal/windows/include/meson.build    |    1 +
 lib/librte_eal/windows/include/rte_os.h       |    4 +
 .../windows/include/rte_virt2phys.h           |   34 +
 lib/librte_eal/windows/include/rte_windows.h  |    2 +
 lib/librte_eal/windows/include/sys/queue.h    |  663 +++++++++-
 lib/librte_eal/windows/include/unistd.h       |    3 +
 lib/librte_eal/windows/meson.build            |    6 +
 37 files changed, 3655 insertions(+), 346 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/unix/eal_unix.c
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 1/8] eal: replace rte_page_sizes with a set of constants
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 2/8] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Jerin Jacob, John McNamara, Marko Kovacevic,
	Anatoly Burakov

Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
Enum rte_page_sizes has members valued above this limit, which get
wrapped to zero, resulting in compilation error (duplicate values in
enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.

Remove rte_page_sizes and replace its values with #define's.
This enumeration is not used in public API, so there's no ABI breakage.
Document API change in release notes.
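
To illustrate the underlying limitation, a constructed example (not code
from DPDK): under the MS ABI, an enumerator must fit in a 32-bit int.

	enum bad_page_sizes {
		BAD_PGSIZE_4G  = 1ULL << 32, /* wraps to 0 */
		BAD_PGSIZE_16G = 1ULL << 34  /* also 0: duplicate value,
					      * hence the Clang error */
	};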

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 doc/guides/rel_notes/release_20_05.rst |  2 ++
 lib/librte_eal/include/rte_memory.h    | 23 ++++++++++-------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_05.rst b/doc/guides/rel_notes/release_20_05.rst
index b124c3f28..76ba59220 100644
--- a/doc/guides/rel_notes/release_20_05.rst
+++ b/doc/guides/rel_notes/release_20_05.rst
@@ -225,6 +225,8 @@ Removed Items
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* ``enum rte_page_sizes`` is removed, ``RTE_PGSIZE_*`` constants are kept.
+
 
 API Changes
 -----------
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..65374d53a 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -24,19 +24,16 @@ extern "C" {
 #include <rte_config.h>
 #include <rte_fbarray.h>
 
-__extension__
-enum rte_page_sizes {
-	RTE_PGSIZE_4K    = 1ULL << 12,
-	RTE_PGSIZE_64K   = 1ULL << 16,
-	RTE_PGSIZE_256K  = 1ULL << 18,
-	RTE_PGSIZE_2M    = 1ULL << 21,
-	RTE_PGSIZE_16M   = 1ULL << 24,
-	RTE_PGSIZE_256M  = 1ULL << 28,
-	RTE_PGSIZE_512M  = 1ULL << 29,
-	RTE_PGSIZE_1G    = 1ULL << 30,
-	RTE_PGSIZE_4G    = 1ULL << 32,
-	RTE_PGSIZE_16G   = 1ULL << 34,
-};
+#define RTE_PGSIZE_4K   (1ULL << 12)
+#define RTE_PGSIZE_64K  (1ULL << 16)
+#define RTE_PGSIZE_256K (1ULL << 18)
+#define RTE_PGSIZE_2M   (1ULL << 21)
+#define RTE_PGSIZE_16M  (1ULL << 24)
+#define RTE_PGSIZE_256M (1ULL << 28)
+#define RTE_PGSIZE_512M (1ULL << 29)
+#define RTE_PGSIZE_1G   (1ULL << 30)
+#define RTE_PGSIZE_4G   (1ULL << 32)
+#define RTE_PGSIZE_16G  (1ULL << 34)
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 2/8] eal: introduce internal wrappers for file operations
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 1/8] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-29 16:41       ` Burakov, Anatoly
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers Dmitry Kozlyuk
                       ` (6 subsequent siblings)
  8 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov, Bruce Richardson

EAL common code uses file locking and truncation. Introduce
OS-independent wrappers in order to support both Linux/FreeBSD
and Windows:

* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

Wrappers follow POSIX semantics, but the interface is not POSIX,
so that it can be made cleaner, e.g. by not mixing the locking
operation with the behaviour on conflict.

Implementation for Linux and FreeBSD is placed in the "unix" subdirectory,
which is intended for code common to the two. Files should be named
after the ones in the OS subdirectories from which the code is factored.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c | 19 ++++----
 lib/librte_eal/common/eal_private.h        | 45 +++++++++++++++++++
 lib/librte_eal/freebsd/Makefile            |  4 ++
 lib/librte_eal/linux/Makefile              |  4 ++
 lib/librte_eal/meson.build                 |  4 ++
 lib/librte_eal/unix/eal_unix.c             | 51 ++++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |  6 +++
 7 files changed, 122 insertions(+), 11 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_unix.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 4f8f1af73..1e55757ca 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -8,8 +8,8 @@
 #include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -85,7 +85,7 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
 		/* pass errno up the chain */
 		rte_errno = errno;
@@ -778,7 +778,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 					__func__, path, strerror(errno));
 			rte_errno = errno;
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
 					__func__, path, strerror(errno));
 			rte_errno = EBUSY;
@@ -789,10 +790,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -895,10 +894,8 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -1025,7 +1022,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,7 +1039,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index ecf827914..3aafd892f 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -448,4 +448,49 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
 
 void eal_free_no_trace(void *addr);
 
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index a8400f20a..a26c455c7 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index a77eb1757..fa41f00bf 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 0267c3b9d..98c97dd07 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal_unix.c b/lib/librte_eal/unix/eal_unix.c
new file mode 100644
index 000000000..b9c57ef18
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix.c
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..cfa1b4ef9
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal_unix.c',
+)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 1/8] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 2/8] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-29 17:13       ` Burakov, Anatoly
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 4/8] eal: extract common code for memseg list initialization Dmitry Kozlyuk
                       ` (5 subsequent siblings)
  8 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov, Bruce Richardson

Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:

* rte_mem_map()
* rte_mem_unmap()
* rte_get_page_size()
* rte_mem_lock()

Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be more safe and
expressive.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c |  37 +++---
 lib/librte_eal/common/eal_common_memory.c  |  62 ++++-----
 lib/librte_eal/common/eal_private.h        |  74 ++++++++++-
 lib/librte_eal/freebsd/Makefile            |   1 +
 lib/librte_eal/include/rte_memory.h        |  86 ++++++++++++
 lib/librte_eal/linux/Makefile              |   1 +
 lib/librte_eal/linux/eal_memalloc.c        |   5 +-
 lib/librte_eal/rte_eal_version.map         |   4 +
 lib/librte_eal/unix/eal_unix_memory.c      | 144 +++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |   1 +
 10 files changed, 350 insertions(+), 65 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 1e55757ca..b3b6c8521 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,15 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
 #include <string.h>
 #include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -92,12 +92,9 @@ resize_and_map(int fd, void *addr, size_t len)
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -735,7 +732,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -756,9 +753,11 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		static const int flags = RTE_MAP_FORCE_ADDRESS |
+			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
+		void *new_data = rte_mem_map(data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
 					__func__, strerror(errno));
 			goto fail;
@@ -823,7 +822,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -861,7 +860,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -913,7 +912,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -941,8 +940,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -961,7 +959,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -996,8 +994,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1046,7 +1043,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4c897a13f..1196a8037 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,7 +11,6 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
@@ -40,18 +39,10 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
-#ifdef RTE_EXEC_ENV_LINUX
-#define RTE_DONTDUMP MADV_DONTDUMP
-#elif defined RTE_EXEC_ENV_FREEBSD
-#define RTE_DONTDUMP MADV_NOCORE
-#else
-#error "madvise doesn't support this OS"
-#endif
-
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, int reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -59,9 +50,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_get_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -105,24 +94,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
-			*size -= page_sz;
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
+			size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, *size);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -132,20 +121,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -161,7 +147,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -179,19 +165,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	if (!unmap) {
 		/* Exclude these pages from a core dump. */
-		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
-			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
-				strerror(errno));
+		eal_mem_set_dump(aligned_addr, *size, false);
 	}
 
 	return aligned_addr;
@@ -547,10 +531,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	int page_size = rte_get_page_size();
+	uintptr_t aligned = (virtual & ~(page_size - 1));
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 3aafd892f..67ca83e47 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/**
+	 * Reserve hugepages. May be unsupported by some platforms.
+	 */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/**
+	 * Force reserving memory at the requested address.
+	 * This can be a destructive action depending on the implementation.
+	 *
+	 * @see RTE_MAP_FORCE_ADDRESS for a description of possible consequences
+	 *      (although implementations are not required to use it).
+	 */
+	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to rte_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+		size_t page_sz, int flags, int reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -493,4 +512,53 @@ int eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
  */
 int eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address. The system may not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options, a combination of eal_mem_reserve_flags.
+ * @return
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until it is remapped.
+ */
+void *eal_mem_reserve(void *requested_addr, size_t size, int flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If *virt* and *size* describe a part of the reserved region,
+ * only this part of the region is freed (accurately up to the system
+ * page size). If *virt* points to allocated memory, *size* must match
+ * the size specified on allocation. The behavior is undefined
+ * if the memory pointed to by *virt* was obtained from a source
+ * other than those listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void eal_mem_free(void *virt, size_t size);
+
+/**
+ * Configure memory region inclusion into core dumps.
+ *
+ * @param virt
+ *  Starting address of the region.
+ * @param size
+ *  Size of the region.
+ * @param dump
+ *  True to include memory into core dumps, false to exclude.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int eal_mem_set_dump(void *virt, size_t size, bool dump);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index a26c455c7..647dfd0f2 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 65374d53a..5cceeedc8 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -82,6 +82,92 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * Memory protection flags.
+ */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/**
+ * Additional flags for memory mapping.
+ */
+enum rte_map_flags {
+	/** Changes to the mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/**
+	 * Force mapping to the requested address. This flag should be used
+	 * with caution, because to fulfill the request the implementation
+	 * may remove all other mappings in the requested region. However,
+	 * it is not required to do so, thus mapping with this flag may fail.
+	 */
+	RTE_MAP_FORCE_ADDRESS = 1 << 3
+};
+
+/**
+ * Map a portion of an opened file or the page file into memory.
+ *
+ * This function is similar to POSIX mmap(3) with the common MAP_ANONYMOUS
+ * extension, except for the return value.
+ *
+ * @param requested_addr
+ *  Desired virtual address for mapping. Can be NULL to let the OS choose.
+ * @param size
+ *  Size of the mapping in bytes.
+ * @param prot
+ *  Protection flags, a combination of rte_mem_prot values.
+ * @param flags
+ *  Additional mapping flags, a combination of rte_map_flags.
+ * @param fd
+ *  Mapped file descriptor. Can be negative for anonymous mapping.
+ * @param offset
+ *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
+ * @return
+ *  Mapped address, or NULL on failure; rte_errno is set to the OS error.
+ */
+__rte_experimental
+void *rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_experimental
+int rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Page size in bytes.
+ */
+__rte_experimental
+size_t rte_get_page_size(void);
+
+/**
+ * Lock a region in physical memory and prevent it from being swapped.
+ *
+ * @param virt
+ *   The virtual address.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @note Implementations may require *virt* and *size*
+ *       to be multiples of system page size.
+ * @see rte_get_page_size() to retrieve the page size.
+ * @see rte_mem_lock_page() to lock an entire single page.
+ */
+__rte_experimental
+int rte_mem_lock(const void *virt, size_t size);
+
 /**
  * Lock page in physical memory and prevent from swapping.
  *
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index fa41f00bf..06428f0de 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index 2c717f8bd..bf29b83c6 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 mapped:
 	munmap(addr, alloc_sz);
 unmapped:
-	flags = MAP_FIXED;
+	flags = EAL_RESERVE_FORCE_ADDRESS;
 	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
 	if (new_addr != addr) {
 		if (new_addr != NULL)
@@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
-		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
+	eal_mem_set_dump(ms->addr, ms->len, false);
 
 	exit_early = false;
 
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index 6088e7f6c..5d6d3a8c6 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -373,7 +373,11 @@ EXPERIMENTAL {
 	__rte_trace_point_register;
 	per_lcore_trace_mem;
 	per_lcore_trace_point_sz;
+	rte_get_page_size;
 	rte_log_can_log;
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_unmap;
 	rte_thread_getname;
 	rte_trace_dump;
 	rte_trace_is_enabled;
diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
new file mode 100644
index 000000000..3eab7b941
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix_memory.c
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+
+#ifdef RTE_EXEC_ENV_LINUX
+#define EAL_DONTDUMP MADV_DONTDUMP
+#define EAL_DODUMP   MADV_DODUMP
+#elif defined RTE_EXEC_ENV_FREEBSD
+#define EAL_DONTDUMP MADV_NOCORE
+#define EAL_DODUMP   MADV_CORE
+#else
+#error "madvise is not supported on this OS"
+#endif
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+#ifdef MAP_HUGETLB
+		sys_flags |= MAP_HUGETLB;
+#else
+		rte_errno = ENOTSUP;
+		return NULL;
+#endif
+	}
+
+	if (flags & EAL_RESERVE_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
+	int ret = madvise(virt, size, flags);
+	if (ret) {
+		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
+				virt, size, flags, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+static int
+mem_rte_to_sys_prot(int prot)
+{
+	int sys_prot = 0;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	int sys_prot = 0;
+	int sys_flags = 0;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+size_t
+rte_get_page_size(void)
+{
+	return sysconf(_SC_PAGESIZE);
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	return mlock(virt, size);
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index cfa1b4ef9..5734f26ad 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal_unix.c',
+	'eal_unix_memory.c',
 )
-- 
2.25.1
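
Note (illustrative, not part of the patch): a minimal usage sketch of the
wrappers introduced above, assuming only the API added by this patch.
Function and flag names are those from rte_memory.h; the example function
itself is hypothetical.

	#include <rte_errno.h>
	#include <rte_memory.h>

	/* Reserve an anonymous, process-private region, lock one page,
	 * then release everything.
	 */
	static int
	example_map_and_lock(void)
	{
		size_t page_sz = rte_get_page_size();
		void *va;

		/* Anonymous mapping: fd is negative, offset must be 0. */
		va = rte_mem_map(NULL, 4 * page_sz,
			RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
		if (va == NULL)
			return -rte_errno;

		/* Implementations may require page-aligned address and size. */
		if (rte_mem_lock(va, page_sz) < 0) {
			rte_mem_unmap(va, 4 * page_sz);
			return -rte_errno;
		}

		return rte_mem_unmap(va, 4 * page_sz);
	}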


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 4/8] eal: extract common code for memseg list initialization
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
                       ` (2 preceding siblings ...)
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-05-05 16:08       ` Burakov, Anatoly
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 5/8] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                       ` (4 subsequent siblings)
  8 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Anatoly Burakov, Bruce Richardson

All supported OSes create memory segment lists (MSL) and reserve VA
space for them in a nearly identical way. Move the common code into
EAL private functions to reduce duplication.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
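Note (illustrative, not part of the patch): the intended call pattern for
the two new helpers, as used by the per-OS wrappers below. Socket, index,
and flag values are placeholders.

	static int
	example_create_msl(struct rte_memseg_list *msl,
		uint64_t page_sz, int n_segs)
	{
		/* Create the fbarray backing store and fill MSL fields. */
		if (eal_memseg_list_init(msl, page_sz, n_segs,
				0 /* socket */, 0 /* MSL index */,
				true /* heap */) < 0)
			return -1;

		/* Reserve VA space; 0 means no extra reservation flags. */
		return eal_memseg_list_alloc(msl, 0);
	}
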
 lib/librte_eal/common/eal_common_memory.c | 54 ++++++++++++++++++
 lib/librte_eal/common/eal_private.h       | 36 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       | 57 +++----------------
 lib/librte_eal/linux/eal_memory.c         | 68 +++++------------------
 4 files changed, 113 insertions(+), 102 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 1196a8037..56eff0acb 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -24,6 +24,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -181,6 +182,59 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
+			(size_t)page_sz >> 10, socket_id);
+
+	return 0;
+}
+
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
+{
+	uint64_t page_sz;
+	size_t mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(
+		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
+	if (addr == NULL) {
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	return 0;
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 67ca83e47..4a28274ec 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -254,6 +254,42 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int reserve_flags);
 
+/**
+ * Initialize a memory segment list and create its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
+/**
+ * Reserve VA space for a memory segment list
+ * previously initialized with eal_memseg_list_init().
+ *
+ * @param msl
+ *  Memory segment list with page size defined.
+ * @param reserve_flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5bc2da160..870ad94c0 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -336,64 +336,25 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
 	int flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
@@ -479,7 +440,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_init(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
@@ -487,7 +448,7 @@ memseg_primary_init(void)
 			total_type_mem = total_segs * hugepage_sz;
 			type_msl_idx++;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				return -1;
 			}
@@ -518,7 +479,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..888bb2466 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, 0);
 }
 
 /*
@@ -1009,12 +969,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_init(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_alloc(msl) < 0)
 				return -1;
 		}
 	}
@@ -2191,7 +2151,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_init(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2160,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_alloc(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2355,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_init(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2393,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 5/8] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
                       ` (3 preceding siblings ...)
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 4/8] eal: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 6/8] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon

The limited version imported previously lacks at least the SLIST macros.
Import the complete file from FreeBSD, since its license exception is
already approved by the Technical Board.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
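Note (illustrative, not part of the patch): a sketch of the SLIST usage
that the previous limited header could not support. All macros below are
provided by the imported file.

	struct item {
		int value;
		SLIST_ENTRY(item) next;	/* embedded forward link */
	};

	SLIST_HEAD(item_list, item);	/* declares struct item_list */

	static int
	example_slist(struct item *a, struct item *b)
	{
		struct item_list head = SLIST_HEAD_INITIALIZER(head);
		struct item *it;
		int sum = 0;

		SLIST_INSERT_HEAD(&head, a, next);
		SLIST_INSERT_AFTER(a, b, next);

		SLIST_FOREACH(it, &head, next)
			sum += it->value;

		SLIST_REMOVE_HEAD(&head, next);
		return sum;
	}
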
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 6/8] eal/windows: improve CPU and NUMA node detection
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
                       ` (4 preceding siblings ...)
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 5/8] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 7/8] eal/windows: initialize hugepage info Dmitry Kozlyuk
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Harini Ramakrishnan, Omar Cardona, Pallavi Kadam,
	Ranjit Menon, Anand Rawat, Jeff Shaw

1. Map CPU cores to their respective NUMA nodes as reported by system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add EAL private function to map DPDK socket ID to NUMA node number.

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
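Note (illustrative, not part of the patch): a global core ID is derived
from a processor group number and a bit position in the group's KAFFINITY
mask, as sketched below. KAFFINITY comes from <windows.h>;
EAL_PROCESSOR_GROUP_SIZE is the macro added by this patch.

	/* Return the core ID of the lowest set bit in the group's mask,
	 * or (unsigned int)-1 for an empty mask.
	 */
	static unsigned int
	example_core_id(unsigned int group, KAFFINITY mask)
	{
		unsigned int i;

		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
			if (mask & ((KAFFINITY)1 << i))
				return group * EAL_PROCESSOR_GROUP_SIZE + i;
		}
		return (unsigned int)-1;
	}
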
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				error);
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* A NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e.g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited to 32 or 64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
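
A quick sketch of how this mapping is meant to be consumed (names come
from this patch; rte_socket_count() is common EAL code, and
dump_socket_nodes is a hypothetical helper, not part of the patch):

    #include <stdio.h>
    #include <rte_lcore.h>
    #include "eal_windows.h"

    /* Print the mapping from EAL socket IDs to Windows NUMA nodes. */
    static void
    dump_socket_nodes(void)
    {
        unsigned int s;

        for (s = 0; s < rte_socket_count(); s++)
            printf("socket %u -> NUMA node %u\n",
                s, eal_socket_numa_node(s));
    }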
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 7/8] eal/windows: initialize hugepage info
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
                       ` (5 preceding siblings ...)
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 6/8] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
  8 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic

Add hugepages discovery ("large pages" in Windows terminology)
and update documentation for the required privilege setup. Only 2MB
hugepages are supported, and their number is estimated only roughly,
because suitable OS APIs are either missing or not yet stable.
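
The estimate itself reduces to two Win32 calls; a minimal sketch of the
approach (error handling trimmed; estimate_hugepages is a hypothetical
helper, the Win32 APIs are the ones used by this patch):

    #include <windows.h>

    /* Rough per-node hugepage estimate: all memory available
     * on a NUMA node is assumed usable for hugepages.
     */
    static unsigned int
    estimate_hugepages(USHORT numa_node)
    {
        SIZE_T page_sz = GetLargePageMinimum(); /* 0 if unsupported */
        ULONGLONG bytes = 0;

        if (page_sz == 0)
            return 0;
        if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes))
            return 0;
        return (unsigned int)(bytes / page_sz);
    }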

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 7 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/config/meson.build b/config/meson.build
index e851b407b..74f163223 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -271,6 +271,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminolocy) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. The privilege is applied upon next logon. In particular, if the privilege
+   has been granted to the current user, a logoff is required before it
+   becomes available.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 2cf7a04ef..63461f51a 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -18,8 +18,11 @@
 #include <eal_options.h>
 #include <eal_private.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -242,6 +245,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..61d0dcd3c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available on Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node is available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem on Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 09dd4ab2f..5f118bfe2 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_thread.c',
 	'getopt.c',
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
                       ` (6 preceding siblings ...)
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 7/8] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-04-28 23:50     ` Dmitry Kozlyuk
  2020-04-29  1:18       ` Ranjit Menon
                         ` (2 more replies)
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
  8 siblings, 3 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Dmitry Kozlyuk, Thomas Monjalon, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic, Anatoly Burakov

Basic memory management supports core libraries and PMDs operating in
IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated from user mode. Multi-process mode is not
implemented and is forcefully disabled at startup.
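
From the application's viewpoint nothing changes; a hedged sketch using
only functions this patch adds to rte_eal_exports.def (check_iova is a
hypothetical helper, not part of the patch):

    #include <rte_malloc.h>
    #include <rte_memory.h>

    /* Allocate from the hugepage-backed heap and query the IOVA,
     * which is always a physical address on Windows. RTE_BAD_IOVA
     * means the virt2phys driver is not available.
     */
    static int
    check_iova(void)
    {
        void *buf = rte_malloc_socket("example", 4096, 0, SOCKET_ID_ANY);
        rte_iova_t iova;

        if (buf == NULL)
            return -1;
        iova = rte_malloc_virt2iova(buf);
        rte_free(buf);
        return (iova == RTE_BAD_IOVA) ? -1 : 0;
    }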

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 config/meson.build                            |   12 +-
 doc/guides/windows_gsg/run_apps.rst           |   54 +-
 lib/librte_eal/common/eal_common_memzone.c    |    7 +
 lib/librte_eal/common/meson.build             |   10 +
 lib/librte_eal/common/rte_malloc.c            |    9 +
 lib/librte_eal/rte_eal_exports.def            |  119 ++
 lib/librte_eal/windows/eal.c                  |  144 ++
 lib/librte_eal/windows/eal_memalloc.c         |  418 ++++++
 lib/librte_eal/windows/eal_memory.c           | 1155 +++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               |  103 ++
 lib/librte_eal/windows/eal_windows.h          |   90 ++
 lib/librte_eal/windows/include/meson.build    |    1 +
 lib/librte_eal/windows/include/rte_os.h       |    4 +
 .../windows/include/rte_virt2phys.h           |   34 +
 lib/librte_eal/windows/include/rte_windows.h  |    2 +
 lib/librte_eal/windows/include/unistd.h       |    3 +
 lib/librte_eal/windows/meson.build            |    5 +
 17 files changed, 2164 insertions(+), 6 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/config/meson.build b/config/meson.build
index 74f163223..800b5ba33 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -264,15 +264,21 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	# Contrary to the docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it from advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..78e5a614f 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to documentation in ``dpdk-kmods`` repository for details on system
+setup, driver build and installation. This driver is not signed, so signature
+checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+The compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from an elevated command prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+On Windows Server additional steps are required:
+
+1. From Device Manager, Action menu, select "Add legacy hardware".
+2. It will launch the "Add Hardware Wizard". Click "Next".
+3. Select second option "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, "Kernel bypass" will be shown as a device class.
+5. Select it, and click "Next".
+6. The driver package installed earlier will now be bound to the
+   "Virtual to physical address translator" device.
+
+When loaded successfully, the driver is shown in *Device Manager* as the
+*Virtual to physical address translator* device under the *Kernel bypass*
+category. The installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
index 7c21aa921..9fa7bf352 100644
--- a/lib/librte_eal/common/eal_common_memzone.c
+++ b/lib/librte_eal/common/eal_common_memzone.c
@@ -19,7 +19,14 @@
 #include <rte_errno.h>
 #include <rte_string_fns.h>
 #include <rte_common.h>
+
+#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_eal_trace.h>
+#else
+#define rte_eal_trace_memzone_reserve(...)
+#define rte_eal_trace_memzone_lookup(...)
+#define rte_eal_trace_memzone_free(...)
+#endif
 
 #include "malloc_heap.h"
 #include "malloc_elem.h"
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 155da29b4..9bb234009 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -9,11 +9,21 @@ if is_windows
 		'eal_common_class.c',
 		'eal_common_devargs.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_string_fns.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 		'rte_option.c',
 	)
 	subdir_done()
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index f1b73168b..34b416927 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -20,7 +20,16 @@
 #include <rte_lcore.h>
 #include <rte_common.h>
 #include <rte_spinlock.h>
+
+#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_eal_trace.h>
+#else
+/* Suppress -Wempty-body for tracepoints used as "if" body. */
+#define rte_eal_trace_mem_malloc(...) do {} while (0)
+#define rte_eal_trace_mem_zmalloc(...) do {} while (0)
+#define rte_eal_trace_mem_realloc(...) do {} while (0)
+#define rte_eal_trace_mem_free(...) do {} while (0)
+#endif
 
 #include <rte_malloc.h>
 #include "malloc_elem.h"
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..854b83bcd 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,9 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
 	rte_log
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
 	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_get_page_size
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_unmap
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 63461f51a..38f17f09c 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -93,6 +93,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -224,6 +242,89 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	/* Moving the pointer does not change the file size,
+	 * the new end of file must be set explicitly.
+	 */
+	if (!SetEndOfFile(handle)) {
+		RTE_LOG_WIN32_ERR("SetEndOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
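+
+/* Usage sketch: a non-blocking exclusive lock attempt reports
+ * contention via rte_errno:
+ *
+ *	if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0 &&
+ *			rte_errno == EWOULDBLOCK)
+ *		... the lock is held by another process ...
+ */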
+
  /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
@@ -245,6 +346,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.no_shconf == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.no_shconf = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -256,6 +364,42 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..e72e785b8
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,418 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+#include <rte_windows.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d\n", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only the "Ex" version of this function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	eal_mem_decommit(addr, alloc_sz);
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len))
+		return -1;
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info);
+				i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
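
The walkers above sit behind the usual internal EAL entry points;
a minimal sketch of a round-trip through them (memalloc_round_trip is
a hypothetical check, not part of the patch):

    #include <rte_memory.h>
    #include "eal_memalloc.h"

    /* Allocate a single 2MB segment on socket 0, then release it.
     * The page size must match one detected by eal_hugepage_info_init().
     */
    static int
    memalloc_round_trip(void)
    {
        struct rte_memseg *ms = eal_memalloc_alloc_seg(RTE_PGSIZE_2M, 0);

        if (ms == NULL)
            return -1;
        return eal_memalloc_free_seg(ms);
    }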
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..3812b7c67
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,1155 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2010-2014 Intel Corporation (functions from Linux EAL)
+ * Copyright (c) 2020 Dmitry Kozlyuk (Windows specifics)
+ */
+
+#include <inttypes.h>
+#include <io.h>
+
+#include <rte_errno.h>
+#include <rte_memory.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+#include <rte_virt2phys.h>
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ *
+ * MEM_PRESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterMax,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly as the function, so that user code does not depend
+ * on it being found at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
+	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
+	 */
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+
+		/* Contrary to the docs, Server 2016 is not supported. */
+		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
+			"is required for memory management\n");
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* defined(MEM_PRESERVE_PLACEHOLDER) */
+
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Always using physical addresses under Windows if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFree(virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFree()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc(size_t size, size_t page_size)
+{
+	if (page_size != 0)
+		return eal_mem_alloc_socket(size, SOCKET_ID_ANY);
+
+	return VirtualAlloc(
+		NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+		if (VirtualQuery(requested_addr, &info, sizeof(info)) == 0) {
+			RTE_LOG_WIN32_ERR("VirtualQuery()");
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) &&
+			!VirtualFree(requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+				"<split placeholder>)", requested_addr, size);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	if (requested_addr != NULL)
+		flags |= MEM_REPLACE_PLACEHOLDER;
+
+	addr = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, "
+			"<replace placeholder>)", requested_addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	/* Decommit memory, which might be a part of a larger reserved region.
+	 * The allocator commits hugepage-sized placeholders, so there is
+	 * no need to coalesce placeholders back into a larger region;
+	 * they can be reused as is.
+	 */
+	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, ...)", addr, size);
+		return -1;
+	}
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if region must be in reserved state but it is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	HANDLE process;
+
+	if (VirtualQuery(addr, &info, sizeof(info)) == 0) {
+		RTE_LOG_WIN32_ERR("VirtualQuery()");
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	process = GetCurrentProcess();
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFreeEx(process, addr, size,
+			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+			"MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)", addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	RTE_SET_USED(virt);
+	RTE_SET_USED(size);
+	RTE_SET_USED(dump);
+
+	/* Windows does not dump reserved memory by default.
+	 *
+	 * There is <werapi.h> to include or exclude regions from the dump,
+	 * but this is not currently required by EAL.
+	 */
+
+	rte_errno = ENOTSUP;
+	return -1;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* There is a race for requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() could replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			CloseHandle(mapping_handle);
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (virt == NULL) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		CloseHandle(mapping_handle);
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!UnmapViewOfFile(virt))
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		virt = NULL;
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		rte_errno = GetLastError();
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		return -1;
+	}
+	return 0;
+}
+
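+/* Usage sketch for the two wrappers above: an anonymous private
+ * mapping needs no file descriptor (sz is a placeholder size):
+ *
+ *	void *p = rte_mem_map(NULL, sz, RTE_PROT_READ | RTE_PROT_WRITE,
+ *		RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
+ *	if (p != NULL)
+ *		rte_mem_unmap(p, sz);
+ */
+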
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* Windows strategy for memory allocation is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless user provides an address hint.
+	 */
+	return 0;
+}
+
+size_t
+rte_get_page_size(void)
+{
+	SYSTEM_INFO info;
+	GetSystemInfo(&info);
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes `void*`, work around compiler warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock()");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx)
+{
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
+}
+
+static int
+memseg_list_alloc(struct rte_memseg_list *msl)
+{
+	return eal_memseg_list_alloc(msl, 0);
+}
+
+/*
+ * Remaining code in this file largely duplicates Linux EAL.
+ * Although Windows EAL supports only one hugepage size currently,
+ * code structure and comments are preserved so that changes may be
+ * easily ported until duplication is removed.
+ */
+
+static int
+memseg_primary_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
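+
+	/* Worked example (a sketch with default config values): with one
+	 * 2MB hugepage size and two NUMA nodes, n_memtypes = 2; each type
+	 * is capped at min(RTE_MAX_MEM_MB_PER_TYPE, RTE_MAX_MEM_MB / 2)
+	 * megabytes and may occupy at most RTE_MAX_MEMSEG_LISTS / 2
+	 * memseg lists.
+	 */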
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how many segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (memseg_list_init(msl, pagesz, n_segs,
+					socket_id, cur_seglist))
+				goto out;
+
+			if (memseg_list_alloc(msl)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int
+memseg_secondary_init(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+		return memseg_primary_init();
+	return memseg_secondary_init();
+}
+
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+static int
+calc_num_pages_per_socket(uint64_t *memory,
+		struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used,
+		unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from cpu mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skip if memory on this specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so let's see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int) (internal_config.memory / 0x100000);
+		available = requested - (unsigned int) (total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
+
+/* Limit is checked by validator itself, nothing left to analyze.*/
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+static int
+eal_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		/* also initialize hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; socket_id++)
+		memory[socket_id] = internal_config.socket_mem[socket_id];
+
+	/* calculate final number of pages */
+	if (calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+				if (pages == NULL)
+					return -1;
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket "
+					"limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs, cur_seg;
+	uint64_t page_sz;
+	void *addr;
+	struct rte_fbarray *arr;
+	struct rte_memseg *ms;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	/* create a memseg list */
+	msl = &mcfg->memsegs[0];
+
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = internal_config.memory / page_sz;
+
+	if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
+		sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		return -1;
+	}
+
+	addr = eal_mem_alloc(internal_config.memory, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate %zu bytes",
+		internal_config.memory);
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->page_sz = page_sz;
+	msl->socket_id = 0;
+	msl->len = internal_config.memory;
+	msl->heap = 1;
+
+	/* populate memsegs. each memseg is one page long */
+	for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
+		arr = &msl->memseg_arr;
+
+		ms = rte_fbarray_get(arr, cur_seg);
+		ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, cur_seg);
+
+		addr = RTE_PTR_ADD(addr, (size_t)page_sz);
+	}
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, a stub must log a warning, and a comment
+ * must document what requires success emulation.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Non-stub function succeeds if multi-process is not supported. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* Common memory allocator depends on this function success. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..9735f0293 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,4 +52,78 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate a contiguous chunk of virtual memory.
+ *
+ * Use eal_mem_free() to free allocated memory.
+ *
+ * @param size
+ *  Number of bytes to allocate.
+ * @param page_size
+ *  If non-zero, memory must be allocated in hugepages
+ *  of the specified size. The *size* parameter must then be
+ *  a multiple of this hugepage size.
+ * @return
+ *  Address of allocated memory, NULL on failure and rte_errno is set.
+ */
+void *eal_mem_alloc(size_t size, size_t page_size);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with eal_mem_reserve()
+ * or decommitted from hugepages by eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and rte_errno is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit.
+ *
+ * The *addr* and *size* must match location and size
+ * of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..62805a307 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define open _open
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: a physical address of zero (0) is reported if the input address
+ * is paged out or not mapped. However, if the input is a valid mapping
+ * of I/O port 0x0000, the output is also zero. There is no way
+ * to distinguish between these cases by the return value alone.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
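
For illustration, user-mode code is expected to drive this interface
roughly as follows (a sketch, not part of the patch; the helper name is
hypothetical, and the device handle is assumed to come from CreateFile()
on the interface path located via SetupDi* enumeration of
GUID_DEVINTERFACE_VIRT2PHYS):

	static uint64_t
	virt2phys_translate(HANDLE device, void *virt)
	{
		LARGE_INTEGER phys;
		DWORD bytes_returned;

		if (!DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
				&virt, sizeof(virt), &phys, sizeof(phys),
				&bytes_returned, NULL))
			return 0; /* same ambiguity as unmapped input, see note */
		return phys.QuadPart;
	}
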
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 5f118bfe2..0bd56cd8f 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -8,6 +8,11 @@ sources += files(
 	'eal_debug.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
+	'eal_memalloc.c',
+	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations
  2020-04-17 12:24       ` Burakov, Anatoly
@ 2020-04-28 23:50         ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-04-28 23:50 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

On 2020-04-17 13:24 GMT+0100 Burakov, Anatoly wrote:
[snip]
> > +/** Behavior on file locking conflict. */
> > +enum eal_flock_mode {
> > +	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
> > +	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
> > +};  
> 
> Nitpicking, but why not blocking/unblocking? The terminology seems 
> pretty standard.

On Windows, LockFileEx() call may be non-blocking, but still configured to
fail if the lock is already taken. To avoid confusion, these names reflect
what the behavior will be instead of how it will be achieved.
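
A sketch of how both modes might map onto LockFileEx(), assuming an
exclusive lock over the whole file ("handle" and "mode" stand for the
wrapper's arguments; this is not the exact patch code):

	OVERLAPPED overlapped = {0};
	DWORD sys_flags = LOCKFILE_EXCLUSIVE_LOCK;

	if (mode == EAL_FLOCK_RETURN)
		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY; /* fail, don't wait */

	/* lock the largest possible range starting at offset 0 */
	if (!LockFileEx(handle, sys_flags, 0, MAXDWORD, MAXDWORD, &overlapped))
		return -1; /* with EAL_FLOCK_RETURN: already locked elsewhere */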

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-04-29  1:18       ` Ranjit Menon
  2020-05-01 19:19         ` Dmitry Kozlyuk
  2020-05-05 16:24       ` Burakov, Anatoly
  2020-05-13  8:24       ` Fady Bader
  2 siblings, 1 reply; 218+ messages in thread
From: Ranjit Menon @ 2020-04-29  1:18 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, John McNamara, Marko Kovacevic, Anatoly Burakov

On 4/28/2020 4:50 PM, Dmitry Kozlyuk wrote:
> Basic memory management supports core libraries and PMDs operating in
> IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
> IOVAs of hugepages allocated from user-mode. Multi-process mode is not
> implemented and is forcefully disabled at startup.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

[snip]

> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags)
> +{
> +	void *virt;
> +
> +	/* Windows requires hugepages to be committed. */
> +	if (flags & EAL_RESERVE_HUGEPAGES) {
> +		rte_errno = ENOTSUP;
> +		return NULL;
> +	}
> +
> +	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
> +		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
> +		NULL, 0);
> +	if (virt == NULL) {
> +		DWORD err = GetLastError();
> +		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
> +		set_errno_from_win32_alloc_error(err);
> +	}
> +
> +	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
> +		if (!VirtualFree(virt, 0, MEM_RELEASE))

Shouldn't this be VirtualFreeEx() here?

> +			RTE_LOG_WIN32_ERR("VirtualFree()");
> +		rte_errno = ENOMEM;
> +		return NULL;
> +	}
> +
> +	return virt;
> +}
> +

ranjit m.

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/8] eal: introduce internal wrappers for file operations
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 2/8] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-04-29 16:41       ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-04-29 16:41 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> EAL common code uses file locking and truncation. Introduce
> OS-independent wrappers in order to support both Linux/FreeBSD
> and Windows:
> 
> * eal_file_lock: lock or unlock an open file.
> * eal_file_truncate: enforce a given size for an open file.
> 
> Wrappers follow POSIX semantics, but interface is not POSIX,
> so that it can be made more clean, e.g. by not mixing locking
> operation and behaviour on conflict.
> 
> Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> which is intended for common code between the two. Files should be named
> after the ones from which the code is factored in OS subdirectory.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

>   
>   #include <rte_common.h>
>   #include <rte_log.h>
> @@ -85,7 +85,7 @@ resize_and_map(int fd, void *addr, size_t len)
>   	char path[PATH_MAX];
>   	void *map_addr;
>   
> -	if (ftruncate(fd, len)) {
> +	if (eal_file_truncate(fd, len)) {
>   		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
>   		/* pass errno up the chain */
>   		rte_errno = errno;

eal_file_truncate sets rte_errno, so no need to pass it up the chain any 
more.
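
I.e., assuming resize_and_map() keeps returning -1 on this path, the
error handling could be reduced to:

	if (eal_file_truncate(fd, len)) {
		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
		/* rte_errno is already set by eal_file_truncate() */
		return -1;
	}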

Otherwise,

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-04-29 17:13       ` Burakov, Anatoly
  2020-04-30 13:59         ` Burakov, Anatoly
  2020-05-01 19:00         ` Dmitry Kozlyuk
  0 siblings, 2 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-04-29 17:13 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> Introduce OS-independent wrappers for memory management operations used
> across DPDK and specifically in common code of EAL:
> 
> * rte_mem_map()
> * rte_mem_unmap()
> * rte_get_page_size()
> * rte_mem_lock()
> 
> Windows uses different APIs for memory mapping and reservation, while
> Unices reserve memory by mapping it. Introduce EAL private functions to
> support memory reservation in common code:
> 
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be more safe and
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

>   	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
>   
> @@ -105,24 +94,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>   			return NULL;
>   		}
>   
> -		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
> -				mmap_flags, -1, 0);
> -		if (mapped_addr == MAP_FAILED && allow_shrink)
> -			*size -= page_sz;
> +		mapped_addr = eal_mem_reserve(
> +			requested_addr, (size_t)map_sz, reserve_flags);
> +		if ((mapped_addr == NULL) && allow_shrink)
> +			size -= page_sz;

Should be *size -= page_sz, size is a pointer in this case.

>   
> -		if (mapped_addr != MAP_FAILED && addr_is_hint &&
> -		    mapped_addr != requested_addr) {
> +		if ((mapped_addr != NULL) && addr_is_hint &&
> +				(mapped_addr != requested_addr)) {
>   			try++;
>   			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
>   			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
>   				/* hint was not used. Try with another offset */
> -				munmap(mapped_addr, map_sz);
> -				mapped_addr = MAP_FAILED;
> +				eal_mem_free(mapped_addr, *size);

Why change map_sz to *size?

> +				mapped_addr = NULL;
>   				requested_addr = next_baseaddr;
>   			}
>   		}
>   	} while ((allow_shrink || addr_is_hint) &&
> -		 mapped_addr == MAP_FAILED && *size > 0);
> +		(mapped_addr == NULL) && (*size > 0));
>   

<snip>

> @@ -547,10 +531,10 @@ rte_eal_memdevice_init(void)
>   int
>   rte_mem_lock_page(const void *virt)
>   {
> -	unsigned long virtual = (unsigned long)virt;
> -	int page_size = getpagesize();
> -	unsigned long aligned = (virtual & ~(page_size - 1));
> -	return mlock((void *)aligned, page_size);
> +	uintptr_t virtual = (uintptr_t)virt;
> +	int page_size = rte_get_page_size();
> +	uintptr_t aligned = (virtual & ~(page_size - 1));

Might as well fix to use macros? e.g.

size_t pagesz = rte_get_page_size();
return rte_mem_lock(RTE_PTR_ALIGN(virt, pagesz), pagesz);

(also, note that rte_get_page_size() returns size_t, not int)

> +	return rte_mem_lock((void *)aligned, page_size);
>   }
>   
>   int
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index 3aafd892f..67ca83e47 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -11,6 +11,7 @@
>   

<snip>

> + *  Reservation size. Must be a multiple of system page size.
> + * @param flags
> + *  Reservation options, a combination of eal_mem_reserve_flags.
> + * @returns
> + *  Starting address of the reserved area on success, NULL on failure.
> + *  Callers must not access this memory until remapping it.
> + */
> +void *eal_mem_reserve(void *requested_addr, size_t size, int flags);

Should we also require requested_addr to be page-aligned?

Also, here and in other added API's, nitpick but our coding style guide 
(and the code style in this file) suggests that return value should be 
on a separate line, e.g.

void *
eal_mem_reserve(...);

> +
> +/**
> + * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
> + *
> + * If *virt* and *size* describe a part of the reserved region,
> + * only this part of the region is freed (accurately up to the system
> + * page size). If *virt* points to allocated memory, *size* must match
> + * the one specified on allocation. The behavior is undefined
> + * if the memory pointed by *virt* is obtained from another source
> + * than listed above.
> + *

<snip>

> +}
> +
> +static int
> +mem_rte_to_sys_prot(int prot)
> +{
> +	int sys_prot = 0;

Maybe set it to PROT_NONE to make it more obvious?

> +
> +	if (prot & RTE_PROT_READ)
> +		sys_prot |= PROT_READ;
> +	if (prot & RTE_PROT_WRITE)
> +		sys_prot |= PROT_WRITE;
> +	if (prot & RTE_PROT_EXECUTE)
> +		sys_prot |= PROT_EXEC;
> +
> +	return sys_prot;
> +}
> +
> +void *
> +rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	int sys_prot = 0;

Not necessary to initialize sys_prot (and it's counter-productive as 
compiler warning about uninitialized usage is a *good thing*!).

> +	int sys_flags = 0;
> +
> +	sys_prot = mem_rte_to_sys_prot(prot);
> +
> +	if (flags & RTE_MAP_SHARED)
> +		sys_flags |= MAP_SHARED;
> +	if (flags & RTE_MAP_ANONYMOUS)
> +		sys_flags |= MAP_ANONYMOUS;
> +	if (flags & RTE_MAP_PRIVATE)
> +		sys_flags |= MAP_PRIVATE;
> +	if (flags & RTE_MAP_FORCE_ADDRESS)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
> +}
> +
> +int
> +rte_mem_unmap(void *virt, size_t size)
> +{
> +	return mem_unmap(virt, size);
> +}
> +
> +size_t
> +rte_get_page_size(void)
> +{
> +	return sysconf(_SC_PAGESIZE);

Can we perhaps cache this value?
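
E.g. (a sketch; the page size cannot change at runtime, so a lazily
initialized static is enough):

	size_t
	rte_get_page_size(void)
	{
		static size_t page_size;

		if (page_size == 0)
			page_size = sysconf(_SC_PAGESIZE);
		return page_size;
	}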

> +}
> +
> +int
> +rte_mem_lock(const void *virt, size_t size)
> +{
> +	return mlock(virt, size);

This call can fail. It should pass errno as rte_errno as well, just like 
all other calls from this family.

Also, if the implementation "may require" page alignment, how about 
requiring it unconditionally?
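
E.g. something like:

	int
	rte_mem_lock(const void *virt, size_t size)
	{
		int ret = mlock(virt, size);

		if (ret)
			rte_errno = errno;
		return ret;
	}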

> +}
> diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
> index cfa1b4ef9..5734f26ad 100644
> --- a/lib/librte_eal/unix/meson.build
> +++ b/lib/librte_eal/unix/meson.build
> @@ -3,4 +3,5 @@
>   
>   sources += files(
>   	'eal_unix.c',
> +	'eal_unix_memory.c',
>   )
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers
  2020-04-29 17:13       ` Burakov, Anatoly
@ 2020-04-30 13:59         ` Burakov, Anatoly
  2020-05-01 19:00         ` Dmitry Kozlyuk
  1 sibling, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-04-30 13:59 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

On 29-Apr-20 6:13 PM, Burakov, Anatoly wrote:
>> @@ -547,10 +531,10 @@ rte_eal_memdevice_init(void)
>>   int
>>   rte_mem_lock_page(const void *virt)
>>   {
>> -    unsigned long virtual = (unsigned long)virt;
>> -    int page_size = getpagesize();
>> -    unsigned long aligned = (virtual & ~(page_size - 1));
>> -    return mlock((void *)aligned, page_size);
>> +    uintptr_t virtual = (uintptr_t)virt;
>> +    int page_size = rte_get_page_size();
>> +    uintptr_t aligned = (virtual & ~(page_size - 1));
> 
> Might as well fix to use macros? e.g.
> 
> size_t pagesz = rte_get_page_size();
> return rte_mem_lock(RTE_PTR_ALIGN(virt, pagesz), pagesz);
> 
> (also, note that rte_get_page_size() returns size_t, not int)

Apologies, this should've been RTE_PTR_ALIGN_FLOOR(virt, pagesz)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers
  2020-04-29 17:13       ` Burakov, Anatoly
  2020-04-30 13:59         ` Burakov, Anatoly
@ 2020-05-01 19:00         ` Dmitry Kozlyuk
  2020-05-05 14:43           ` Burakov, Anatoly
  1 sibling, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-01 19:00 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

Thanks for pointing out the errors, see some comments inline.

On 2020-04-29 18:13 GMT+0100 Burakov, Anatoly wrote:
> On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote: 
<snip>
> > + *  Reservation size. Must be a multiple of system page size.
> > + * @param flags
> > + *  Reservation options, a combination of eal_mem_reserve_flags.
> > + * @returns
> > + *  Starting address of the reserved area on success, NULL on failure.
> > + *  Callers must not access this memory until remapping it.
> > + */
> > +void *eal_mem_reserve(void *requested_addr, size_t size, int flags);  
> 
> Should we also require requested_addr to be page-aligned?

Yes.

> Also, here and in other added API's, nitpick but our coding style guide 
> (and the code style in this file) suggests that return value should be 
> on a separate line, e.g.
> 
> void *
> eal_mem_reserve(...);

Will follow your advice in v5 to keep the style within this file consistent.
However, DPDK Coding Style explicitly says:

	Unlike function definitions, the function prototypes do not need to
	place the function return type on a separate line.

[snip]
> > +
> > +int
> > +rte_mem_lock(const void *virt, size_t size)
> > +{
> > +	return mlock(virt, size);  
> 
> This call can fail. It should pass errno as rte_errno as well, just like 
> all other calls from this family.
> 
> Also, if the implementation "may require" page alignment, how about 
> requiring it unconditionally?

IMO even better to document this function as locking all pages crossed by the
address region. This would save address checking/alignment at call site and
all implementations work this way. Locking memory implies paging system.
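
A sketch of what I mean, using the existing alignment helpers (lock
every page crossed by the region):

	size_t page_sz = rte_get_page_size();
	uintptr_t start = RTE_ALIGN_FLOOR((uintptr_t)virt, page_sz);
	uintptr_t end = RTE_ALIGN_CEIL((uintptr_t)virt + size, page_sz);

	return mlock((void *)start, end - start);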

-- 
Dmtiry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-04-29  1:18       ` Ranjit Menon
@ 2020-05-01 19:19         ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-01 19:19 UTC (permalink / raw)
  To: Ranjit Menon
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, John McNamara, Marko Kovacevic, Anatoly Burakov

On 2020-04-28 18:18 GMT-0700 Ranjit Menon wrote:
> On 4/28/2020 4:50 PM, Dmitry Kozlyuk wrote:
[snip]
> > +void *
> > +eal_mem_reserve(void *requested_addr, size_t size, int flags)
> > +{
> > +	void *virt;
> > +
> > +	/* Windows requires hugepages to be committed. */
> > +	if (flags & EAL_RESERVE_HUGEPAGES) {
> > +		rte_errno = ENOTSUP;
> > +		return NULL;
> > +	}
> > +
> > +	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
> > +		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
> > +		NULL, 0);
> > +	if (virt == NULL) {
> > +		DWORD err = GetLastError();
> > +		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
> > +		set_errno_from_win32_alloc_error(err);

return NULL;
is also missing here, thanks for making me re-check this part.

> > +	}
> > +
> > +	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
> > +		if (!VirtualFree(virt, 0, MEM_RELEASE))  
> 
> Shouldn't this be VirtualFreeEx() here?

You're right, there were a few more places like this within the file.
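
For reference, the corrected call would be along these lines:

	if (!VirtualFreeEx(GetCurrentProcess(), virt, 0, MEM_RELEASE))
		RTE_LOG_WIN32_ERR("VirtualFreeEx()");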

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers
  2020-05-01 19:00         ` Dmitry Kozlyuk
@ 2020-05-05 14:43           ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-05 14:43 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

On 01-May-20 8:00 PM, Dmitry Kozlyuk wrote:
> Thanks for pointing out the errors, see some comments inline.
> 
> On 2020-04-29 18:13 GMT+0100 Burakov, Anatoly wrote:
>> On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> <snip>
>>> + *  Reservation size. Must be a multiple of system page size.
>>> + * @param flags
>>> + *  Reservation options, a combination of eal_mem_reserve_flags.
>>> + * @returns
>>> + *  Starting address of the reserved area on success, NULL on failure.
>>> + *  Callers must not access this memory until remapping it.
>>> + */
>>> +void *eal_mem_reserve(void *requested_addr, size_t size, int flags);
>>
>> Should we also require requested_addr to be page-aligned?
> 
> Yes.
> 
>> Also, here and in other added API's, nitpick but our coding style guide
>> (and the code style in this file) suggests that return value should be
>> on a separate line, e.g.
>>
>> void *
>> eal_mem_reserve(...);
> 
> Will follow your advice in v5 to keep the style within this file consistent.
> However, DPDK Coding Style explicitly says:
> 
> 	Unlike function definitions, the function prototypes do not need to
> 	place the function return type on a separate line.
> 
> [snip]
>>> +
>>> +int
>>> +rte_mem_lock(const void *virt, size_t size)
>>> +{
>>> +	return mlock(virt, size);
>>
>> This call can fail. It should pass errno as rte_errno as well, just like
>> all other calls from this family.
>>
>> Also, if the implementation "may require" page alignment, how about
>> requiring it unconditionally?
> 
> IMO even better to document this function as locking all pages crossed by the
> address region. This would save address checking/alignment at call site and
> all implementations work this way. Locking memory implies paging system.
> 

I don't think any other external API we provide does automagic pointer 
alignment, so i'm not sure if it indeed would be better to have it align 
automatically. It's also better from the standpoint of not silently 
allowing seemingly invalid arguments. So, i would lean on the side of 
requiring alignment, but not doing it ourselves.
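
I.e., reject unaligned input instead of silently aligning it, e.g. a
check like this at the top of rte_mem_lock() (sketch):

	if (((uintptr_t)virt & (rte_get_page_size() - 1)) != 0) {
		rte_errno = EINVAL;
		return -1;
	}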

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/8] eal: extract common code for memseg list initialization
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 4/8] eal: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-05-05 16:08       ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-05 16:08 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Bruce Richardson

On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> All supported OS create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

Would it be possible to extract all similar code and make it use 
eal_memseg_list_init? For example, in FreeBSD eal_memory.c in nohuge 
mode (or in Linux legacy mem) the code remains untouched. I wonder if it 
can be ported to using this function.

>   lib/librte_eal/common/eal_common_memory.c | 54 ++++++++++++++++++
>   lib/librte_eal/common/eal_private.h       | 36 ++++++++++++
>   lib/librte_eal/freebsd/eal_memory.c       | 57 +++----------------
>   lib/librte_eal/linux/eal_memory.c         | 68 +++++------------------
>   4 files changed, 113 insertions(+), 102 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
> index 1196a8037..56eff0acb 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -24,6 +24,7 @@
>   #include "eal_private.h"
>   #include "eal_internal_cfg.h"
>   #include "eal_memcfg.h"
> +#include "eal_options.h"
>   #include "malloc_heap.h"
>   
>   /*
> @@ -181,6 +182,59 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>   	return aligned_addr;
>   }
>   
> +int
> +eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
> +		int n_segs, int socket_id, int type_msl_idx, bool heap)
> +{
> +	char name[RTE_FBARRAY_NAME_LEN];
> +
> +	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
> +		 type_msl_idx);
> +	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
> +			sizeof(struct rte_memseg))) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
> +			rte_strerror(rte_errno));
> +		return -1;
> +	}
> +
> +	msl->page_sz = page_sz;
> +	msl->socket_id = socket_id;
> +	msl->base_va = NULL;
> +	msl->heap = heap;
> +
> +	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
> +			(size_t)page_sz >> 10, socket_id);

The log message looks odd here. Maybe change it to indicate that the kB 
value is the page size, not the size of the memory?

> +
> +	return 0;
> +}
> +
> +int
> +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
> +{
> +	uint64_t page_sz;
> +	size_t mem_sz;
> +	void *addr;
> +
> +	page_sz = msl->page_sz;
> +	mem_sz = page_sz * msl->memseg_arr.len;
> +
> +	addr = eal_get_virtual_area(
> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> +	if (addr == NULL) {
> +		if (rte_errno == EADDRNOTAVAIL)
> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> +				(unsigned long long)mem_sz, msl->base_va);
> +		else
> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> +		return -1;
> +	}
> +	msl->base_va = addr;
> +	msl->len = mem_sz;

Perhaps add a log message saying that space was allocated for this 
memseg list?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-04-29  1:18       ` Ranjit Menon
@ 2020-05-05 16:24       ` Burakov, Anatoly
  2020-05-05 23:20         ` Dmitry Kozlyuk
  2020-05-13  8:24       ` Fady Bader
  2 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-05 16:24 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic

On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> Basic memory management supports core libraries and PMDs operating in
> IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
> IOVAs of hugepages allocated from user-mode. Multi-process mode is not
> implemented and is forcefully disabled at startup.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

Lots of duplication... I wonder if it would be possible to share at 
least some of this code in common. Tracking down bugs because of 
duplicated code desync is always a pain...

<snip>

> diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
> index 7c21aa921..9fa7bf352 100644
> --- a/lib/librte_eal/common/eal_common_memzone.c
> +++ b/lib/librte_eal/common/eal_common_memzone.c
> @@ -19,7 +19,14 @@
>   #include <rte_errno.h>
>   #include <rte_string_fns.h>
>   #include <rte_common.h>
> +
> +#ifndef RTE_EXEC_ENV_WINDOWS
>   #include <rte_eal_trace.h>
> +#else
> +#define rte_eal_trace_memzone_reserve(...)
> +#define rte_eal_trace_memzone_lookup(...)
> +#define rte_eal_trace_memzone_free(...)
> +#endif
>   

Is it possible for rte_eal_trace.h to implement this workaround instead? 
It wouldn't be very wise to have to have this in each file that depends 
on rte_eal_trace.h.

>   #include "malloc_heap.h"
>   #include "malloc_elem.h"
> diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
> index 155da29b4..9bb234009 100644
> --- a/lib/librte_eal/common/meson.build
> +++ b/lib/librte_eal/common/meson.build
> @@ -9,11 +9,21 @@ if is_windows
>   		'eal_common_class.c',
>   		'eal_common_devargs.c',
>   		'eal_common_errno.c',

<snip>

>    /* Launch threads, called at application init(). */
>   int
>   rte_eal_init(int argc, char **argv)
> @@ -245,6 +346,13 @@ rte_eal_init(int argc, char **argv)
>   	if (fctret < 0)
>   		exit(1);
>   
> +	/* Prevent creation of shared memory files. */
> +	if (internal_config.no_shconf == 0) {
> +		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
> +			"but not available.\n");
> +		internal_config.no_shconf = 1;

In the future i would like to deprecate no_shconf because it's a strict 
subset of in_memory mode and serves the same purpose. Might i suggest 
using in_memory flag instead? IIRC no_shconf is automatically set when 
you set in_memory mode.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-05 16:24       ` Burakov, Anatoly
@ 2020-05-05 23:20         ` Dmitry Kozlyuk
  2020-05-06  9:46           ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-05 23:20 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic

On 2020-05-05 17:24 GMT+0100 Burakov, Anatoly wrote:
> On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> Lots of duplication... I wonder if it would be possible to share at 
> least some of this code in common. Tracking down bugs because of 
> duplicated code desync is always a pain...

This was the main question of the cover letter :)
Dmitry Malloy explained to me recently that even internally Windows has
no notion of preallocated hugepages and "memory types" (as memseg_primary_init
describes it). Since Windows EAL is not going to support multi-process any
time soon (if ever), maybe these reservations are not needed and the memory manager
should create MSLs and enforce socket limits dynamically? This way most of the
duplicated code can be removed, I think. Or does MSL reservation serve some
other purposes?

> > diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
> > index 7c21aa921..9fa7bf352 100644
> > --- a/lib/librte_eal/common/eal_common_memzone.c
> > +++ b/lib/librte_eal/common/eal_common_memzone.c
> > @@ -19,7 +19,14 @@
> >   #include <rte_errno.h>
> >   #include <rte_string_fns.h>
> >   #include <rte_common.h>
> > +
> > +#ifndef RTE_EXEC_ENV_WINDOWS
> >   #include <rte_eal_trace.h>
> > +#else
> > +#define rte_eal_trace_memzone_reserve(...)
> > +#define rte_eal_trace_memzone_lookup(...)
> > +#define rte_eal_trace_memzone_free(...)
> > +#endif
> >     
> 
> Is it possible for rte_eal_trace.h to implement this workaround instead? 
> It wouldn't be very wise to have to have this in each file that depends 
> on rte_eal_trace.h.

I can add a patch that makes each tracepoint a no-op on Windows.

We discussed this issue (spreading workarounds) 2020-04-30 on Windows
community call. The proper solution would be supporting trace on Windows, but
IIRC no one is yet directly assigned to do that.

[snip] 
> > +	/* Prevent creation of shared memory files. */
> > +	if (internal_config.no_shconf == 0) {
> > +		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
> > +			"but not available.\n");
> > +		internal_config.no_shconf = 1;  
> 
> In the future i would like to deprecate no_shconf because it's a strict 
> subset of in_memory mode and serves the same purpose. Might i suggest 
> using in_memory flag instead? IIRC no_shconf is automatically set when 
> you set in_memory mode.

OK, thanks.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows
  2020-04-15 11:17               ` Jerin Jacob
@ 2020-05-06  5:41                 ` Ray Kinsella
  0 siblings, 0 replies; 218+ messages in thread
From: Ray Kinsella @ 2020-05-06  5:41 UTC (permalink / raw)
  To: Jerin Jacob, Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Anatoly Burakov



On 15/04/2020 12:17, Jerin Jacob wrote:
> On Wed, Apr 15, 2020 at 4:39 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>>
>>
>>
>>> On Wed, Apr 15, 2020 at 4:02 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>>>>
>>>>> On Wed, Apr 15, 2020 at 1:16 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>>>>>>
>>>>>> Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
>>>>>> Enum rte_page_size has members valued above this limit, which get
>>>>>> wrapped to zero, resulting in compilation error (duplicate values in
>>>>>> enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
>>>>>>
>>>>>> Define these values outside of the enum for Clang on Windows only.
>>>>>> This does not affect runtime, because Windows doesn't run on machines
>>>>>> with 4GiB and 16GiB hugepages.
>>>>>>
>>>>>> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
>>>>>> ---
>>>>>>  lib/librte_eal/include/rte_memory.h | 6 ++++++
>>>>>>  1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
>>>>>> index 1b7c3e5df..3ec673f51 100644
>>>>>> --- a/lib/librte_eal/include/rte_memory.h
>>>>>> +++ b/lib/librte_eal/include/rte_memory.h
>>>>>> @@ -34,8 +34,14 @@ enum rte_page_sizes {
>>>>>>         RTE_PGSIZE_256M  = 1ULL << 28,
>>>>>>         RTE_PGSIZE_512M  = 1ULL << 29,
>>>>>>         RTE_PGSIZE_1G    = 1ULL << 30,
>>>>>> +/* Work around Clang on Windows being limited to 32-bit underlying type. */
>>>>>
>>>>> It does look like "enum rte_page_sizes" NOT used as enum anywhere.
>>>>>
>>>>> [master][dpdk.org] $ grep -ri "enum rte_page_sizes" lib/
>>>>> lib/librte_eal/include/rte_memory.h:enum rte_page_sizes {
>>>>>
>>>>> Why not remove this workaround and define all items as #define to
>>>>> avoid below ifdef clutter.
>>>>>
>>>>>> +#if !defined(RTE_CC_CLANG) || !defined(RTE_EXEC_ENV_WINDOWS)
>>>>>
>>>>> See above.
>>>>>
>>>>>>         RTE_PGSIZE_4G    = 1ULL << 32,
>>>>>>         RTE_PGSIZE_16G   = 1ULL << 34,
>>>>>> +#else
>>>>>> +#define RTE_PGSIZE_4G  (1ULL << 32)
>>>>>> +#define RTE_PGSIZE_16G (1ULL << 34)
>>>>>> +#endif
>>>>>>  };
>>>>>>
>>>>>>  #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>
>>>> This is a public header and removing enum rte_page_sizes will break API.
>>>> Moving members out of enum while keeping enum itself might break compilation
>>>> because of integer constants being converted to enum (with -Werror).
>>>
>>> If none of the public API is using this enum then I think, we may not
>>> need to make this enum as public.
>>
>> Agreed.
>>
>>> Since it has ULL, I believe both cases(enum or define), it will be
>>> treated as unsigned long long. ie. NO ABI breakage.
>>
>> I was talking about API only (compile-time compatibility). Getting rid of
>> #ifdef and workarounds sounds right, we'll just need a notice in release
>> notes.
> 
> Good to check ./devtools/check-abi.sh for any ABI breakage.

or something like this to cover all the bases.
DPDK_ABI_REF_DIR=/build/dpdk/reference/ DPDK_ABI_REF_VERSION=v20.02 ./devtools/test-meson-builds.sh


>>
>> --
>> Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-05 23:20         ` Dmitry Kozlyuk
@ 2020-05-06  9:46           ` Burakov, Anatoly
  2020-05-06 21:53             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-06  9:46 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic

On 06-May-20 12:20 AM, Dmitry Kozlyuk wrote:
> On 2020-05-05 17:24 GMT+0100 Burakov, Anatoly wrote:
>> On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
>> Lots of duplication... I wonder if it would be possible to share at
>> least some of this code in common. Tracking down bugs because of
>> duplicated code desync is always a pain...
> 
> This was the main question of the cover letter :)
> Dmitry Malloy explained to me recently that even internally Windows has
> no notion of preallocated hugepages and "memory types" (as memseg_primary_init
> describes it). Since Windows EAL is not going to support multi-process any
> time soon (if ever), maybe these reservations are not needed and memory manger
> should create MSLs and enforce socket limits dynamically? This way most of the
> duplicated code can be removed, I think. Or does MSL reservation serve some
> other purposes?

MSL reservation serves the purpose of dynamically expanding memory 
usage. If there is no notion of NUMA nodes or multiple page sizes, then 
you can greatly simplify the code, but you'd still need *some* usage of 
MSL's if you plan to support dynamically allocating memory, or 
supporting externally allocated memory (i assume it's out of scope for 
now, since you can't do IOVA as VA).

So, yes, you could greatly simplify the memory management code *if* you 
were to go FreeBSD way and not allow dynamic page reservation. If you 
do, however, then i would guess that you'd end up writing something 
that's largely similar to existing Linux code (minus multiprocess) and 
so would just be duplicating effort.

> 
>>> diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
>>> index 7c21aa921..9fa7bf352 100644
>>> --- a/lib/librte_eal/common/eal_common_memzone.c
>>> +++ b/lib/librte_eal/common/eal_common_memzone.c
>>> @@ -19,7 +19,14 @@
>>>    #include <rte_errno.h>
>>>    #include <rte_string_fns.h>
>>>    #include <rte_common.h>
>>> +
>>> +#ifndef RTE_EXEC_ENV_WINDOWS
>>>    #include <rte_eal_trace.h>
>>> +#else
>>> +#define rte_eal_trace_memzone_reserve(...)
>>> +#define rte_eal_trace_memzone_lookup(...)
>>> +#define rte_eal_trace_memzone_free(...)
>>> +#endif
>>>      
>>
>> Is it possible for rte_eal_trace.h to implement this workaround instead?
>> It wouldn't be very wise to have to have this in each file that depends
>> on rte_eal_trace.h.
> 
> I can add a patch that makes each tracepoint a no-op on Windows.
> 
> We discussed this issue (spreading workarounds) 2020-04-30 on Windows
> community call. The proper solution would be supporting trace on Windows, but
> IIRC no one is yet directly assigned to do that.

Apologies, i'm not plugged into those discussions :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-06  9:46           ` Burakov, Anatoly
@ 2020-05-06 21:53             ` Dmitry Kozlyuk
  2020-05-07 11:57               ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-06 21:53 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic

On 2020-05-06 10:46 GMT+0100 Burakov, Anatoly wrote:
> On 06-May-20 12:20 AM, Dmitry Kozlyuk wrote:
> > On 2020-05-05 17:24 GMT+0100 Burakov, Anatoly wrote:  
> >> On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
> >> Lots of duplication... I wonder if it would be possible to share at
> >> least some of this code in common. Tracking down bugs because of
> >> duplicated code desync is always a pain...  
> > 
> > This was the main question of the cover letter :)
> > Dmitry Malloy explained to me recently that even internally Windows has
> > no notion of preallocated hugepages and "memory types" (as memseg_primary_init
> > describes it). Since Windows EAL is not going to support multi-process any
> > time soon (if ever), maybe these reservations are not needed and memory manager
> > should create MSLs and enforce socket limits dynamically? This way most of the
> > duplicated code can be removed, I think. Or does MSL reservation serve some
> > other purposes?  
> 
> MSL reservation serves the purpose of dynamically expanding memory 
> usage.

But expansion is limited during init, because alloc_more_mem_on_socket()
works with existing MSLs, correct? Not going to change anything there, just
trying to understand MM internals.

> If there is no notion of NUMA nodes or multiple page sizes, then 
> you can greatly simplify the code, but you'd still need *some* usage of 
> MSL's if you plan to support dynamically allocating memory, or 
> supporting externally allocated memory (i assume it's out of scope for 
> now, since you can't do IOVA as VA).

Windows is NUMA-aware and it supports both 2MB and 1GB hugepages (although
Windows EAL does not at the moment, because Win32 API is not yet official).
What I meant is that Windows does not reserve hugepages like Linux does with
vm.nr_hugepages or hugepage-related kernel options. So logic duplicated from
Linux EAL makes sense for Windows. The bulk of it can be extracted to some
common file, but it will not be truly common, rather "everything but
FreeBSD". Against it is a point that Windows MM may change significantly, but
I honestly can't come up with an example of how those duplicated parts
might require adjustments.

> So, yes, you could greatly simplify the memory management code *if* you 
> were to go FreeBSD way and not allow dynamic page reservation. If you 
> do, however, then i would guess that you'd end up writing something 
> that's largely similar to existing Linux code (minus multiprocess) and 
> so would just be duplicating effort.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-06 21:53             ` Dmitry Kozlyuk
@ 2020-05-07 11:57               ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-07 11:57 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman,
	Thomas Monjalon, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic

On 06-May-20 10:53 PM, Dmitry Kozlyuk wrote:
> On 2020-05-06 10:46 GMT+0100 Burakov, Anatoly wrote:
>> On 06-May-20 12:20 AM, Dmitry Kozlyuk wrote:
>>> On 2020-05-05 17:24 GMT+0100 Burakov, Anatoly wrote:
>>>> On 29-Apr-20 12:50 AM, Dmitry Kozlyuk wrote:
>>>> Lots of duplication... I wonder if it would be possible to share at
>>>> least some of this code in common. Tracking down bugs because of
>>>> duplicated code desync is always a pain...
>>>
>>> This was the main question of the cover letter :)
>>> Dmitry Malloy explained to me recently that even internally Windows has
>>> no notion of preallocated hugepages and "memory types" (as memseg_primary_init
>>> describes it). Since Windows EAL is not going to support multi-process any
>>> time soon (if ever), maybe these reservations are not needed and memory manager
>>> should create MSLs and enforce socket limits dynamically? This way most of the
>>> duplicated code can be removed, I think. Or does MSL reservation serve some
>>> other purposes?
>>
>> MSL reservation serves the purpose of dynamically expanding memory
>> usage.
> 
> But expansion is limited during init, because alloc_more_mem_on_socket()
> works with existing MSLs, correct? Not going to change anything there, just
> trying to understand MM internals.

Yes, system memory MSLs will stay the same for the duration of the 
program, they are not allocated on the fly. External memory will 
create/destroy MSLs but those too aren't allocated dynamically - there's 
a fixed number of MSLs, and if you run out, well, you're out of luck.

So no, the MSLs themselves don't get allocated/deallocated at runtime 
*if* they belong to system (internal) memory.
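
To illustrate the fixed bound (a sketch of the shared config layout from
memory, not a verbatim quote of rte_eal_memconfig.h):

	struct rte_mem_config {
		/* ... */
		/* Every MSL that can ever exist occupies a slot here;
		 * "running out" means no free slot is left in this
		 * statically sized array.
		 */
		struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS];
		/* ... */
	};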

> 
>> If there is no notion of NUMA nodes or multiple page sizes, then
>> you can greatly simplify the code, but you'd still need *some* usage of
>> MSL's if you plan to support dynamically allocating memory, or
>> supporting externally allocated memory (i assume it's out of scope for
>> now, since you can't do IOVA as VA).
> 
> Windows is NUMA-aware and it supports both 2MB and 1GB hugepages (although
> Windows EAL does not at the moment, because Win32 API is not yet official).
> What I meant is that Windows does not reserve hugepages like Linux does with
> vm.nr_hugepages or hugepage-related kernel options. So logic duplicated from
> Linux EAL makes sense for Windows. The bulk of it can be extracted to some
> common file, but it will not be truly common, rather "everything but
> FreeBSD". Against it is a point that Windows MM may change significantly, but
> I honestly can't come up with an example of how those duplicated parts
> might require adjustments.

Fair enough. It's your EAL, you can do as you like :)

> 
>> So, yes, you could greatly simplify the memory management code *if* you
>> were to go FreeBSD way and not allow dynamic page reservation. If you
>> do, however, then i would guess that you'd end up writing something
>> that's largely similar to existing Linux code (minus multiprocess) and
>> so would just be duplicating effort.
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-04-29  1:18       ` Ranjit Menon
  2020-05-05 16:24       ` Burakov, Anatoly
@ 2020-05-13  8:24       ` Fady Bader
  2020-05-13  8:42         ` Dmitry Kozlyuk
  2 siblings, 1 reply; 218+ messages in thread
From: Fady Bader @ 2020-05-13  8:24 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

Hi Dmitry,
I'm using your latest memory management patchset and getting an error in
the function VirtualAlloc2 in eal_mem_commit, error code: 0x57 (ERROR_INVALID_PARAMETER).
I'm using Windows server 2019 build 17763, and followed the steps to Grant *Lock pages in memory* Privilege.

The parameters that are sent to the function are:
GetCurrentProcess() is -1.
requested_addr is 0x0000025b`93800000.
Size is 0x200000 (sysInfo.dwAllocationGranularity is 0x10000). 
Flags is 0x20007000.
Also, Socket_id is 0.

The call stack is:
00 dpdk_mempool_test!eal_mem_commit+0x253 
01 dpdk_mempool_test!alloc_seg+0x1b0
02 dpdk_mempool_test!alloc_seg_walk+0x2a1 
03 dpdk_mempool_test!rte_memseg_list_walk_thread_unsafe+0x81 
04 dpdk_mempool_test!eal_memalloc_alloc_seg_bulk+0x1a5 
05 dpdk_mempool_test!alloc_pages_on_heap+0x13a 
06 dpdk_mempool_test!try_expand_heap_primary+0x1dc 
07 dpdk_mempool_test!try_expand_heap+0xf5 
08 dpdk_mempool_test!alloc_more_mem_on_socket+0x693 
09 dpdk_mempool_test!malloc_heap_alloc_on_heap_id+0x2a7 
0a dpdk_mempool_test!malloc_heap_alloc+0x184 
0b dpdk_mempool_test!malloc_socket+0xf9
0c dpdk_mempool_test!rte_malloc_socket+0x39 
0d dpdk_mempool_test!rte_zmalloc_socket+0x31 
0e dpdk_mempool_test!rte_zmalloc+0x2d 
0f dpdk_mempool_test!rte_mempool_create_empty+0x1c9 
10 dpdk_mempool_test!rte_mempool_create+0xf8 
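
For reference, the reported flags decode to exactly the combination
eal_mem_commit() passes when committing into a placeholder (a sketch
using the documented winnt.h constants; MEM_REPLACE_PLACEHOLDER is the
VirtualAlloc2() extension also defined in the patch):

	/* Flags 0x20007000 from the failing VirtualAlloc2() call: */
	#define MEM_COMMIT              0x00001000
	#define MEM_RESERVE             0x00002000
	#define MEM_REPLACE_PLACEHOLDER 0x00004000
	#define MEM_LARGE_PAGES         0x20000000
	/* MEM_LARGE_PAGES | MEM_REPLACE_PLACEHOLDER | MEM_RESERVE
	 * | MEM_COMMIT == 0x20007000, the value reported above.
	 */

so the flags themselves look well-formed; whether the target range is
still an exact placeholder at that point is only a guess, not verified
here.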

> -----Original Message-----
> From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> Sent: Wednesday, April 29, 2020 2:50 AM
> To: dev@dpdk.org
> Cc: Dmitry Malloy (MESHCHANINOV) <dmitrym@microsoft.com>; Narcisa
> Ana Maria Vasile <Narcisa.Vasile@microsoft.com>; Fady Bader
> <fady@mellanox.com>; Tal Shnaiderman <talshn@mellanox.com>; Dmitry
> Kozlyuk <dmitry.kozliuk@gmail.com>; Thomas Monjalon
> <thomas@monjalon.net>; Harini Ramakrishnan
> <harini.ramakrishnan@microsoft.com>; Omar Cardona
> <ocardona@microsoft.com>; Pallavi Kadam <pallavi.kadam@intel.com>;
> Ranjit Menon <ranjit.menon@intel.com>; John McNamara
> <john.mcnamara@intel.com>; Marko Kovacevic
> <marko.kovacevic@intel.com>; Anatoly Burakov
> <anatoly.burakov@intel.com>
> Subject: [PATCH v4 8/8] eal/windows: implement basic memory
> management
> 
> Basic memory management supports core libraries and PMDs operating in
> IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
> IOVAs of hugepages allocated from user-mode. Multi-process mode is not
> implemented and is forcefully disabled at startup.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  config/meson.build                            |   12 +-
>  doc/guides/windows_gsg/run_apps.rst           |   54 +-
>  lib/librte_eal/common/eal_common_memzone.c    |    7 +
>  lib/librte_eal/common/meson.build             |   10 +
>  lib/librte_eal/common/rte_malloc.c            |    9 +
>  lib/librte_eal/rte_eal_exports.def            |  119 ++
>  lib/librte_eal/windows/eal.c                  |  144 ++
>  lib/librte_eal/windows/eal_memalloc.c         |  418 ++++++
>  lib/librte_eal/windows/eal_memory.c           | 1155 +++++++++++++++++
>  lib/librte_eal/windows/eal_mp.c               |  103 ++
>  lib/librte_eal/windows/eal_windows.h          |   90 ++
>  lib/librte_eal/windows/include/meson.build    |    1 +
>  lib/librte_eal/windows/include/rte_os.h       |    4 +
>  .../windows/include/rte_virt2phys.h           |   34 +
>  lib/librte_eal/windows/include/rte_windows.h  |    2 +
>  lib/librte_eal/windows/include/unistd.h       |    3 +
>  lib/librte_eal/windows/meson.build            |    5 +
>  17 files changed, 2164 insertions(+), 6 deletions(-)
>  create mode 100644 lib/librte_eal/windows/eal_memalloc.c
>  create mode 100644 lib/librte_eal/windows/eal_memory.c
>  create mode 100644 lib/librte_eal/windows/eal_mp.c
>  create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h
> 
> diff --git a/config/meson.build b/config/meson.build
> index 74f163223..800b5ba33 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -264,15 +264,21 @@ if is_freebsd
>  endif
> 
>  if is_windows
> -	# Minimum supported API is Windows 7.
> -	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
> +	# VirtualAlloc2() is available since Windows 10 / Server 2016.
> +	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
> 
> 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
>  	if cc.get_id() == 'gcc'
> 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
>  	endif
> 
> -	add_project_link_arguments('-ladvapi32', language: 'c')
> +	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
> +	# in Windows SDK, while MinGW exports it by advapi32.a.
> +	if is_ms_linker
> +		add_project_link_arguments('-lmincore', language: 'c')
> +	endif
> +
> +	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
>  endif
> 
>  if get_option('b_lto')
> diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
> index 21ac7f6c1..78e5a614f 100644
> --- a/doc/guides/windows_gsg/run_apps.rst
> +++ b/doc/guides/windows_gsg/run_apps.rst
> @@ -7,10 +7,10 @@ Running DPDK Applications
>  Grant *Lock pages in memory* Privilege
>  --------------------------------------
> 
> -Use of hugepages ("large pages" in Windows terminolocy) requires
> +Use of hugepages ("large pages" in Windows terminology) requires
>  ``SeLockMemoryPrivilege`` for the user running an application.
> 
> -1. Open *Local Security Policy* snap in, either:
> +1. Open *Local Security Policy* snap-in, either:
> 
>     * Control Panel / Computer Management / Local Security Policy;
>     * or Win+R, type ``secpol``, press Enter.
> @@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
> 
>  See `Large-Page Support`_ in MSDN for details.
> 
> -.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
> +.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
> +
> +
> +Load virt2phys Driver
> +---------------------
> +
> +Access to physical addresses is provided by a kernel-mode driver, virt2phys.
> +It is mandatory at least for using hardware PMDs, but may also be required
> +for mempools.
> +
> +Refer to documentation in ``dpdk-kmods`` repository for details on system
> +setup, driver build and installation. This driver is not signed, so signature
> +checking must be disabled to load it.
> +
> +.. warning::
> +
> +    Disabling driver signature enforcement weakens OS security.
> +    It is discouraged in production environments.
> +
> +Compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
> +and ``virt2phys.sys``. It can be installed as follows
> +from Elevated Command Prompt:
> +
> +.. code-block:: console
> +
> +    pnputil /add-driver Z:\path\to\virt2phys.inf /install
> +
> +On Windows Server additional steps are required:
> +
> +1. From Device Manager, Action menu, select "Add legacy hardware".
> +2. It will launch the "Add Hardware Wizard". Click "Next".
> +3. Select second option "Install the hardware that I manually select
> +   from a list (Advanced)".
> +4. On the next screen, "Kernel bypass" will be shown as a device class.
> +5. Select it, and click "Next".
> +6. The previously installed drivers will now be installed for the
> +   "Virtual to physical address translator" device.
> +
> +When loaded successfully, the driver is shown in *Device Manager* as *Virtual
> +to physical address translator* device under *Kernel bypass* category.
> +Installed driver persists across reboots.
> +
> +If DPDK is unable to communicate with the driver, a warning is printed
> +on initialization (debug-level logs provide more details):
> +
> +.. code-block:: text
> +
> +    EAL: Cannot open virt2phys driver interface
> +
> 
> 
>  Run the ``helloworld`` Example
> diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c
> index 7c21aa921..9fa7bf352 100644
> --- a/lib/librte_eal/common/eal_common_memzone.c
> +++ b/lib/librte_eal/common/eal_common_memzone.c
> @@ -19,7 +19,14 @@
>  #include <rte_errno.h>
>  #include <rte_string_fns.h>
>  #include <rte_common.h>
> +
> +#ifndef RTE_EXEC_ENV_WINDOWS
>  #include <rte_eal_trace.h>
> +#else
> +#define rte_eal_trace_memzone_reserve(...)
> +#define rte_eal_trace_memzone_lookup(...)
> +#define rte_eal_trace_memzone_free(...)
> +#endif
> 
>  #include "malloc_heap.h"
>  #include "malloc_elem.h"
> diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
> index 155da29b4..9bb234009 100644
> --- a/lib/librte_eal/common/meson.build
> +++ b/lib/librte_eal/common/meson.build
> @@ -9,11 +9,21 @@ if is_windows
>  		'eal_common_class.c',
>  		'eal_common_devargs.c',
>  		'eal_common_errno.c',
> +		'eal_common_fbarray.c',
>  		'eal_common_launch.c',
>  		'eal_common_lcore.c',
>  		'eal_common_log.c',
> +		'eal_common_mcfg.c',
> +		'eal_common_memalloc.c',
> +		'eal_common_memory.c',
> +		'eal_common_memzone.c',
>  		'eal_common_options.c',
> +		'eal_common_string_fns.c',
> +		'eal_common_tailqs.c',
>  		'eal_common_thread.c',
> +		'malloc_elem.c',
> +		'malloc_heap.c',
> +		'rte_malloc.c',
>  		'rte_option.c',
>  	)
>  	subdir_done()
> diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
> index f1b73168b..34b416927 100644
> --- a/lib/librte_eal/common/rte_malloc.c
> +++ b/lib/librte_eal/common/rte_malloc.c
> @@ -20,7 +20,16 @@
>  #include <rte_lcore.h>
>  #include <rte_common.h>
>  #include <rte_spinlock.h>
> +
> +#ifndef RTE_EXEC_ENV_WINDOWS
>  #include <rte_eal_trace.h>
> +#else
> +/* Suppress -Wempty-body for tracepoints used as "if" body. */
> +#define rte_eal_trace_mem_malloc(...) do {} while (0)
> +#define rte_eal_trace_mem_zmalloc(...) do {} while (0)
> +#define rte_eal_trace_mem_realloc(...) do {} while (0)
> +#define rte_eal_trace_mem_free(...) do {} while (0)
> +#endif
> 
>  #include <rte_malloc.h>
>  #include "malloc_elem.h"
> diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
> index 12a6c79d6..854b83bcd 100644
> --- a/lib/librte_eal/rte_eal_exports.def
> +++ b/lib/librte_eal/rte_eal_exports.def
> @@ -1,9 +1,128 @@
>  EXPORTS
>  	__rte_panic
> +	rte_calloc
> +	rte_calloc_socket
>  	rte_eal_get_configuration
> +	rte_eal_has_hugepages
>  	rte_eal_init
> +	rte_eal_iova_mode
>  	rte_eal_mp_remote_launch
>  	rte_eal_mp_wait_lcore
> +	rte_eal_process_type
>  	rte_eal_remote_launch
> +	rte_eal_tailq_lookup
> +	rte_eal_tailq_register
> +	rte_eal_using_phys_addrs
> +	rte_free
>  	rte_log
> +	rte_malloc
> +	rte_malloc_dump_stats
> +	rte_malloc_get_socket_stats
> +	rte_malloc_set_limit
> +	rte_malloc_socket
> +	rte_malloc_validate
> +	rte_malloc_virt2iova
> +	rte_mcfg_mem_read_lock
> +	rte_mcfg_mem_read_unlock
> +	rte_mcfg_mem_write_lock
> +	rte_mcfg_mem_write_unlock
> +	rte_mcfg_mempool_read_lock
> +	rte_mcfg_mempool_read_unlock
> +	rte_mcfg_mempool_write_lock
> +	rte_mcfg_mempool_write_unlock
> +	rte_mcfg_tailq_read_lock
> +	rte_mcfg_tailq_read_unlock
> +	rte_mcfg_tailq_write_lock
> +	rte_mcfg_tailq_write_unlock
> +	rte_mem_lock_page
> +	rte_mem_virt2iova
> +	rte_mem_virt2phy
> +	rte_memory_get_nchannel
> +	rte_memory_get_nrank
> +	rte_memzone_dump
> +	rte_memzone_free
> +	rte_memzone_lookup
> +	rte_memzone_reserve
> +	rte_memzone_reserve_aligned
> +	rte_memzone_reserve_bounded
> +	rte_memzone_walk
>  	rte_vlog
> +	rte_realloc
> +	rte_zmalloc
> +	rte_zmalloc_socket
> +
> +	rte_mp_action_register
> +	rte_mp_action_unregister
> +	rte_mp_reply
> +	rte_mp_sendmsg
> +
> +	rte_fbarray_attach
> +	rte_fbarray_destroy
> +	rte_fbarray_detach
> +	rte_fbarray_dump_metadata
> +	rte_fbarray_find_contig_free
> +	rte_fbarray_find_contig_used
> +	rte_fbarray_find_idx
> +	rte_fbarray_find_next_free
> +	rte_fbarray_find_next_n_free
> +	rte_fbarray_find_next_n_used
> +	rte_fbarray_find_next_used
> +	rte_fbarray_get
> +	rte_fbarray_init
> +	rte_fbarray_is_used
> +	rte_fbarray_set_free
> +	rte_fbarray_set_used
> +	rte_malloc_dump_heaps
> +	rte_mem_alloc_validator_register
> +	rte_mem_alloc_validator_unregister
> +	rte_mem_check_dma_mask
> +	rte_mem_event_callback_register
> +	rte_mem_event_callback_unregister
> +	rte_mem_iova2virt
> +	rte_mem_virt2memseg
> +	rte_mem_virt2memseg_list
> +	rte_memseg_contig_walk
> +	rte_memseg_list_walk
> +	rte_memseg_walk
> +	rte_mp_request_async
> +	rte_mp_request_sync
> +
> +	rte_fbarray_find_prev_free
> +	rte_fbarray_find_prev_n_free
> +	rte_fbarray_find_prev_n_used
> +	rte_fbarray_find_prev_used
> +	rte_fbarray_find_rev_contig_free
> +	rte_fbarray_find_rev_contig_used
> +	rte_memseg_contig_walk_thread_unsafe
> +	rte_memseg_list_walk_thread_unsafe
> +	rte_memseg_walk_thread_unsafe
> +
> +	rte_malloc_heap_create
> +	rte_malloc_heap_destroy
> +	rte_malloc_heap_get_socket
> +	rte_malloc_heap_memory_add
> +	rte_malloc_heap_memory_attach
> +	rte_malloc_heap_memory_detach
> +	rte_malloc_heap_memory_remove
> +	rte_malloc_heap_socket_is_external
> +	rte_mem_check_dma_mask_thread_unsafe
> +	rte_mem_set_dma_mask
> +	rte_memseg_get_fd
> +	rte_memseg_get_fd_offset
> +	rte_memseg_get_fd_offset_thread_unsafe
> +	rte_memseg_get_fd_thread_unsafe
> +
> +	rte_extmem_attach
> +	rte_extmem_detach
> +	rte_extmem_register
> +	rte_extmem_unregister
> +
> +	rte_fbarray_find_biggest_free
> +	rte_fbarray_find_biggest_used
> +	rte_fbarray_find_rev_biggest_free
> +	rte_fbarray_find_rev_biggest_used
> +
> +	rte_get_page_size
> +	rte_mem_lock
> +	rte_mem_map
> +	rte_mem_unmap
> diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
> index 63461f51a..38f17f09c 100644
> --- a/lib/librte_eal/windows/eal.c
> +++ b/lib/librte_eal/windows/eal.c
> @@ -93,6 +93,24 @@ eal_proc_type_detect(void)
>  	return ptype;
>  }
> 
> +enum rte_proc_type_t
> +rte_eal_process_type(void)
> +{
> +	return rte_config.process_type;
> +}
> +
> +int
> +rte_eal_has_hugepages(void)
> +{
> +	return !internal_config.no_hugetlbfs;
> +}
> +
> +enum rte_iova_mode
> +rte_eal_iova_mode(void)
> +{
> +	return rte_config.iova_mode;
> +}
> +
>  /* display usage */
>  static void
>  eal_usage(const char *prgname)
> @@ -224,6 +242,89 @@ rte_eal_init_alert(const char *msg)
>  	RTE_LOG(ERR, EAL, "%s\n", msg);
>  }
> 
> +int
> +eal_file_truncate(int fd, ssize_t size)
> +{
> +	HANDLE handle;
> +	DWORD ret;
> +	LONG low = (LONG)((size_t)size);
> +	LONG high = (LONG)((size_t)size >> 32);
> +
> +	handle = (HANDLE)_get_osfhandle(fd);
> +	if (handle == INVALID_HANDLE_VALUE) {
> +		rte_errno = EBADF;
> +		return -1;
> +	}
> +
> +	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
> +	if (ret == INVALID_SET_FILE_POINTER) {
> +		RTE_LOG_WIN32_ERR("SetFilePointer()");
> +		rte_errno = EINVAL;
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
> +{
> +	DWORD sys_flags = 0;
> +	OVERLAPPED overlapped;
> +
> +	if (op == EAL_FLOCK_EXCLUSIVE)
> +		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
> +	if (mode == EAL_FLOCK_RETURN)
> +		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
> +
> +	memset(&overlapped, 0, sizeof(overlapped));
> +	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
> +		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
> +			(GetLastError() == ERROR_IO_PENDING)) {
> +			rte_errno = EWOULDBLOCK;
> +		} else {
> +			RTE_LOG_WIN32_ERR("LockFileEx()");
> +			rte_errno = EINVAL;
> +		}
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +unlock_file(HANDLE handle)
> +{
> +	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
> +		RTE_LOG_WIN32_ERR("UnlockFileEx()");
> +		rte_errno = EINVAL;
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +int
> +eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
> +{
> +	HANDLE handle = (HANDLE)_get_osfhandle(fd);
> +
> +	if (handle == INVALID_HANDLE_VALUE) {
> +		rte_errno = EBADF;
> +		return -1;
> +	}
> +
> +	switch (op) {
> +	case EAL_FLOCK_EXCLUSIVE:
> +	case EAL_FLOCK_SHARED:
> +		return lock_file(handle, op, mode);
> +	case EAL_FLOCK_UNLOCK:
> +		return unlock_file(handle);
> +	default:
> +		rte_errno = EINVAL;
> +		return -1;
> +	}
> +}
> +
>   /* Launch threads, called at application init(). */
>  int
>  rte_eal_init(int argc, char **argv)
> @@ -245,6 +346,13 @@ rte_eal_init(int argc, char **argv)
>  	if (fctret < 0)
>  		exit(1);
> 
> +	/* Prevent creation of shared memory files. */
> +	if (internal_config.no_shconf == 0) {
> +		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
> +			"but not available.\n");
> +		internal_config.no_shconf = 1;
> +	}
> +
>  	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0))
> {
>  		rte_eal_init_alert("Cannot get hugepage information");
>  		rte_errno = EACCES;
> @@ -256,6 +364,42 @@ rte_eal_init(int argc, char **argv)
>  			internal_config.memory =
> MEMSIZE_IF_NO_HUGE_PAGE;
>  	}
> 
> +	if (eal_mem_win32api_init() < 0) {
> +		rte_eal_init_alert("Cannot access Win32 memory management");
> +		rte_errno = ENOTSUP;
> +		return -1;
> +	}
> +
> +	if (eal_mem_virt2iova_init() < 0) {
> +		/* Non-fatal error if physical addresses are not required. */
> +		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
> +			"PA will not be available\n");
> +	}
> +
> +	if (rte_eal_memzone_init() < 0) {
> +		rte_eal_init_alert("Cannot init memzone");
> +		rte_errno = ENODEV;
> +		return -1;
> +	}
> +
> +	if (rte_eal_memory_init() < 0) {
> +		rte_eal_init_alert("Cannot init memory");
> +		rte_errno = ENOMEM;
> +		return -1;
> +	}
> +
> +	if (rte_eal_malloc_heap_init() < 0) {
> +		rte_eal_init_alert("Cannot init malloc heap");
> +		rte_errno = ENODEV;
> +		return -1;
> +	}
> +
> +	if (rte_eal_tailqs_init() < 0) {
> +		rte_eal_init_alert("Cannot init tail queues for objects");
> +		rte_errno = EFAULT;
> +		return -1;
> +	}
> +
>  	eal_thread_init_master(rte_config.master_lcore);
> 
>  	RTE_LCORE_FOREACH_SLAVE(i) {
> diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
> new file mode 100644
> index 000000000..e72e785b8
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_memalloc.c
> @@ -0,0 +1,418 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <rte_errno.h>
> +#include <rte_os.h>
> +#include <rte_windows.h>
> +
> +#include "eal_internal_cfg.h"
> +#include "eal_memalloc.h"
> +#include "eal_memcfg.h"
> +#include "eal_private.h"
> +#include "eal_windows.h"
> +
> +int
> +eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
> +{
> +	/* Hugepages have no associated files in Windows. */
> +	RTE_SET_USED(list_idx);
> +	RTE_SET_USED(seg_idx);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
> +{
> +	/* Hugepages have no associated files in Windows. */
> +	RTE_SET_USED(list_idx);
> +	RTE_SET_USED(seg_idx);
> +	RTE_SET_USED(offset);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +static int
> +alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
> +	struct hugepage_info *hi)
> +{
> +	HANDLE current_process;
> +	unsigned int numa_node;
> +	size_t alloc_sz;
> +	void *addr;
> +	rte_iova_t iova = RTE_BAD_IOVA;
> +	PSAPI_WORKING_SET_EX_INFORMATION info;
> +	PSAPI_WORKING_SET_EX_BLOCK *page;
> +
> +	if (ms->len > 0) {
> +		/* If a segment is already allocated as needed, return it. */
> +		if ((ms->addr == requested_addr) &&
> +			(ms->socket_id == socket_id) &&
> +			(ms->hugepage_sz == hi->hugepage_sz)) {
> +			return 0;
> +		}
> +
> +		/* Bugcheck, should not happen. */
> +		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
> +			"(size %zu) on socket %d", ms->addr,
> +			ms->len, ms->socket_id);
> +		return -1;
> +	}
> +
> +	current_process = GetCurrentProcess();
> +	numa_node = eal_socket_numa_node(socket_id);
> +	alloc_sz = hi->hugepage_sz;
> +
> +	if (requested_addr == NULL) {
> +		/* Request a new chunk of memory from OS. */
> +		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
> +		if (addr == NULL) {
> +			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
> +				"on socket %d\n", alloc_sz, socket_id);
> +			return -1;
> +		}
> +	} else {
> +		/* Requested address is already reserved, commit memory. */
> +		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
> +		if (addr == NULL) {
> +			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
> +				"(size %zu) on socket %d\n",
> +				requested_addr, alloc_sz, socket_id);
> +			return -1;
> +		}
> +	}
> +
> +	/* Force OS to allocate a physical page and select a NUMA node.
> +	 * Hugepages are not pageable in Windows, so there's no race
> +	 * for physical address.
> +	 */
> +	*(volatile int *)addr = *(volatile int *)addr;
> +
> +	/* Only try to obtain IOVA if it's available, so that applications
> +	 * that do not need IOVA can use this allocator.
> +	 */
> +	if (rte_eal_using_phys_addrs()) {
> +		iova = rte_mem_virt2iova(addr);
> +		if (iova == RTE_BAD_IOVA) {
> +			RTE_LOG(DEBUG, EAL,
> +				"Cannot get IOVA of allocated segment\n");
> +			goto error;
> +		}
> +	}
> +
> +	/* Only "Ex" function can handle hugepages. */
> +	info.VirtualAddress = addr;
> +	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
> +		RTE_LOG_WIN32_ERR("QueryWorkingSetEx()");
> +		goto error;
> +	}
> +
> +	page = &info.VirtualAttributes;
> +	if (!page->Valid || !page->LargePage) {
> +		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
> +		goto error;
> +	}
> +	if (page->Node != numa_node) {
> +		RTE_LOG(DEBUG, EAL,
> +			"NUMA node hint %u (socket %d) not respected, got %u\n",
> +			numa_node, socket_id, page->Node);
> +		goto error;
> +	}
> +
> +	ms->addr = addr;
> +	ms->hugepage_sz = hi->hugepage_sz;
> +	ms->len = alloc_sz;
> +	ms->nchannel = rte_memory_get_nchannel();
> +	ms->nrank = rte_memory_get_nrank();
> +	ms->iova = iova;
> +	ms->socket_id = socket_id;
> +
> +	return 0;
> +
> +error:
> +	/* Only jump here when `addr` and `alloc_sz` are valid. */
> +	eal_mem_decommit(addr, alloc_sz);
> +	return -1;
> +}
> +
> +static int
> +free_seg(struct rte_memseg *ms)
> +{
> +	if (eal_mem_decommit(ms->addr, ms->len))
> +		return -1;
> +
> +	/* Must clear the segment, because alloc_seg() inspects it. */
> +	memset(ms, 0, sizeof(*ms));
> +	return 0;
> +}
> +
> +struct alloc_walk_param {
> +	struct hugepage_info *hi;
> +	struct rte_memseg **ms;
> +	size_t page_sz;
> +	unsigned int segs_allocated;
> +	unsigned int n_segs;
> +	int socket;
> +	bool exact;
> +};
> +
> +static int
> +alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	struct alloc_walk_param *wa = arg;
> +	struct rte_memseg_list *cur_msl;
> +	size_t page_sz;
> +	int cur_idx, start_idx, j;
> +	unsigned int msl_idx, need, i;
> +
> +	if (msl->page_sz != wa->page_sz)
> +		return 0;
> +	if (msl->socket_id != wa->socket)
> +		return 0;
> +
> +	page_sz = (size_t)msl->page_sz;
> +
> +	msl_idx = msl - mcfg->memsegs;
> +	cur_msl = &mcfg->memsegs[msl_idx];
> +
> +	need = wa->n_segs;
> +
> +	/* try finding space in memseg list */
> +	if (wa->exact) {
> +		/* if we require exact number of pages in a list, find them */
> +		cur_idx = rte_fbarray_find_next_n_free(
> +			&cur_msl->memseg_arr, 0, need);
> +		if (cur_idx < 0)
> +			return 0;
> +		start_idx = cur_idx;
> +	} else {
> +		int cur_len;
> +
> +		/* we don't require exact number of pages, so we're going to go
> +		 * for best-effort allocation. that means finding the biggest
> +		 * unused block, and going with that.
> +		 */
> +		cur_idx = rte_fbarray_find_biggest_free(
> +			&cur_msl->memseg_arr, 0);
> +		if (cur_idx < 0)
> +			return 0;
> +		start_idx = cur_idx;
> +		/* adjust the size to possibly be smaller than original
> +		 * request, but do not allow it to be bigger.
> +		 */
> +		cur_len = rte_fbarray_find_contig_free(
> +			&cur_msl->memseg_arr, cur_idx);
> +		need = RTE_MIN(need, (unsigned int)cur_len);
> +	}
> +
> +	for (i = 0; i < need; i++, cur_idx++) {
> +		struct rte_memseg *cur;
> +		void *map_addr;
> +
> +		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
> +		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
> +
> +		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
> +			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
> +				"but only %i were allocated\n", need, i);
> +
> +			/* if exact number wasn't requested, stop */
> +			if (!wa->exact)
> +				goto out;
> +
> +			/* clean up */
> +			for (j = start_idx; j < cur_idx; j++) {
> +				struct rte_memseg *tmp;
> +				struct rte_fbarray *arr = &cur_msl->memseg_arr;
> +
> +				tmp = rte_fbarray_get(arr, j);
> +				rte_fbarray_set_free(arr, j);
> +
> +				if (free_seg(tmp))
> +					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
> +			}
> +			/* clear the list */
> +			if (wa->ms)
> +				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
> +
> +			return -1;
> +		}
> +		if (wa->ms)
> +			wa->ms[i] = cur;
> +
> +		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
> +	}
> +
> +out:
> +	wa->segs_allocated = i;
> +	if (i > 0)
> +		cur_msl->version++;
> +
> +	/* if we didn't allocate any segments, move on to the next list */
> +	return i > 0;
> +}
> +
> +struct free_walk_param {
> +	struct hugepage_info *hi;
> +	struct rte_memseg *ms;
> +};
> +static int
> +free_seg_walk(const struct rte_memseg_list *msl, void *arg)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	struct rte_memseg_list *found_msl;
> +	struct free_walk_param *wa = arg;
> +	uintptr_t start_addr, end_addr;
> +	int msl_idx, seg_idx, ret;
> +
> +	start_addr = (uintptr_t) msl->base_va;
> +	end_addr = start_addr + msl->len;
> +
> +	if ((uintptr_t)wa->ms->addr < start_addr ||
> +		(uintptr_t)wa->ms->addr >= end_addr)
> +		return 0;
> +
> +	msl_idx = msl - mcfg->memsegs;
> +	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
> +
> +	/* msl is const */
> +	found_msl = &mcfg->memsegs[msl_idx];
> +	found_msl->version++;
> +
> +	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
> +
> +	ret = free_seg(wa->ms);
> +
> +	return (ret < 0) ? (-1) : 1;
> +}
> +
> +int
> +eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
> +		size_t page_sz, int socket, bool exact)
> +{
> +	unsigned int i;
> +	int ret = -1;
> +	struct alloc_walk_param wa;
> +	struct hugepage_info *hi = NULL;
> +
> +	if (internal_config.legacy_mem) {
> +		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
> +		return -ENOTSUP;
> +	}
> +
> +	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
> +		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
> +		if (page_sz == hpi->hugepage_sz) {
> +			hi = hpi;
> +			break;
> +		}
> +	}
> +	if (!hi) {
> +		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
> +		return -1;
> +	}
> +
> +	memset(&wa, 0, sizeof(wa));
> +	wa.exact = exact;
> +	wa.hi = hi;
> +	wa.ms = ms;
> +	wa.n_segs = n_segs;
> +	wa.page_sz = page_sz;
> +	wa.socket = socket;
> +	wa.segs_allocated = 0;
> +
> +	/* memalloc is locked, so it's safe to use thread-unsafe version */
> +	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
> +	if (ret == 0) {
> +		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
> +		ret = -1;
> +	} else if (ret > 0) {
> +		ret = (int)wa.segs_allocated;
> +	}
> +
> +	return ret;
> +}
> +
> +struct rte_memseg *
> +eal_memalloc_alloc_seg(size_t page_sz, int socket)
> +{
> +	struct rte_memseg *ms = NULL;
> +	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
> +	return ms;
> +}
> +
> +int
> +eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
> +{
> +	int seg, ret = 0;
> +
> +	/* dynamic free not supported in legacy mode */
> +	if (internal_config.legacy_mem)
> +		return -1;
> +
> +	for (seg = 0; seg < n_segs; seg++) {
> +		struct rte_memseg *cur = ms[seg];
> +		struct hugepage_info *hi = NULL;
> +		struct free_walk_param wa;
> +		size_t i;
> +		int walk_res;
> +
> +		/* if this page is marked as unfreeable, fail */
> +		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
> +			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
> +			ret = -1;
> +			continue;
> +		}
> +
> +		memset(&wa, 0, sizeof(wa));
> +
> +		for (i = 0; i < RTE_DIM(internal_config.hugepage_info);
> +				i++) {
> +			hi = &internal_config.hugepage_info[i];
> +			if (cur->hugepage_sz == hi->hugepage_sz)
> +				break;
> +		}
> +		if (i == RTE_DIM(internal_config.hugepage_info)) {
> +			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
> +			ret = -1;
> +			continue;
> +		}
> +
> +		wa.ms = cur;
> +		wa.hi = hi;
> +
> +		/* memalloc is locked, so it's safe to use thread-unsafe version
> +		 */
> +		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
> +				&wa);
> +		if (walk_res == 1)
> +			continue;
> +		if (walk_res == 0)
> +			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
> +		ret = -1;
> +	}
> +	return ret;
> +}
> +
> +int
> +eal_memalloc_free_seg(struct rte_memseg *ms)
> +{
> +	return eal_memalloc_free_seg_bulk(&ms, 1);
> +}
> +
> +int
> +eal_memalloc_sync_with_primary(void)
> +{
> +	/* No multi-process support. */
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +eal_memalloc_init(void)
> +{
> +	/* No action required. */
> +	return 0;
> +}
> diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
> new file mode 100644
> index 000000000..3812b7c67
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_memory.c
> @@ -0,0 +1,1155 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2010-2014 Intel Corporation (functions from Linux EAL)
> + * Copyright (c) 2020 Dmitry Kozlyuk (Windows specifics)
> + */
> +
> +#include <inttypes.h>
> +#include <io.h>
> +
> +#include <rte_errno.h>
> +#include <rte_memory.h>
> +
> +#include "eal_internal_cfg.h"
> +#include "eal_memalloc.h"
> +#include "eal_memcfg.h"
> +#include "eal_options.h"
> +#include "eal_private.h"
> +#include "eal_windows.h"
> +
> +#include <rte_virt2phys.h>
> +
> +/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
> + * Provide a copy of definitions and code to load it dynamically.
> + * Note: definitions are copied verbatim from Microsoft documentation
> + * and don't follow DPDK code style.
> + *
> + * MEM_RESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
> + */
> +#ifndef MEM_PRESERVE_PLACEHOLDER
> +
> +/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
> +typedef enum MEM_EXTENDED_PARAMETER_TYPE {
> +	MemExtendedParameterInvalidType,
> +	MemExtendedParameterAddressRequirements,
> +	MemExtendedParameterNumaNode,
> +	MemExtendedParameterPartitionHandle,
> +	MemExtendedParameterMax,
> +	MemExtendedParameterUserPhysicalHandle,
> +	MemExtendedParameterAttributeFlags
> +} *PMEM_EXTENDED_PARAMETER_TYPE;
> +
> +#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
> +
> +/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
> +typedef struct MEM_EXTENDED_PARAMETER {
> +	struct {
> +		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
> +		DWORD64 Reserved : 64 -
> MEM_EXTENDED_PARAMETER_TYPE_BITS;
> +	} DUMMYSTRUCTNAME;
> +	union {
> +		DWORD64 ULong64;
> +		PVOID   Pointer;
> +		SIZE_T  Size;
> +		HANDLE  Handle;
> +		DWORD   ULong;
> +	} DUMMYUNIONNAME;
> +} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
> +
> +/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
> +typedef PVOID (*VirtualAlloc2_type)(
> +	HANDLE                 Process,
> +	PVOID                  BaseAddress,
> +	SIZE_T                 Size,
> +	ULONG                  AllocationType,
> +	ULONG                  PageProtection,
> +	MEM_EXTENDED_PARAMETER *ExtendedParameters,
> +	ULONG                  ParameterCount
> +);
> +
> +/* VirtualAlloc2() flags. */
> +#define MEM_COALESCE_PLACEHOLDERS 0x00000001
> +#define MEM_PRESERVE_PLACEHOLDER  0x00000002
> +#define MEM_REPLACE_PLACEHOLDER   0x00004000
> +#define MEM_RESERVE_PLACEHOLDER   0x00040000
> +
> +/* Named exactly as the function, so that user code does not depend
> + * on it being found at compile time or dynamically.
> + */
> +static VirtualAlloc2_type VirtualAlloc2;
> +
> +int
> +eal_mem_win32api_init(void)
> +{
> +	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
> +	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
> +	 */
> +	static const char library_name[] = "kernelbase.dll";
> +	static const char function[] = "VirtualAlloc2";
> +
> +	HMODULE library = NULL;
> +	int ret = 0;
> +
> +	/* Already done. */
> +	if (VirtualAlloc2 != NULL)
> +		return 0;
> +
> +	library = LoadLibraryA(library_name);
> +	if (library == NULL) {
> +		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
> +		return -1;
> +	}
> +
> +	VirtualAlloc2 = (VirtualAlloc2_type)(
> +		(void *)GetProcAddress(library, function));
> +	if (VirtualAlloc2 == NULL) {
> +		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
> +			library_name, function);
> +
> +		/* Contrary to the docs, Server 2016 is not supported. */
> +		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
> +			"is required for memory management\n");
> +		ret = -1;
> +	}
> +
> +	FreeLibrary(library);
> +
> +	return ret;
> +}
> +
> +#else
> +
> +/* Stub in case VirtualAlloc2() is provided by the compiler. */
> +int
> +eal_mem_win32api_init(void)
> +{
> +	return 0;
> +}
> +
> +#endif /* defined(MEM_RESERVE_PLACEHOLDER) */
> +
> +static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
> +
> +int
> +eal_mem_virt2iova_init(void)
> +{
> +	HDEVINFO list = INVALID_HANDLE_VALUE;
> +	SP_DEVICE_INTERFACE_DATA ifdata;
> +	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
> +	DWORD detail_size;
> +	int ret = -1;
> +
> +	list = SetupDiGetClassDevs(
> +		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
> +		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
> +	if (list == INVALID_HANDLE_VALUE) {
> +		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
> +		goto exit;
> +	}
> +
> +	ifdata.cbSize = sizeof(ifdata);
> +	if (!SetupDiEnumDeviceInterfaces(
> +		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
> +		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
> +		goto exit;
> +	}
> +
> +	if (!SetupDiGetDeviceInterfaceDetail(
> +		list, &ifdata, NULL, 0, &detail_size, NULL)) {
> +		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
> +			RTE_LOG_WIN32_ERR(
> +				"SetupDiGetDeviceInterfaceDetail(probe)");
> +			goto exit;
> +		}
> +	}
> +
> +	detail = malloc(detail_size);
> +	if (detail == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
> +			"device interface detail data\n");
> +		goto exit;
> +	}
> +
> +	detail->cbSize = sizeof(*detail);
> +	if (!SetupDiGetDeviceInterfaceDetail(
> +		list, &ifdata, detail, detail_size, NULL, NULL)) {
> +		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
> +		goto exit;
> +	}
> +
> +	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
> +
> +	virt2phys_device = CreateFile(
> +		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
> +	if (virt2phys_device == INVALID_HANDLE_VALUE) {
> +		RTE_LOG_WIN32_ERR("CreateFile()");
> +		goto exit;
> +	}
> +
> +	/* Indicate success. */
> +	ret = 0;
> +
> +exit:
> +	if (detail != NULL)
> +		free(detail);
> +	if (list != INVALID_HANDLE_VALUE)
> +		SetupDiDestroyDeviceInfoList(list);
> +
> +	return ret;
> +}
> +
> +phys_addr_t
> +rte_mem_virt2phy(const void *virt)
> +{
> +	LARGE_INTEGER phys;
> +	DWORD bytes_returned;
> +
> +	if (virt2phys_device == INVALID_HANDLE_VALUE)
> +		return RTE_BAD_PHYS_ADDR;
> +
> +	if (!DeviceIoControl(
> +			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
> +			&virt, sizeof(virt), &phys, sizeof(phys),
> +			&bytes_returned, NULL)) {
> +		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
> +		return RTE_BAD_PHYS_ADDR;
> +	}
> +
> +	return phys.QuadPart;
> +}
> +
> +/* Windows currently only supports IOVA as PA. */
> +rte_iova_t
> +rte_mem_virt2iova(const void *virt)
> +{
> +	phys_addr_t phys;
> +
> +	if (virt2phys_device == INVALID_HANDLE_VALUE)
> +		return RTE_BAD_IOVA;
> +
> +	phys = rte_mem_virt2phy(virt);
> +	if (phys == RTE_BAD_PHYS_ADDR)
> +		return RTE_BAD_IOVA;
> +
> +	return (rte_iova_t)phys;
> +}
> +
> +/* Always using physical addresses under Windows if they can be obtained. */
> +int
> +rte_eal_using_phys_addrs(void)
> +{
> +	return virt2phys_device != INVALID_HANDLE_VALUE;
> +}
> +
> +/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
> +static void
> +set_errno_from_win32_alloc_error(DWORD code)
> +{
> +	switch (code) {
> +	case ERROR_SUCCESS:
> +		rte_errno = 0;
> +		break;
> +
> +	case ERROR_INVALID_ADDRESS:
> +		/* A valid requested address is not available. */
> +	case ERROR_COMMITMENT_LIMIT:
> +		/* May occur when committing regular memory. */
> +	case ERROR_NO_SYSTEM_RESOURCES:
> +		/* Occurs when the system runs out of hugepages. */
> +		rte_errno = ENOMEM;
> +		break;
> +
> +	case ERROR_INVALID_PARAMETER:
> +	default:
> +		rte_errno = EINVAL;
> +		break;
> +	}
> +}
> +
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags)
> +{
> +	void *virt;
> +
> +	/* Windows requires hugepages to be committed. */
> +	if (flags & EAL_RESERVE_HUGEPAGES) {
> +		rte_errno = ENOTSUP;
> +		return NULL;
> +	}
> +
> +	virt = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
> +		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
> +		NULL, 0);
> +	if (virt == NULL) {
> +		DWORD err = GetLastError();
> +		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
> +		set_errno_from_win32_alloc_error(err);
> +	}
> +
> +	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
> +		if (!VirtualFree(virt, 0, MEM_RELEASE))
> +			RTE_LOG_WIN32_ERR("VirtualFree()");
> +		rte_errno = ENOMEM;
> +		return NULL;
> +	}
> +
> +	return virt;
> +}
> +
> +void *
> +eal_mem_alloc(size_t size, size_t page_size)
> +{
> +	if (page_size != 0)
> +		return eal_mem_alloc_socket(size, SOCKET_ID_ANY);
> +
> +	return VirtualAlloc(
> +		NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
> +}
> +
> +void *
> +eal_mem_alloc_socket(size_t size, int socket_id)
> +{
> +	DWORD flags = MEM_RESERVE | MEM_COMMIT;
> +	void *addr;
> +
> +	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
> +	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
> +		PAGE_READWRITE, eal_socket_numa_node(socket_id));
> +	if (addr == NULL)
> +		rte_errno = ENOMEM;
> +	return addr;
> +}
> +
> +void*
> +eal_mem_commit(void *requested_addr, size_t size, int socket_id)
> +{
> +	MEM_EXTENDED_PARAMETER param;
> +	DWORD param_count = 0;
> +	DWORD flags;
> +	void *addr;
> +
> +	if (requested_addr != NULL) {
> +		MEMORY_BASIC_INFORMATION info;
> +		if (VirtualQuery(requested_addr, &info, sizeof(info)) == 0) {
> +			RTE_LOG_WIN32_ERR("VirtualQuery()");
> +			return NULL;
> +		}
> +
> +		/* Split reserved region if only a part is committed. */
> +		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
> +		if ((info.RegionSize > size) &&
> +			!VirtualFree(requested_addr, size, flags)) {
> +			RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
> +				"<split placeholder>)", requested_addr, size);
> +			return NULL;
> +		}
> +	}
> +
> +	if (socket_id != SOCKET_ID_ANY) {
> +		param_count = 1;
> +		memset(&param, 0, sizeof(param));
> +		param.Type = MemExtendedParameterNumaNode;
> +		param.ULong = eal_socket_numa_node(socket_id);
> +	}
> +
> +	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
> +	if (requested_addr != NULL)
> +		flags |= MEM_REPLACE_PLACEHOLDER;
> +
> +	addr = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
> +		flags, PAGE_READWRITE, &param, param_count);
> +	if (addr == NULL) {
> +		DWORD err = GetLastError();
> +		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, "
> +			"<replace placeholder>)", addr, size);
> +		set_errno_from_win32_alloc_error(err);
> +		return NULL;
> +	}
> +
> +	return addr;
> +}
> +
> +int
> +eal_mem_decommit(void *addr, size_t size)
> +{
> +	/* Decommit memory, which might be a part of a larger reserved region.
> +	 * Allocator commits hugepage-sized placeholders, so there's no need
> +	 * to coalesce placeholders back into region, they can be reused as is.
> +	 */
> +	if (!VirtualFree(addr, size, MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
> +		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, ...)", addr, size);
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Free a reserved memory region in full or in part.
> + *
> + * @param addr
> + *  Starting address of the area to free.
> + * @param size
> + *  Number of bytes to free. Must be a multiple of page size.
> + * @param reserved
> + *  Fail if the region is not in reserved state.
> + * @return
> + *  * 0 on successful deallocation;
> + *  * 1 if region must be in reserved state but it is not;
> + *  * (-1) on system API failures.
> + */
> +static int
> +mem_free(void *addr, size_t size, bool reserved)
> +{
> +	MEMORY_BASIC_INFORMATION info;
> +	HANDLE process;
> +
> +	if (VirtualQuery(addr, &info, sizeof(info)) == 0) {
> +		RTE_LOG_WIN32_ERR("VirtualQuery()");
> +		return -1;
> +	}
> +
> +	if (reserved && (info.State != MEM_RESERVE))
> +		return 1;
> +
> +	process = GetCurrentProcess();
> +
> +	/* Free complete region. */
> +	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
> +		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
> +			RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)",
> +				addr);
> +		}
> +		return 0;
> +	}
> +
> +	/* Split the part to be freed and the remaining reservation. */
> +	if (!VirtualFreeEx(process, addr, size,
> +			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
> +		RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
> +			"MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)", addr, size);
> +		return -1;
> +	}
> +
> +	/* Actually free reservation part. */
> +	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
> +		RTE_LOG_WIN32_ERR("VirtualFree(%p, 0, MEM_RELEASE)", addr);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +eal_mem_free(void *virt, size_t size)
> +{
> +	mem_free(virt, size, false);
> +}
> +
> +int
> +eal_mem_set_dump(void *virt, size_t size, bool dump)
> +{
> +	RTE_SET_USED(virt);
> +	RTE_SET_USED(size);
> +	RTE_SET_USED(dump);
> +
> +	/* Windows does not dump reserved memory by default.
> +	 *
> +	 * There is <werapi.h> to include or exclude regions from the dump,
> +	 * but this is not currently required by EAL.
> +	 */
> +
> +	rte_errno = ENOTSUP;
> +	return -1;
> +}
> +
> +void *
> +rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	HANDLE file_handle = INVALID_HANDLE_VALUE;
> +	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
> +	DWORD sys_prot = 0;
> +	DWORD sys_access = 0;
> +	DWORD size_high = (DWORD)(size >> 32);
> +	DWORD size_low = (DWORD)size;
> +	DWORD offset_high = (DWORD)(offset >> 32);
> +	DWORD offset_low = (DWORD)offset;
> +	LPVOID virt = NULL;
> +
> +	if (prot & RTE_PROT_EXECUTE) {
> +		if (prot & RTE_PROT_READ) {
> +			sys_prot = PAGE_EXECUTE_READ;
> +			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
> +		}
> +		if (prot & RTE_PROT_WRITE) {
> +			sys_prot = PAGE_EXECUTE_READWRITE;
> +			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
> +		}
> +	} else {
> +		if (prot & RTE_PROT_READ) {
> +			sys_prot = PAGE_READONLY;
> +			sys_access = FILE_MAP_READ;
> +		}
> +		if (prot & RTE_PROT_WRITE) {
> +			sys_prot = PAGE_READWRITE;
> +			sys_access = FILE_MAP_WRITE;
> +		}
> +	}
> +
> +	if (flags & RTE_MAP_PRIVATE)
> +		sys_access |= FILE_MAP_COPY;
> +
> +	if ((flags & RTE_MAP_ANONYMOUS) == 0)
> +		file_handle = (HANDLE)_get_osfhandle(fd);
> +
> +	mapping_handle = CreateFileMapping(
> +		file_handle, NULL, sys_prot, size_high, size_low, NULL);
> +	if (mapping_handle == INVALID_HANDLE_VALUE) {
> +		RTE_LOG_WIN32_ERR("CreateFileMapping()");
> +		return NULL;
> +	}
> +
> +	/* There is a race for the requested_addr between mem_free()
> +	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
> +	 * region with a mapping in a single operation, but it does not support
> +	 * private mappings.
> +	 */
> +	if (requested_addr != NULL) {
> +		int ret = mem_free(requested_addr, size, true);
> +		if (ret) {
> +			if (ret > 0) {
> +				RTE_LOG(ERR, EAL, "Cannot map memory "
> +					"to a region not reserved\n");
> +				rte_errno = EADDRNOTAVAIL;
> +			}
> +			return NULL;
> +		}
> +	}
> +
> +	virt = MapViewOfFileEx(mapping_handle, sys_access,
> +		offset_high, offset_low, size, requested_addr);
> +	if (!virt) {
> +		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
> +		return NULL;
> +	}
> +
> +	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
> +		if (!UnmapViewOfFile(virt))
> +			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
> +		virt = NULL;
> +	}
> +
> +	if (!CloseHandle(mapping_handle))
> +		RTE_LOG_WIN32_ERR("CloseHandle()");
> +
> +	return virt;
> +}
> +
> +int
> +rte_mem_unmap(void *virt, size_t size)
> +{
> +	RTE_SET_USED(size);
> +
> +	if (!UnmapViewOfFile(virt)) {
> +		rte_errno = GetLastError();
> +		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +uint64_t
> +eal_get_baseaddr(void)
> +{
> +	/* Windows strategy for memory allocation is undocumented.
> +	 * Returning 0 here effectively disables address guessing
> +	 * unless user provides an address hint.
> +	 */
> +	return 0;
> +}
> +
> +size_t
> +rte_get_page_size(void)
> +{
> +	SYSTEM_INFO info;
> +	GetSystemInfo(&info);
> +	return info.dwPageSize;
> +}
> +
> +int
> +rte_mem_lock(const void *virt, size_t size)
> +{
> +	/* VirtualLock() takes `void*`, work around compiler warning. */
> +	void *addr = (void *)((uintptr_t)virt);
> +
> +	if (!VirtualLock(addr, size)) {
> +		RTE_LOG_WIN32_ERR("VirtualLock()");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
> +		int n_segs, int socket_id, int type_msl_idx)
> +{
> +	return eal_memseg_list_init(
> +		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
> +}
> +
> +static int
> +memseg_list_alloc(struct rte_memseg_list *msl)
> +{
> +	return eal_memseg_list_alloc(msl, 0);
> +}
> +
> +/*
> + * Remaining code in this file largely duplicates Linux EAL.
> + * Although Windows EAL supports only one hugepage size currently,
> + * code structure and comments are preserved so that changes may be
> + * easily ported until duplication is removed.
> + */
> +
> +static int
> +memseg_primary_init(void)
> +{
> +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> +	struct memtype {
> +		uint64_t page_sz;
> +		int socket_id;
> +	} *memtypes = NULL;
> +	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
> +	struct rte_memseg_list *msl;
> +	uint64_t max_mem, max_mem_per_type;
> +	unsigned int max_seglists_per_type;
> +	unsigned int n_memtypes, cur_type;
> +
> +	/* no-huge does not need this at all */
> +	if (internal_config.no_hugetlbfs)
> +		return 0;
> +
> +	/*
> +	 * figuring out amount of memory we're going to have is a long and very
> +	 * involved process. the basic element we're operating with is a memory
> +	 * type, defined as a combination of NUMA node ID and page size (so that
> +	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
> +	 *
> +	 * deciding amount of memory going towards each memory type is a
> +	 * balancing act between maximum segments per type, maximum memory per
> +	 * type, and number of detected NUMA nodes. the goal is to make sure
> +	 * each memory type gets at least one memseg list.
> +	 *
> +	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
> +	 *
> +	 * the total amount of memory per type is limited by either
> +	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
> +	 * of detected NUMA nodes. additionally, maximum number of segments per
> +	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
> +	 * smaller page sizes, it can take hundreds of thousands of segments to
> +	 * reach the above specified per-type memory limits.
> +	 *
> +	 * additionally, each type may have multiple memseg lists associated
> +	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
> +	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
> +	 *
> +	 * the number of memseg lists per type is decided based on the above
> +	 * limits, and also taking number of detected NUMA nodes, to make sure
> +	 * that we don't run out of memseg lists before we populate all NUMA
> +	 * nodes with memory.
> +	 *
> +	 * we do this in three stages. first, we collect the number of types.
> +	 * then, we figure out memory constraints and populate the list of
> +	 * would-be memseg lists. then, we go ahead and allocate the memseg
> +	 * lists.
> +	 */
> +
> +	/* create space for mem types */
> +	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
> +	memtypes = calloc(n_memtypes, sizeof(*memtypes));
> +	if (memtypes == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
> +		return -1;
> +	}
> +
> +	/* populate mem types */
> +	cur_type = 0;
> +	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
> +			hpi_idx++) {
> +		struct hugepage_info *hpi;
> +		uint64_t hugepage_sz;
> +
> +		hpi = &internal_config.hugepage_info[hpi_idx];
> +		hugepage_sz = hpi->hugepage_sz;
> +
> +		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
> +			int socket_id = rte_socket_id_by_idx(i);
> +
> +			memtypes[cur_type].page_sz = hugepage_sz;
> +			memtypes[cur_type].socket_id = socket_id;
> +
> +			RTE_LOG(DEBUG, EAL, "Detected memory type: "
> +				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
> +				socket_id, hugepage_sz);
> +		}
> +	}
> +	/* number of memtypes could have been lower due to no NUMA support */
> +	n_memtypes = cur_type;
> +
> +	/* set up limits for types */
> +	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
> +	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
> +			max_mem / n_memtypes);
> +
> +	/*
> +	 * limit maximum number of segment lists per type to ensure there's
> +	 * space for memseg lists for all NUMA nodes with all page sizes
> +	 */
> +	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
> +
> +	if (max_seglists_per_type == 0) {
> +		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
> +			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> +		goto out;
> +	}
> +
> +	/* go through all mem types and create segment lists */
> +	msl_idx = 0;
> +	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
> +		unsigned int cur_seglist, n_seglists, n_segs;
> +		unsigned int max_segs_per_type, max_segs_per_list;
> +		struct memtype *type = &memtypes[cur_type];
> +		uint64_t max_mem_per_list, pagesz;
> +		int socket_id;
> +
> +		pagesz = type->page_sz;
> +		socket_id = type->socket_id;
> +
> +		/*
> +		 * we need to create segment lists for this type. we must take
> +		 * into account the following things:
> +		 *
> +		 * 1. total amount of memory we can use for this memory type
> +		 * 2. total amount of memory per memseg list allowed
> +		 * 3. number of segments needed to fit the amount of memory
> +		 * 4. number of segments allowed per type
> +		 * 5. number of segments allowed per memseg list
> +		 * 6. number of memseg lists we are allowed to take up
> +		 */
> +
> +		/* calculate how many segments we will need in total */
> +		max_segs_per_type = max_mem_per_type / pagesz;
> +		/* limit number of segments to maximum allowed per type */
> +		max_segs_per_type = RTE_MIN(max_segs_per_type,
> +				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
> +		/* limit number of segments to maximum allowed per list */
> +		max_segs_per_list = RTE_MIN(max_segs_per_type,
> +				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
> +
> +		/* calculate how much memory we can have per segment list */
> +		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
> +				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
> +
> +		/* calculate how many segments each segment list will have */
> +		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
> +
> +		/* calculate how many segment lists we can have */
> +		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
> +				max_mem_per_type / max_mem_per_list);
> +
> +		/* limit number of segment lists according to our maximum */
> +		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
> +
> +		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
> +				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
> +			n_seglists, n_segs, socket_id, pagesz);
> +
> +		/* create all segment lists */
> +		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
> +			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
> +				RTE_LOG(ERR, EAL,
> +					"No more space in memseg lists, please increase %s\n",
> +					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> +				goto out;
> +			}
> +			msl = &mcfg->memsegs[msl_idx++];
> +
> +			if (memseg_list_init(msl, pagesz, n_segs,
> +					socket_id, cur_seglist))
> +				goto out;
> +
> +			if (memseg_list_alloc(msl)) {
> +				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
> +				goto out;
> +			}
> +		}
> +	}
> +	/* we're successful */
> +	ret = 0;
> +out:
> +	free(memtypes);
> +	return ret;
> +}
> +
> +static int
> +memseg_secondary_init(void)
> +{
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_eal_memseg_init(void)
> +{
> +	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
> +		return memseg_primary_init();
> +	return memseg_secondary_init();
> +}
> +
> +static inline uint64_t
> +get_socket_mem_size(int socket)
> +{
> +	uint64_t size = 0;
> +	unsigned int i;
> +
> +	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
> +		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
> +		size += hpi->hugepage_sz * hpi->num_pages[socket];
> +	}
> +
> +	return size;
> +}
> +
> +static int
> +calc_num_pages_per_socket(uint64_t *memory,
> +		struct hugepage_info *hp_info,
> +		struct hugepage_info *hp_used,
> +		unsigned int num_hp_info)
> +{
> +	unsigned int socket, j, i = 0;
> +	unsigned int requested, available;
> +	int total_num_pages = 0;
> +	uint64_t remaining_mem, cur_mem;
> +	uint64_t total_mem = internal_config.memory;
> +
> +	if (num_hp_info == 0)
> +		return -1;
> +
> +	/* if specific memory amounts per socket weren't requested */
> +	if (internal_config.force_sockets == 0) {
> +		size_t total_size;
> +		int cpu_per_socket[RTE_MAX_NUMA_NODES];
> +		size_t default_size;
> +		unsigned int lcore_id;
> +
> +		/* Compute number of cores per socket */
> +		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
> +		RTE_LCORE_FOREACH(lcore_id) {
> +			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
> +		}
> +
> +		/*
> +		 * Automatically spread requested memory amongst detected
> +		 * sockets according to number of cores from cpu mask present
> +		 * on each socket.
> +		 */
> +		total_size = internal_config.memory;
> +		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
> +				socket++) {
> +
> +			/* Set memory amount per socket */
> +			default_size = internal_config.memory *
> +				cpu_per_socket[socket] / rte_lcore_count();
> +
> +			/* Limit to maximum available memory on socket */
> +			default_size = RTE_MIN(
> +				default_size, get_socket_mem_size(socket));
> +
> +			/* Update sizes */
> +			memory[socket] = default_size;
> +			total_size -= default_size;
> +		}
> +
> +		/*
> +		 * If some memory is remaining, try to allocate it by getting
> +		 * all available memory from sockets, one after the other.
> +		 */
> +		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
> +				socket++) {
> +			/* take whatever is available */
> +			default_size = RTE_MIN(
> +				get_socket_mem_size(socket) - memory[socket],
> +				total_size);
> +
> +			/* Update sizes */
> +			memory[socket] += default_size;
> +			total_size -= default_size;
> +		}
> +	}
> +
> +	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
> +			socket++) {
> +		/* skip if memory on this specific socket wasn't requested */
> +		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
> +			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
> +				sizeof(hp_used[i].hugedir));
> +			hp_used[i].num_pages[socket] = RTE_MIN(
> +					memory[socket] / hp_info[i].hugepage_sz,
> +					hp_info[i].num_pages[socket]);
> +
> +			cur_mem = hp_used[i].num_pages[socket] *
> +					hp_used[i].hugepage_sz;
> +
> +			memory[socket] -= cur_mem;
> +			total_mem -= cur_mem;
> +
> +			total_num_pages += hp_used[i].num_pages[socket];
> +
> +			/* check if we have met all memory requests */
> +			if (memory[socket] == 0)
> +				break;
> +
> +			/* Check if we have any more pages left at this size,
> +			 * if so, move on to next size.
> +			 */
> +			if (hp_used[i].num_pages[socket] ==
> +					hp_info[i].num_pages[socket])
> +				continue;
> +
> +			/* At this point we know that there are more pages
> +			 * available that are bigger than the memory we want,
> +			 * so let's see if we can get enough from other page
> +			 * sizes.
> +			 */
> +			remaining_mem = 0;
> +			for (j = i+1; j < num_hp_info; j++)
> +				remaining_mem += hp_info[j].hugepage_sz *
> +					hp_info[j].num_pages[socket];
> +
> +			/* Is there enough other memory?
> +			 * If not, allocate another page and quit.
> +			 */
> +			if (remaining_mem < memory[socket]) {
> +				cur_mem = RTE_MIN(memory[socket],
> +					hp_info[i].hugepage_sz);
> +				memory[socket] -= cur_mem;
> +				total_mem -= cur_mem;
> +				hp_used[i].num_pages[socket]++;
> +				total_num_pages++;
> +				break; /* we are done with this socket */
> +			}
> +		}
> +		/* if we didn't satisfy all memory requirements per socket */
> +		if (memory[socket] > 0 &&
> +				internal_config.socket_mem[socket] != 0) {
> +			/* to prevent icc errors */
> +			requested = (unsigned int)(
> +				internal_config.socket_mem[socket] / 0x100000);
> +			available = requested -
> +				((unsigned int)(memory[socket] / 0x100000));
> +			RTE_LOG(ERR, EAL, "Not enough memory available on "
> +				"socket %u! Requested: %uMB, available: %uMB\n",
> +				socket, requested, available);
> +			return -1;
> +		}
> +	}
> +
> +	/* if we didn't satisfy total memory requirements */
> +	if (total_mem > 0) {
> +		requested = (unsigned int)(internal_config.memory / 0x100000);
> +		available = requested - (unsigned int)(total_mem / 0x100000);
> +		RTE_LOG(ERR, EAL, "Not enough memory available! "
> +			"Requested: %uMB, available: %uMB\n",
> +			requested, available);
> +		return -1;
> +	}
> +	return total_num_pages;
> +}
> +
> +/* Limit is checked by the validator itself, nothing left to analyze. */
> +static int
> +limits_callback(int socket_id, size_t cur_limit, size_t new_len)
> +{
> +	RTE_SET_USED(socket_id);
> +	RTE_SET_USED(cur_limit);
> +	RTE_SET_USED(new_len);
> +	return -1;
> +}
> +
> +static int
> +eal_hugepage_init(void)
> +{
> +	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
> +	uint64_t memory[RTE_MAX_NUMA_NODES];
> +	int hp_sz_idx, socket_id;
> +
> +	memset(used_hp, 0, sizeof(used_hp));
> +
> +	for (hp_sz_idx = 0;
> +			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
> +			hp_sz_idx++) {
> +		/* also initialize the hugepage sizes in used_hp */
> +		struct hugepage_info *hpi;
> +		hpi = &internal_config.hugepage_info[hp_sz_idx];
> +		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
> +	}
> +
> +	/* make a copy of socket_mem, needed for balanced allocation. */
> +	for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; socket_id++)
> +		memory[socket_id] = internal_config.socket_mem[socket_id];
> +
> +	/* calculate final number of pages */
> +	if (calc_num_pages_per_socket(memory,
> +			internal_config.hugepage_info, used_hp,
> +			internal_config.num_hugepage_sizes) < 0)
> +		return -1;
> +
> +	for (hp_sz_idx = 0;
> +			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
> +			hp_sz_idx++) {
> +		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
> +				socket_id++) {
> +			struct rte_memseg **pages;
> +			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
> +			unsigned int num_pages = hpi->num_pages[socket_id];
> +			unsigned int num_pages_alloc;
> +
> +			if (num_pages == 0)
> +				continue;
> +
> +			RTE_LOG(DEBUG, EAL,
> +				"Allocating %u pages of size %" PRIu64 "M on socket %i\n",
> +				num_pages, hpi->hugepage_sz >> 20, socket_id);
> +
> +			/* we may not be able to allocate all pages in one go,
> +			 * because we break up our memory map into multiple
> +			 * memseg lists. therefore, try allocating multiple
> +			 * times and see if we can get the desired number of
> +			 * pages from multiple allocations.
> +			 */
> +
> +			num_pages_alloc = 0;
> +			do {
> +				int i, cur_pages, needed;
> +
> +				needed = num_pages - num_pages_alloc;
> +
> +				pages = malloc(sizeof(*pages) * needed);
> +				if (pages == NULL)
> +					return -1;
> +
> +				/* do not request exact number of pages */
> +				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
> +						needed, hpi->hugepage_sz,
> +						socket_id, false);
> +				if (cur_pages <= 0) {
> +					free(pages);
> +					return -1;
> +				}
> +
> +				/* mark preallocated pages as unfreeable */
> +				for (i = 0; i < cur_pages; i++) {
> +					struct rte_memseg *ms = pages[i];
> +					ms->flags |=
> +						RTE_MEMSEG_FLAG_DO_NOT_FREE;
> +				}
> +				free(pages);
> +
> +				num_pages_alloc += cur_pages;
> +			} while (num_pages_alloc != num_pages);
> +		}
> +	}
> +	/* if socket limits were specified, set them */
> +	if (internal_config.force_socket_limits) {
> +		unsigned int i;
> +		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
> +			uint64_t limit = internal_config.socket_limit[i];
> +			if (limit == 0)
> +				continue;
> +			if (rte_mem_alloc_validator_register("socket-limit",
> +					limits_callback, i, limit))
> +				RTE_LOG(ERR, EAL, "Failed to register socket "
> +					"limits validator callback\n");
> +		}
> +	}
> +	return 0;
> +}
> +
> +static int
> +eal_nohuge_init(void)
> +{
> +	struct rte_mem_config *mcfg;
> +	struct rte_memseg_list *msl;
> +	int n_segs, cur_seg;
> +	uint64_t page_sz;
> +	void *addr;
> +	struct rte_fbarray *arr;
> +	struct rte_memseg *ms;
> +
> +	mcfg = rte_eal_get_configuration()->mem_config;
> +
> +	/* nohuge mode is legacy mode */
> +	internal_config.legacy_mem = 1;
> +
> +	/* create a memseg list */
> +	msl = &mcfg->memsegs[0];
> +
> +	page_sz = RTE_PGSIZE_4K;
> +	n_segs = internal_config.memory / page_sz;
> +
> +	if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
> +		sizeof(struct rte_memseg))) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
> +		return -1;
> +	}
> +
> +	addr = eal_mem_alloc(internal_config.memory, 0);
> +	if (addr == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate %zu bytes\n",
> +			internal_config.memory);
> +		return -1;
> +	}
> +
> +	msl->base_va = addr;
> +	msl->page_sz = page_sz;
> +	msl->socket_id = 0;
> +	msl->len = internal_config.memory;
> +	msl->heap = 1;
> +
> +	/* populate memsegs. each memseg is one page long */
> +	for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
> +		arr = &msl->memseg_arr;
> +
> +		ms = rte_fbarray_get(arr, cur_seg);
> +		ms->iova = RTE_BAD_IOVA;
> +		ms->addr = addr;
> +		ms->hugepage_sz = page_sz;
> +		ms->socket_id = 0;
> +		ms->len = page_sz;
> +
> +		rte_fbarray_set_used(arr, cur_seg);
> +
> +		addr = RTE_PTR_ADD(addr, (size_t)page_sz);
> +	}
> +
> +	if (mcfg->dma_maskbits &&
> +		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
> +		RTE_LOG(ERR, EAL,
> +			"%s(): couldn't allocate memory due to IOVA "
> +			"exceeding limits of current DMA mask.\n", __func__);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_eal_hugepage_init(void)
> +{
> +	return internal_config.no_hugetlbfs ?
> +		eal_nohuge_init() : eal_hugepage_init();
> +}
> +
> +int
> +rte_eal_hugepage_attach(void)
> +{
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
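The race noted at the top of eal_memory.c exists because the reservation must
be freed before MapViewOfFileEx() can map into it. For comparison, a minimal
sketch of the single-call alternative mentioned there, assuming Windows 10
1803 / Windows Server 2019 or later and a region previously reserved with
VirtualAlloc2(..., MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, ...); this is an
illustration, not part of the patch:

#include <windows.h>

/* Atomically replace a reserved placeholder with a view of "mapping":
 * there is no window in which another thread can claim the address.
 * As the comment above notes, this path cannot create private
 * (copy-on-write) mappings, which is why the patch falls back to
 * free-then-map for those.
 */
static void *
map_view_into_placeholder(HANDLE mapping, void *placeholder, size_t size)
{
	return MapViewOfFile3(mapping, GetCurrentProcess(),
		placeholder, 0, size, MEM_REPLACE_PLACEHOLDER,
		PAGE_READWRITE, NULL, 0);
}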
> diff --git a/lib/librte_eal/windows/eal_mp.c
> b/lib/librte_eal/windows/eal_mp.c
> new file mode 100644
> index 000000000..16a5e8ba0
> --- /dev/null
> +++ b/lib/librte_eal/windows/eal_mp.c
> @@ -0,0 +1,103 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +/**
> + * @file Multiprocess support stubs
> + *
> + * Stubs must log an error until implemented. If success is required
> + * for non-multiprocess operation, the stub must log a warning and a
> + * comment must document what requires success emulation.
> + */
> +
> +#include <rte_eal.h>
> +#include <rte_errno.h>
> +
> +#include "eal_private.h"
> +#include "eal_windows.h"
> +#include "malloc_mp.h"
> +
> +void
> +rte_mp_channel_cleanup(void)
> +{
> +	EAL_LOG_NOT_IMPLEMENTED();
> +}
> +
> +int
> +rte_mp_action_register(const char *name, rte_mp_t action)
> +{
> +	RTE_SET_USED(name);
> +	RTE_SET_USED(action);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +void
> +rte_mp_action_unregister(const char *name)
> +{
> +	RTE_SET_USED(name);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +}
> +
> +int
> +rte_mp_sendmsg(struct rte_mp_msg *msg)
> +{
> +	RTE_SET_USED(msg);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply
> *reply,
> +	const struct timespec *ts)
> +{
> +	RTE_SET_USED(req);
> +	RTE_SET_USED(reply);
> +	RTE_SET_USED(ts);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
> +		rte_mp_async_reply_t clb)
> +{
> +	RTE_SET_USED(req);
> +	RTE_SET_USED(ts);
> +	RTE_SET_USED(clb);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
> +{
> +	RTE_SET_USED(msg);
> +	RTE_SET_USED(peer);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +register_mp_requests(void)
> +{
> +	/* The real function succeeds when multi-process is not used, so the stub does too. */
> +	EAL_LOG_STUB();
> +	return 0;
> +}
> +
> +int
> +request_to_primary(struct malloc_mp_req *req)
> +{
> +	RTE_SET_USED(req);
> +	EAL_LOG_NOT_IMPLEMENTED();
> +	return -1;
> +}
> +
> +int
> +request_sync(void)
> +{
> +	/* The common memory allocator depends on this function succeeding. */
> +	EAL_LOG_STUB();
> +	return 0;
> +}
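Callers can distinguish "not implemented" from other failures through
rte_errno, which the stubs above set to ENOTSUP. A hedged sketch of such a
check (illustrative helper, not part of the patch):

#include <rte_eal.h>
#include <rte_errno.h>

/* Returns 1 when IPC is unavailable (e.g. the Windows stubs above),
 * 0 when the message was actually sent.
 */
static int
mp_send_or_skip(struct rte_mp_msg *msg)
{
	if (rte_mp_sendmsg(msg) < 0 && rte_errno == ENOTSUP)
		return 1; /* no multi-process support on this OS */
	return 0;
}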
> diff --git a/lib/librte_eal/windows/eal_windows.h
> b/lib/librte_eal/windows/eal_windows.h
> index 390d2fd66..9735f0293 100644
> --- a/lib/librte_eal/windows/eal_windows.h
> +++ b/lib/librte_eal/windows/eal_windows.h
> @@ -9,8 +9,24 @@
>   * @file Facilities private to Windows EAL
>   */
> 
> +#include <rte_errno.h>
>  #include <rte_windows.h>
> 
> +/**
> + * Log current function as not implemented and set rte_errno.
> + */
> +#define EAL_LOG_NOT_IMPLEMENTED() \
> +	do { \
> +		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
> +		rte_errno = ENOTSUP; \
> +	} while (0)
> +
> +/**
> + * Log current function as a stub.
> + */
> +#define EAL_LOG_STUB() \
> +	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
> +
>  /**
>   * Create a map of processors and cores on the system.
>   */
> @@ -36,4 +52,78 @@ int eal_thread_create(pthread_t *thread);
>   */
>  unsigned int eal_socket_numa_node(unsigned int socket_id);
> 
> +/**
> + * Open virt2phys driver interface device.
> + *
> + * @return 0 on success, (-1) on failure.
> + */
> +int eal_mem_virt2iova_init(void);
> +
> +/**
> + * Locate Win32 memory management routines in system libraries.
> + *
> + * @return 0 on success, (-1) on failure.
> + */
> +int eal_mem_win32api_init(void);
> +
> +/**
> + * Allocate a contiguous chunk of virtual memory.
> + *
> + * Use eal_mem_free() to free allocated memory.
> + *
> + * @param size
> + *  Number of bytes to allocate.
> + * @param page_size
> + *  If non-zero, memory must be allocated in hugepages
> + *  of the specified size. The *size* parameter must then be
> + *  a multiple of the largest hugepage size requested.
> + * @return
> + *  Address of allocated memory, NULL on failure and rte_errno is set.
> + */
> +void *eal_mem_alloc(size_t size, size_t page_size);
> +
> +/**
> + * Allocate new memory in hugepages on the specified NUMA node.
> + *
> + * @param size
> + *  Number of bytes to allocate. Must be a multiple of huge page size.
> + * @param socket_id
> + *  Socket ID.
> + * @return
> + *  Address of the memory allocated on success or NULL on failure.
> + */
> +void *eal_mem_alloc_socket(size_t size, int socket_id);
> +
> +/**
> + * Commit memory previously reserved with eal_mem_reserve()
> + * or decommitted from hugepages by eal_mem_decommit().
> + *
> + * @param requested_addr
> + *  Address within a reserved region. Must not be NULL.
> + * @param size
> + *  Number of bytes to commit. Must be a multiple of page size.
> + * @param socket_id
> + *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
> + * @return
> + *  On success, address of the committed memory, that is, requested_addr.
> + *  On failure, NULL and rte_errno is set.
> + */
> +void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
> +
> +/**
> + * Put allocated or committed memory back into reserved state.
> + *
> + * The *addr* and *size* must match the location and size
> + * of a previously allocated or committed region.
> + *
> + * @param addr
> + *  Address of the region to decommit.
> + * @param size
> + *  Number of bytes to decommit.
> + * @return
> + *  0 on success, (-1) on failure.
> + */
> +int eal_mem_decommit(void *addr, size_t size);
> +
>  #endif /* _EAL_WINDOWS_H_ */
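To make the reserve/commit/decommit lifecycle documented above concrete, a
hedged sketch; it assumes eal_windows.h is included and that *reserved* comes
from the eal_mem_reserve() referenced in the comments (declared elsewhere in
the patchset), so treat it as an illustration only:

#include <string.h>

/* Commit one chunk of a reserved region, touch it, then return it
 * to the reserved state. On failure the wrappers set rte_errno.
 */
static int
commit_touch_decommit(void *reserved, size_t size, int socket_id)
{
	void *va = eal_mem_commit(reserved, size, socket_id);

	if (va == NULL)
		return -1;

	memset(va, 0, size); /* the region is now backed by pages */

	return eal_mem_decommit(va, size);
}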
> diff --git a/lib/librte_eal/windows/include/meson.build
> b/lib/librte_eal/windows/include/meson.build
> index 5fb1962ac..b3534b025 100644
> --- a/lib/librte_eal/windows/include/meson.build
> +++ b/lib/librte_eal/windows/include/meson.build
> @@ -5,5 +5,6 @@ includes += include_directories('.')
> 
>  headers += files(
>          'rte_os.h',
> +        'rte_virt2phys.h',
>          'rte_windows.h',
>  )
> diff --git a/lib/librte_eal/windows/include/rte_os.h
> b/lib/librte_eal/windows/include/rte_os.h
> index 510e39e03..62805a307 100644
> --- a/lib/librte_eal/windows/include/rte_os.h
> +++ b/lib/librte_eal/windows/include/rte_os.h
> @@ -36,6 +36,10 @@ extern "C" {
> 
>  #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
> 
> +#define open _open
> +#define close _close
> +#define unlink _unlink
> +
>  /* cpu_set macros implementation */
>  #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
>  #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
> diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h
> b/lib/librte_eal/windows/include/rte_virt2phys.h
> new file mode 100644
> index 000000000..4bb2b4aaf
> --- /dev/null
> +++ b/lib/librte_eal/windows/include/rte_virt2phys.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2020 Dmitry Kozlyuk
> + */
> +
> +/**
> + * @file virt2phys driver interface
> + */
> +
> +/**
> + * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
> + */
> +DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
> +	0x539c2135, 0x793a, 0x4926,
> +	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
> +
> +/**
> + * Driver device type for IO control codes.
> + */
> +#define VIRT2PHYS_DEVTYPE 0x8000
> +
> +/**
> + * Translate a valid non-paged virtual address to a physical address.
> + *
> + * Note: A physical address zero (0) is reported if input address
> + * is paged out or not mapped. However, if input is a valid mapping
> + * of I/O port 0x0000, output is also zero. There is no way
> + * to distinguish between these cases by return value only.
> + *
> + * Input: a non-paged virtual address (PVOID).
> + *
> + * Output: the corresponding physical address (LARGE_INTEGER).
> + */
> +#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
> +	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED,
> FILE_ANY_ACCESS)
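A hedged user-mode sketch of calling this interface; it assumes *device* was
opened from the interface path located by GUID_DEVINTERFACE_VIRT2PHYS (e.g.
via the SetupDi API) and is not part of the patch:

#include <windows.h>
#include <winioctl.h>

/* Translate one virtual address; returns 0 on success. Keep in mind
 * the note above: a zero physical address is ambiguous.
 */
static int
virt2phys_translate(HANDLE device, PVOID virt, LARGE_INTEGER *phys)
{
	DWORD bytes = 0;

	if (!DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
			&virt, sizeof(virt), phys, sizeof(*phys),
			&bytes, NULL))
		return -1;

	return bytes == sizeof(*phys) ? 0 : -1;
}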
> diff --git a/lib/librte_eal/windows/include/rte_windows.h
> b/lib/librte_eal/windows/include/rte_windows.h
> index ed6e4c148..899ed7d87 100644
> --- a/lib/librte_eal/windows/include/rte_windows.h
> +++ b/lib/librte_eal/windows/include/rte_windows.h
> @@ -23,6 +23,8 @@
> 
>  #include <basetsd.h>
>  #include <psapi.h>
> +#include <setupapi.h>
> +#include <winioctl.h>
> 
>  /* Have GUIDs defined. */
>  #ifndef INITGUID
> diff --git a/lib/librte_eal/windows/include/unistd.h
> b/lib/librte_eal/windows/include/unistd.h
> index 757b7f3c5..6b33005b2 100644
> --- a/lib/librte_eal/windows/include/unistd.h
> +++ b/lib/librte_eal/windows/include/unistd.h
> @@ -9,4 +9,7 @@
>   * as Microsoft libc does not contain unistd.h. This may be removed
>   * in future releases.
>   */
> +
> +#include <io.h>
> +
>  #endif /* _UNISTD_H_ */
> diff --git a/lib/librte_eal/windows/meson.build
> b/lib/librte_eal/windows/meson.build
> index 5f118bfe2..0bd56cd8f 100644
> --- a/lib/librte_eal/windows/meson.build
> +++ b/lib/librte_eal/windows/meson.build
> @@ -8,6 +8,11 @@ sources += files(
>  	'eal_debug.c',
>  	'eal_hugepages.c',
>  	'eal_lcore.c',
> +	'eal_memalloc.c',
> +	'eal_memory.c',
> +	'eal_mp.c',
>  	'eal_thread.c',
>  	'getopt.c',
>  )
> +
> +dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-13  8:24       ` Fady Bader
@ 2020-05-13  8:42         ` Dmitry Kozlyuk
  2020-05-13  9:09           ` Fady Bader
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-13  8:42 UTC (permalink / raw)
  To: Fady Bader
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

On Wed, 13 May 2020 08:24:12 +0000
Fady Bader <fady@mellanox.com> wrote:

> Hi Dmitry,
> I'm using your latest memory management patchset and getting an error
> in the function VirtualAlloc2 in eal_mem_commit, error code: 0x57
> (ERROR_INVALID_PARAMETER). I'm using Windows Server 2019 build 17763,
> and followed the steps to grant the *Lock pages in memory* privilege.
> 
> The parameters that are sent to the function are:
> GetCurrentProcess() is -1.
> requested_addr is 0x0000025b`93800000.
> Size is 0x200000 (sysInfo.dwAllocationGranularity is 0x10000). 
> Flags is 0x20007000.
> Also, Socket_id is 0.
> 
> The call stack is:
> 00 dpdk_mempool_test!eal_mem_commit+0x253 
> 01 dpdk_mempool_test!alloc_seg+0x1b0
> 02 dpdk_mempool_test!alloc_seg_walk+0x2a1 
> 03 dpdk_mempool_test!rte_memseg_list_walk_thread_unsafe+0x81 
> 04 dpdk_mempool_test!eal_memalloc_alloc_seg_bulk+0x1a5 
> 05 dpdk_mempool_test!alloc_pages_on_heap+0x13a 
> 06 dpdk_mempool_test!try_expand_heap_primary+0x1dc 
> 07 dpdk_mempool_test!try_expand_heap+0xf5 
> 08 dpdk_mempool_test!alloc_more_mem_on_socket+0x693 
> 09 dpdk_mempool_test!malloc_heap_alloc_on_heap_id+0x2a7 
> 0a dpdk_mempool_test!malloc_heap_alloc+0x184 
> 0b dpdk_mempool_test!malloc_socket+0xf9
> 0c dpdk_mempool_test!rte_malloc_socket+0x39 
> 0d dpdk_mempool_test!rte_zmalloc_socket+0x31 
> 0e dpdk_mempool_test!rte_zmalloc+0x2d 
> 0f dpdk_mempool_test!rte_mempool_create_empty+0x1c9 
> 10 dpdk_mempool_test!rte_mempool_create+0xf8 

Hi Fady,

Can you share the code snippet causing this?

--
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-13  8:42         ` Dmitry Kozlyuk
@ 2020-05-13  9:09           ` Fady Bader
  2020-05-13  9:22             ` Fady Bader
  2020-05-13  9:38             ` Dmitry Kozlyuk
  0 siblings, 2 replies; 218+ messages in thread
From: Fady Bader @ 2020-05-13  9:09 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov



> -----Original Message-----
> From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> Sent: Wednesday, May 13, 2020 11:43 AM
> To: Fady Bader <fady@mellanox.com>
> Cc: dev@dpdk.org; Dmitry Malloy (MESHCHANINOV)
> <dmitrym@microsoft.com>; Narcisa Ana Maria Vasile
> <Narcisa.Vasile@microsoft.com>; Tal Shnaiderman <talshn@mellanox.com>;
> Thomas Monjalon <thomas@monjalon.net>; Harini Ramakrishnan
> <harini.ramakrishnan@microsoft.com>; Omar Cardona
> <ocardona@microsoft.com>; Pallavi Kadam <pallavi.kadam@intel.com>;
> Ranjit Menon <ranjit.menon@intel.com>; John McNamara
> <john.mcnamara@intel.com>; Marko Kovacevic
> <marko.kovacevic@intel.com>; Anatoly Burakov
> <anatoly.burakov@intel.com>
> Subject: Re: [PATCH v4 8/8] eal/windows: implement basic memory
> management
> 
> On Wed, 13 May 2020 08:24:12 +0000
> Fady Bader <fady@mellanox.com> wrote:
> 
> > Hi Dmitry,
> > I'm using your latest memory management patchset and getting an error
> > in the function VirualAlloc2 in eal_mem_commit, error code: 0x57
> > (ERROR_INVALID_PARAMETER). I'm using Windows server 2019 build
> 17763,
> > and followed the steps to Grant *Lock pages in memory* Privilege.
> >
> > The parameters that are sent to the function are:
> > GetCurrentProcess() is -1.
> > requested_addr is 0x0000025b`93800000.
> > Size is 0x200000 (sysInfo.dwAllocationGranularity is 0x10000).
> > Flags is 0x20007000.
> > Also, Socket_id is 0.
> >
> > The call stack is:
> > 00 dpdk_mempool_test!eal_mem_commit+0x253
> > 01 dpdk_mempool_test!alloc_seg+0x1b0
> > 02 dpdk_mempool_test!alloc_seg_walk+0x2a1
> > 03 dpdk_mempool_test!rte_memseg_list_walk_thread_unsafe+0x81
> > 04 dpdk_mempool_test!eal_memalloc_alloc_seg_bulk+0x1a5
> > 05 dpdk_mempool_test!alloc_pages_on_heap+0x13a
> > 06 dpdk_mempool_test!try_expand_heap_primary+0x1dc
> > 07 dpdk_mempool_test!try_expand_heap+0xf5
> > 08 dpdk_mempool_test!alloc_more_mem_on_socket+0x693
> > 09 dpdk_mempool_test!malloc_heap_alloc_on_heap_id+0x2a7
> > 0a dpdk_mempool_test!malloc_heap_alloc+0x184
> > 0b dpdk_mempool_test!malloc_socket+0xf9
> > 0c dpdk_mempool_test!rte_malloc_socket+0x39
> > 0d dpdk_mempool_test!rte_zmalloc_socket+0x31
> > 0e dpdk_mempool_test!rte_zmalloc+0x2d
> > 0f dpdk_mempool_test!rte_mempool_create_empty+0x1c9
> > 10 dpdk_mempool_test!rte_mempool_create+0xf8
> 
> Hi Fady,
> 
> Can you share the code snippet causing this?
> 

[snip]
+
+void*
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+		if (VirtualQuery(requested_addr, &info, sizeof(info)) == 0) {
+			RTE_LOG_WIN32_ERR("VirtualQuery()");
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) &&
+			!VirtualFree(requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR("VirtualFree(%p, %zu, "
+				"<split placeholder>)", requested_addr, size);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	if (requested_addr != NULL)
+		flags |= MEM_REPLACE_PLACEHOLDER;
+
+	addr = VirtualAlloc2(GetCurrentProcess(), requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, "
+			"<replace placeholder>)", addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	return addr;
+}
+

> --
> Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread
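For reference, the failing flags value 0x20007000 reported above decodes
exactly to the combination eal_mem_commit() builds for the placeholder path
(standard Win32 constants; MEM_REPLACE_PLACEHOLDER requires a recent SDK):

#include <windows.h>
#include <stdio.h>

int
main(void)
{
	/* MEM_COMMIT (0x1000) | MEM_RESERVE (0x2000) |
	 * MEM_REPLACE_PLACEHOLDER (0x4000) | MEM_LARGE_PAGES (0x20000000)
	 */
	DWORD flags = MEM_COMMIT | MEM_RESERVE
		| MEM_REPLACE_PLACEHOLDER | MEM_LARGE_PAGES;

	printf("0x%08lx\n", (unsigned long)flags); /* prints 0x20007000 */
	return 0;
}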

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-13  9:09           ` Fady Bader
@ 2020-05-13  9:22             ` Fady Bader
  2020-05-13  9:38             ` Dmitry Kozlyuk
  1 sibling, 0 replies; 218+ messages in thread
From: Fady Bader @ 2020-05-13  9:22 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov



> -----Original Message-----
> From: Fady Bader
> Sent: Wednesday, May 13, 2020 12:09 PM
> To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> Cc: dev@dpdk.org; Dmitry Malloy (MESHCHANINOV)
> <dmitrym@microsoft.com>; Narcisa Ana Maria Vasile
> <Narcisa.Vasile@microsoft.com>; Tal Shnaiderman <talshn@mellanox.com>;
> Thomas Monjalon <thomas@monjalon.net>; Harini Ramakrishnan
> <harini.ramakrishnan@microsoft.com>; Omar Cardona
> <ocardona@microsoft.com>; Pallavi Kadam <pallavi.kadam@intel.com>;
> Ranjit Menon <ranjit.menon@intel.com>; John McNamara
> <john.mcnamara@intel.com>; Marko Kovacevic
> <marko.kovacevic@intel.com>; Anatoly Burakov
> <anatoly.burakov@intel.com>
> Subject: RE: [PATCH v4 8/8] eal/windows: implement basic memory
> management
> 
> 
> 
> > -----Original Message-----
> > From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > Sent: Wednesday, May 13, 2020 11:43 AM
> > To: Fady Bader <fady@mellanox.com>
> > Cc: dev@dpdk.org; Dmitry Malloy (MESHCHANINOV)
> > <dmitrym@microsoft.com>; Narcisa Ana Maria Vasile
> > <Narcisa.Vasile@microsoft.com>; Tal Shnaiderman
> <talshn@mellanox.com>;
> > Thomas Monjalon <thomas@monjalon.net>; Harini Ramakrishnan
> > <harini.ramakrishnan@microsoft.com>; Omar Cardona
> > <ocardona@microsoft.com>; Pallavi Kadam <pallavi.kadam@intel.com>;
> > Ranjit Menon <ranjit.menon@intel.com>; John McNamara
> > <john.mcnamara@intel.com>; Marko Kovacevic
> > <marko.kovacevic@intel.com>; Anatoly Burakov
> > <anatoly.burakov@intel.com>
> > Subject: Re: [PATCH v4 8/8] eal/windows: implement basic memory
> > management
> >
> > On Wed, 13 May 2020 08:24:12 +0000
> > Fady Bader <fady@mellanox.com> wrote:
> >
> > > Hi Dmitry,
> > > I'm using your latest memory management patchset and getting an
> > > error in the function VirualAlloc2 in eal_mem_commit, error code:
> > > 0x57 (ERROR_INVALID_PARAMETER). I'm using Windows server 2019
> build
> > 17763,
> > > and followed the steps to Grant *Lock pages in memory* Privilege.
> > >
> > > The parameters that are sent to the function are:
> > > GetCurrentProcess() is -1.
> > > requested_addr is 0x0000025b`93800000.
> > > Size is 0x200000 (sysInfo.dwAllocationGranularity is 0x10000).
> > > Flags is 0x20007000.
> > > Also, Socket_id is 0.
> > >
> > > The call stack is:
> > > 00 dpdk_mempool_test!eal_mem_commit+0x253
> > > 01 dpdk_mempool_test!alloc_seg+0x1b0
> > > 02 dpdk_mempool_test!alloc_seg_walk+0x2a1
> > > 03 dpdk_mempool_test!rte_memseg_list_walk_thread_unsafe+0x81
> > > 04 dpdk_mempool_test!eal_memalloc_alloc_seg_bulk+0x1a5
> > > 05 dpdk_mempool_test!alloc_pages_on_heap+0x13a
> > > 06 dpdk_mempool_test!try_expand_heap_primary+0x1dc
> > > 07 dpdk_mempool_test!try_expand_heap+0xf5
> > > 08 dpdk_mempool_test!alloc_more_mem_on_socket+0x693
> > > 09 dpdk_mempool_test!malloc_heap_alloc_on_heap_id+0x2a7
> > > 0a dpdk_mempool_test!malloc_heap_alloc+0x184
> > > 0b dpdk_mempool_test!malloc_socket+0xf9
> > > 0c dpdk_mempool_test!rte_malloc_socket+0x39
> > > 0d dpdk_mempool_test!rte_zmalloc_socket+0x31
> > > 0e dpdk_mempool_test!rte_zmalloc+0x2d
> > > 0f dpdk_mempool_test!rte_mempool_create_empty+0x1c9
> > > 10 dpdk_mempool_test!rte_mempool_create+0xf8
> >
> > Hi Fady,
> >
> > Can you share the code snippet causing this?
> >
> 

I got it from dpdk\app\test\test_mempool

Line 496:
/* create a mempool (without cache) */
mp_nocache = rte_mempool_create("test_nocache", MEMPOOL_SIZE,
	MEMPOOL_ELT_SIZE, 0, 0,
	NULL, NULL,
	my_obj_init, NULL,
	SOCKET_ID_ANY, 0);

> 
> > --
> > Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-13  9:09           ` Fady Bader
  2020-05-13  9:22             ` Fady Bader
@ 2020-05-13  9:38             ` Dmitry Kozlyuk
  2020-05-13 12:25               ` Fady Bader
  1 sibling, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-13  9:38 UTC (permalink / raw)
  To: Fady Bader
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

On Wed, 13 May 2020 09:09:22 +0000
Fady Bader <fady@mellanox.com> wrote:

> > -----Original Message-----
> > From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > Sent: Wednesday, May 13, 2020 11:43 AM
> > To: Fady Bader <fady@mellanox.com>
> > Cc: dev@dpdk.org; Dmitry Malloy (MESHCHANINOV)
> > <dmitrym@microsoft.com>; Narcisa Ana Maria Vasile
> > <Narcisa.Vasile@microsoft.com>; Tal Shnaiderman
> > <talshn@mellanox.com>; Thomas Monjalon <thomas@monjalon.net>;
> > Harini Ramakrishnan <harini.ramakrishnan@microsoft.com>; Omar
> > Cardona <ocardona@microsoft.com>; Pallavi Kadam
> > <pallavi.kadam@intel.com>; Ranjit Menon <ranjit.menon@intel.com>;
> > John McNamara <john.mcnamara@intel.com>; Marko Kovacevic
> > <marko.kovacevic@intel.com>; Anatoly Burakov
> > <anatoly.burakov@intel.com>
> > Subject: Re: [PATCH v4 8/8] eal/windows: implement basic memory
> > management
> > 
> > On Wed, 13 May 2020 08:24:12 +0000
> > Fady Bader <fady@mellanox.com> wrote:
> >   
> > > Hi Dmitry,
> > > I'm using your latest memory management patchset and getting an
> > > error in the function VirtualAlloc2 in eal_mem_commit, error code:
> > > 0x57 (ERROR_INVALID_PARAMETER). I'm using Windows Server 2019
> > > build 17763,
> > > and followed the steps to grant the *Lock pages in memory* privilege.
> > >
> > > The parameters that are sent to the function are:
> > > GetCurrentProcess() is -1.
> > > requested_addr is 0x0000025b`93800000.
> > > Size is 0x200000 (sysInfo.dwAllocationGranularity is 0x10000).
> > > Flags is 0x20007000.
> > > Also, Socket_id is 0.
> > >
> > > The call stack is:
> > > 00 dpdk_mempool_test!eal_mem_commit+0x253
> > > 01 dpdk_mempool_test!alloc_seg+0x1b0
> > > 02 dpdk_mempool_test!alloc_seg_walk+0x2a1
> > > 03 dpdk_mempool_test!rte_memseg_list_walk_thread_unsafe+0x81
> > > 04 dpdk_mempool_test!eal_memalloc_alloc_seg_bulk+0x1a5
> > > 05 dpdk_mempool_test!alloc_pages_on_heap+0x13a
> > > 06 dpdk_mempool_test!try_expand_heap_primary+0x1dc
> > > 07 dpdk_mempool_test!try_expand_heap+0xf5
> > > 08 dpdk_mempool_test!alloc_more_mem_on_socket+0x693
> > > 09 dpdk_mempool_test!malloc_heap_alloc_on_heap_id+0x2a7
> > > 0a dpdk_mempool_test!malloc_heap_alloc+0x184
> > > 0b dpdk_mempool_test!malloc_socket+0xf9
> > > 0c dpdk_mempool_test!rte_malloc_socket+0x39
> > > 0d dpdk_mempool_test!rte_zmalloc_socket+0x31
> > > 0e dpdk_mempool_test!rte_zmalloc+0x2d
> > > 0f dpdk_mempool_test!rte_mempool_create_empty+0x1c9
> > > 10 dpdk_mempool_test!rte_mempool_create+0xf8  
> > 
> > Hi Fady,
> > 
> > Can you share the code snippet causing this?
> >   
> 
> [snip]
[snip]

I meant the code of the application that calls rte_mempool_create(). Or
is it one of the DPDK test applications?

--
Dmitry Kozlyuk  

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-13  9:38             ` Dmitry Kozlyuk
@ 2020-05-13 12:25               ` Fady Bader
  2020-05-18  0:17                 ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Fady Bader @ 2020-05-13 12:25 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov



> -----Original Message-----
> From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> Sent: Wednesday, May 13, 2020 12:39 PM
> To: Fady Bader <fady@mellanox.com>
> Cc: dev@dpdk.org; Dmitry Malloy (MESHCHANINOV)
> <dmitrym@microsoft.com>; Narcisa Ana Maria Vasile
> <Narcisa.Vasile@microsoft.com>; Tal Shnaiderman <talshn@mellanox.com>;
> Thomas Monjalon <thomas@monjalon.net>; Harini Ramakrishnan
> <harini.ramakrishnan@microsoft.com>; Omar Cardona
> <ocardona@microsoft.com>; Pallavi Kadam <pallavi.kadam@intel.com>;
> Ranjit Menon <ranjit.menon@intel.com>; John McNamara
> <john.mcnamara@intel.com>; Marko Kovacevic
> <marko.kovacevic@intel.com>; Anatoly Burakov
> <anatoly.burakov@intel.com>
> Subject: Re: [PATCH v4 8/8] eal/windows: implement basic memory
> management
> 
> On Wed, 13 May 2020 09:09:22 +0000
> Fady Bader <fady@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > > Sent: Wednesday, May 13, 2020 11:43 AM
> > > To: Fady Bader <fady@mellanox.com>
> > > Cc: dev@dpdk.org; Dmitry Malloy (MESHCHANINOV)
> > > <dmitrym@microsoft.com>; Narcisa Ana Maria Vasile
> > > <Narcisa.Vasile@microsoft.com>; Tal Shnaiderman
> > > <talshn@mellanox.com>; Thomas Monjalon <thomas@monjalon.net>;
> Harini
> > > Ramakrishnan <harini.ramakrishnan@microsoft.com>; Omar Cardona
> > > <ocardona@microsoft.com>; Pallavi Kadam <pallavi.kadam@intel.com>;
> > > Ranjit Menon <ranjit.menon@intel.com>; John McNamara
> > > <john.mcnamara@intel.com>; Marko Kovacevic
> > > <marko.kovacevic@intel.com>; Anatoly Burakov
> > > <anatoly.burakov@intel.com>
> > > Subject: Re: [PATCH v4 8/8] eal/windows: implement basic memory
> > > management
> > >
> > > On Wed, 13 May 2020 08:24:12 +0000
> > > Fady Bader <fady@mellanox.com> wrote:
> > >
> > > > Hi Dmitry,
> > > > I'm using your latest memory management patchset and getting an
> > > > error in the function VirtualAlloc2 in eal_mem_commit, error code:
> > > > 0x57 (ERROR_INVALID_PARAMETER). I'm using Windows Server 2019
> > > > build 17763,
> > > > and followed the steps to grant the *Lock pages in memory* privilege.
> > > >
> > > > The parameters that are sent to the function are:
> > > > GetCurrentProcess() is -1.
> > > > requested_addr is 0x0000025b`93800000.
> > > > Size is 0x200000 (sysInfo.dwAllocationGranularity is 0x10000).
> > > > Flags is 0x20007000.
> > > > Also, Socket_id is 0.
> > > >
> > > > The call stack is:
> > > > 00 dpdk_mempool_test!eal_mem_commit+0x253
> > > > 01 dpdk_mempool_test!alloc_seg+0x1b0
> > > > 02 dpdk_mempool_test!alloc_seg_walk+0x2a1
> > > > 03 dpdk_mempool_test!rte_memseg_list_walk_thread_unsafe+0x81
> > > > 04 dpdk_mempool_test!eal_memalloc_alloc_seg_bulk+0x1a5
> > > > 05 dpdk_mempool_test!alloc_pages_on_heap+0x13a
> > > > 06 dpdk_mempool_test!try_expand_heap_primary+0x1dc
> > > > 07 dpdk_mempool_test!try_expand_heap+0xf5
> > > > 08 dpdk_mempool_test!alloc_more_mem_on_socket+0x693
> > > > 09 dpdk_mempool_test!malloc_heap_alloc_on_heap_id+0x2a7
> > > > 0a dpdk_mempool_test!malloc_heap_alloc+0x184
> > > > 0b dpdk_mempool_test!malloc_socket+0xf9
> > > > 0c dpdk_mempool_test!rte_malloc_socket+0x39
> > > > 0d dpdk_mempool_test!rte_zmalloc_socket+0x31
> > > > 0e dpdk_mempool_test!rte_zmalloc+0x2d
> > > > 0f dpdk_mempool_test!rte_mempool_create_empty+0x1c9
> > > > 10 dpdk_mempool_test!rte_mempool_create+0xf8
> > >
> > > Hi Fady,
> > >
> > > Can you share the code snippet causing this?
> > >
> >
> > [snip]
> [snip]
> 
> I meant the code of the application that calls rte_mempool_create(). Or is it
> one of the DPDK test applications?

I got it from dpdk\app\test\test_mempool

Line 496:
/* create a mempool (without cache) */
mp_nocache = rte_mempool_create("test_nocache", MEMPOOL_SIZE,
	MEMPOOL_ELT_SIZE, 0, 0,
	NULL, NULL,
	my_obj_init, NULL,
	SOCKET_ID_ANY, 0);

> 
> --
> Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-13 12:25               ` Fady Bader
@ 2020-05-18  0:17                 ` Dmitry Kozlyuk
  2020-05-18 22:25                   ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-18  0:17 UTC (permalink / raw)
  To: Fady Bader
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

On Wed, 13 May 2020 12:25:10 +0000
Fady Bader <fady@mellanox.com> wrote:
[snip]
> > 
> > I meant the code of the application that calls
> > rte_mempool_create(). Or is it one of the DPDK test applications?  
> 
> I got it from dpdk\app\test\test_mempool
> 
> Line 496:
> /* create a mempool (without cache) */
> mp_nocache = rte_mempool_create("test_nocache", MEMPOOL_SIZE,
> 	MEMPOOL_ELT_SIZE, 0, 0,
> 	NULL, NULL,
> 	my_obj_init, NULL,
> 	SOCKET_ID_ANY, 0);
>

For building this code you must have enabled librte_ring,
librte_mempool, and drivers/mempool, and I assume you build test code
without librte_cmdline somehow. These are nontrivial changes, so I can't
be sure I reproduce them exactly. Can you please share a complete
patch?

Meanwhile, I observe a similar issue where rte_mempool_create() fails
to allocate memory and hangs when compiled with Clang, succeeds with
native MinGW, but still hangs with cross MinGW. I'm investigating it.

Testing patch follows, the snippet added is in
examples/helloworld/main.

---

diff --git a/config/meson.build b/config/meson.build
index b6d84687f..018726f75 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -117,7 +117,7 @@ if not is_windows
 endif
 
 # use pthreads if available for the platform
-if not is_ms_linker
+if cc.find_library('pthread', required: false).found()
 	add_project_link_arguments('-pthread', language: 'c')
 	dpdk_extra_ldflags += '-pthread'
 endif
diff --git a/drivers/meson.build b/drivers/meson.build
index dc293b270..ee565bc19 100644
--- a/drivers/meson.build
+++ b/drivers/meson.build
@@ -2,8 +2,8 @@
 # Copyright(c) 2017-2019 Intel Corporation
 
 if is_windows
-	subdir_done()
-endif
+	dpdk_driver_classes = ['mempool']
+else
 
>  # Defines the order in which the drivers are built.
 dpdk_driver_classes = ['common',
@@ -16,7 +16,7 @@ dpdk_driver_classes = ['common',
 	       'vdpa',    # depends on common, bus and mempool.
 	       'event',   # depends on common, bus, mempool and net.
 	       'baseband'] # depends on common and bus.
-
+endif
 disabled_drivers = run_command(list_dir_globs, get_option('disable_drivers'),
 		).stdout().split()
 
@@ -78,13 +78,13 @@ foreach class:dpdk_driver_classes
 			shared_deps = ext_deps
 			static_deps = ext_deps
 			foreach d:deps
-				if not is_variable('shared_rte_' + d)
+				if not is_variable('static_rte_' + d)
 					build = false
 					reason = 'Missing internal dependency, "@0@"'.format(d)
 					message('Disabling @1@ [@2@]: missing internal dependency "@0@"'
 							.format(d, name, 'drivers/' + drv_path))
 				else
-					shared_deps += [get_variable('shared_rte_' + d)]
+					# shared_deps += [get_variable('shared_rte_' + d)]
 					static_deps += [get_variable('static_rte_' + d)]
 				endif
 			endforeach
@@ -110,6 +110,7 @@ foreach class:dpdk_driver_classes
 
 			dpdk_extra_ldflags += pkgconfig_extra_libs
 
+			if host_machine.system() != 'windows'
 			# generate pmdinfo sources by building a temporary
 			# lib and then running pmdinfogen on the contents of
 			# that lib. The final lib reuses the object files and
@@ -126,7 +127,7 @@ foreach class:dpdk_driver_classes
 						'@OUTPUT@', pmdinfogen],
 					output: out_filename,
 					depends: [pmdinfogen, tmp_lib])
-
+			endif
 			version_map = '@0@/@1@/@2@_version.map'.format(
 					meson.current_source_dir(),
 					drv_path, lib_name)
@@ -178,31 +179,31 @@ foreach class:dpdk_driver_classes
 					output: lib_name + '.sym_chk')
 			endif
 
-			shared_lib = shared_library(lib_name,
-				sources,
-				objects: objs,
-				include_directories: includes,
-				dependencies: shared_deps,
-				c_args: cflags,
-				link_args: lk_args,
-				link_depends: lk_deps,
-				version: lib_version,
-				soversion: so_version,
-				install: true,
-				install_dir: driver_install_path)
-
-			# create a dependency object and add it to the global dictionary so
-			# testpmd or other built-in apps can find it if necessary
-			shared_dep = declare_dependency(link_with: shared_lib,
-					include_directories: includes,
-					dependencies: shared_deps)
+			# shared_lib = shared_library(lib_name,
+			# 	sources,
+			# 	objects: objs,
+			# 	include_directories: includes,
+			# 	dependencies: shared_deps,
+			# 	c_args: cflags,
+			# 	link_args: lk_args,
+			# 	link_depends: lk_deps,
+			# 	version: lib_version,
+			# 	soversion: so_version,
+			# 	install: true,
+			# 	install_dir: driver_install_path)
+
+			# # create a dependency object and add it to the global dictionary so
+			# # testpmd or other built-in apps can find it if necessary
+			# shared_dep = declare_dependency(link_with: shared_lib,
+			# 		include_directories: includes,
+			# 		dependencies: shared_deps)
 			static_dep = declare_dependency(link_with: static_lib,
 					include_directories: includes,
 					dependencies: static_deps)
 
 			dpdk_drivers += static_lib
 
-			set_variable('shared_@0@'.format(lib_name), shared_dep)
+			# set_variable('shared_@0@'.format(lib_name), shared_dep)
 			set_variable('static_@0@'.format(lib_name), static_dep)
 			dependency_name = ''.join(lib_name.split('rte_'))
 			message('drivers/@0@: Defining dependency "@1@"'.format(
diff --git a/examples/helloworld/main.c b/examples/helloworld/main.c
index 968045f1b..cf895c840 100644
--- a/examples/helloworld/main.c
+++ b/examples/helloworld/main.c
@@ -14,6 +14,8 @@
 #include <rte_per_lcore.h>
 #include <rte_lcore.h>
 #include <rte_debug.h>
+#include <rte_mempool.h>
+#include <rte_errno.h>
 
 static int
 lcore_hello(__rte_unused void *arg)
@@ -29,11 +31,30 @@ main(int argc, char **argv)
 {
 	int ret;
 	unsigned lcore_id;
+	struct rte_mempool *pool;
+
+	rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
 
 	ret = rte_eal_init(argc, argv);
 	if (ret < 0)
 		rte_panic("Cannot init EAL\n");
 
+	pool = rte_mempool_create(
+		"test_mempool",
+		(1 << 18) - 1,
+		(1 << 12),
+		1 << 9,
+		0,
+		NULL, NULL, NULL, NULL,
+		SOCKET_ID_ANY,
+		0);
+	if (!pool) {
+		RTE_LOG(ERR, USER1, "cannot create mempool: %d\n", rte_errno);
+		return EXIT_FAILURE;
+	}
+
+	rte_mempool_free(pool);
+
 	/* call lcore_hello() on every slave lcore */
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		rte_eal_remote_launch(lcore_hello, NULL, lcore_id);
diff --git a/examples/meson.build b/examples/meson.build
index 3b540012f..407322dec 100644
--- a/examples/meson.build
+++ b/examples/meson.build
@@ -83,7 +83,7 @@ foreach example: examples
 	includes = [include_directories(example)]
 	deps = ['eal', 'mempool', 'net', 'mbuf', 'ethdev', 'cmdline']
 	if is_windows
-		deps = ['eal'] # only supported lib on Windows currently
+		deps = ['eal', 'mempool', 'ring'] # only supported lib on Windows currently
 	endif
 	subdir(example)
 
diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 0bde995b5..017eebba5 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -12,7 +12,6 @@
 #include <inttypes.h>
 #include <errno.h>
 #include <sys/queue.h>
-#include <sys/mman.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -148,7 +147,7 @@ get_min_page_size(int socket_id)
 
 	rte_memseg_list_walk(find_min_pagesz, &wa);
 
-	return wa.min == SIZE_MAX ? (size_t) getpagesize() : wa.min;
+	return wa.min == SIZE_MAX ? (size_t) rte_get_page_size() : wa.min;
 }
 
 
@@ -526,7 +525,7 @@ rte_mempool_get_page_size(struct rte_mempool *mp, size_t *pg_sz)
 	else if (rte_eal_has_hugepages() || alloc_in_ext_mem)
 		*pg_sz = get_min_page_size(mp->socket_id);
 	else
-		*pg_sz = getpagesize();
+		*pg_sz = rte_get_page_size();
 
 	rte_mempool_trace_get_page_size(mp, *pg_sz);
 	return 0;
@@ -686,7 +685,7 @@ get_anon_size(const struct rte_mempool *mp)
 	size_t min_chunk_size;
 	size_t align;
 
-	pg_sz = getpagesize();
+	pg_sz = rte_get_page_size();
 	pg_shift = rte_bsf32(pg_sz);
 	size = rte_mempool_ops_calc_mem_size(mp, mp->size, pg_shift,
 					     &min_chunk_size, &align);
@@ -710,7 +709,7 @@ rte_mempool_memchunk_anon_free(struct rte_mempool_memhdr *memhdr,
 	if (size < 0)
 		return;
 
-	munmap(opaque, size);
+	rte_mem_unmap(opaque, size);
 }
 
 /* populate the mempool with an anonymous mapping */
@@ -740,20 +739,20 @@ rte_mempool_populate_anon(struct rte_mempool *mp)
 	}
 
 	/* get chunk of virtually continuous memory */
-	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
-		MAP_SHARED | MAP_ANONYMOUS, -1, 0);
-	if (addr == MAP_FAILED) {
-		rte_errno = errno;
+	addr = rte_mem_map(NULL, size, RTE_PROT_READ | RTE_PROT_WRITE,
+		RTE_MAP_SHARED | RTE_MAP_ANONYMOUS, -1, 0);
+	if (addr == NULL) {
 		return 0;
 	}
 	/* can't use MMAP_LOCKED, it does not exist on BSD */
-	if (mlock(addr, size) < 0) {
-		rte_errno = errno;
-		munmap(addr, size);
+	if (rte_mem_lock(addr, size) < 0) {
+		ret = rte_errno;
+		rte_mem_unmap(addr, size);
+		rte_errno = ret;
 		return 0;
 	}
 
-	ret = rte_mempool_populate_virt(mp, addr, size, getpagesize(),
+	ret = rte_mempool_populate_virt(mp, addr, size, rte_get_page_size(),
 		rte_mempool_memchunk_anon_free, addr);
 	if (ret == 0) /* should not happen */
 		ret = -ENOBUFS;
diff --git a/lib/meson.build b/lib/meson.build
index d190d84ef..77da5216f 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -36,7 +36,12 @@ libraries = [
 	'flow_classify', 'bpf', 'graph', 'node']
 
 if is_windows
-	libraries = ['kvargs','eal'] # only supported libraries for windows
+	libraries = [
+		'kvargs',
+		'eal',
+		'ring',
+		'mempool',
+		]
 endif
 
 default_cflags = machine_args
@@ -56,6 +61,7 @@ foreach l:libraries
 	use_function_versioning = false
 	sources = []
 	headers = []
+	headers_compat = {}
 	includes = []
 	cflags = default_cflags
 	objs = [] # other object files to link against, used e.g. for
@@ -77,11 +83,11 @@ foreach l:libraries
 		shared_deps = ext_deps
 		static_deps = ext_deps
 		foreach d:deps
-			if not is_variable('shared_rte_' + d)
+			if not is_variable('static_rte_' + d)
 				error('Missing internal dependency "@0@" for @1@ [@2@]'
 						.format(d, name, 'lib/' + dir_name))
 			endif
-			shared_deps += [get_variable('shared_rte_' + d)]
+			# shared_deps += [get_variable('shared_rte_' + d)]
 			static_deps += [get_variable('static_rte_' + d)]
 		endforeach
 	endif
@@ -94,6 +100,12 @@ foreach l:libraries
 		dpdk_conf.set('RTE_LIBRTE_' + name.to_upper(), 1)
 		install_headers(headers)
 
+		if is_windows and (name == 'eal')
+			foreach dir, header : headers_compat
+				install_headers(header, subdir: dir)
+			endforeach
+		endif
+
 		libname = 'rte_' + name
 		includes += include_directories(dir_name)
 
@@ -171,29 +183,29 @@ foreach l:libraries
 					output: name + '.sym_chk')
 			endif
 
-			shared_lib = shared_library(libname,
-					sources,
-					objects: objs,
-					c_args: cflags,
-					dependencies: shared_deps,
-					include_directories: includes,
-					link_args: lk_args,
-					link_depends: lk_deps,
-					version: lib_version,
-					soversion: so_version,
-					install: true)
-			shared_dep = declare_dependency(link_with: shared_lib,
-					include_directories: includes,
-					dependencies: shared_deps)
-
-			dpdk_libraries = [shared_lib] + dpdk_libraries
+			# shared_lib = shared_library(libname,
+			# 		sources,
+			# 		objects: objs,
+			# 		c_args: cflags,
+			# 		dependencies: shared_deps,
+			# 		include_directories: includes,
+			# 		link_args: lk_args,
+			# 		link_depends: lk_deps,
+			# 		version: lib_version,
+			# 		soversion: so_version,
+			# 		install: true)
+			# shared_dep = declare_dependency(link_with: shared_lib,
+			# 		include_directories: includes,
+			# 		dependencies: shared_deps)
+
+			# dpdk_libraries = [shared_lib] + dpdk_libraries
 			dpdk_static_libraries = [static_lib] + dpdk_static_libraries
 			if libname == 'rte_node'
 				dpdk_graph_nodes = [static_lib]
 			endif
 		endif # sources.length() > 0
 
-		set_variable('shared_rte_' + name, shared_dep)
+		# set_variable('shared_rte_' + name, shared_dep)
 		set_variable('static_rte_' + name, static_dep)
 		message('lib/@0@: Defining dependency "@1@"'.format(
 				dir_name, name))


--
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management
  2020-05-18  0:17                 ` Dmitry Kozlyuk
@ 2020-05-18 22:25                   ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-18 22:25 UTC (permalink / raw)
  To: Fady Bader
  Cc: dev, Dmitry Malloy (MESHCHANINOV),
	Narcisa Ana Maria Vasile, Tal Shnaiderman, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

On Mon, 18 May 2020 03:17:04 +0300
Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:

> On Wed, 13 May 2020 12:25:10 +0000
> Fady Bader <fady@mellanox.com> wrote:
> [snip]
> > > 
> > > I meant the code of the application that calls
> > > rte_mempool_create(). Or is it one of the DPDK test applications?
> > >    
> > 
> > I got it from dpdk\app\test\test_mempool
> > 
> > Line 496:
> > /* create a mempool (without cache) */
> > mp_nocache = rte_mempool_create("test_nocache", MEMPOOL_SIZE,
> > 	MEMPOOL_ELT_SIZE, 0, 0,
> > 	NULL, NULL,
> > 	my_obj_init, NULL,
> > 	SOCKET_ID_ANY, 0);
> >  
> 
> For building this code you must have enabled librte_ring,
> librte_mempool, and drivers/mempool, and I assume you build test code
> without librte_cmdline somehow. These are nontrivial changes, so I
> can't be sure to reproduce them exactly. Can you please share a
> complete patch?

Never mind, managed to reproduce it.

> Meanwhile, I observe a similar issue where rte_mempool_create() fails
> to allocate memory and hangs when compiled with Clang, succeeds with
> native MinGW, but still hangs with cross MinGW. I'm investigating it.

I must have messed up my build setup, because I no longer observe this,
now that I'm testing from scratch with different toolchains and their
versions.
 
--
Dmitry Kozlyuk


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 0/8] Windows basic memory management
  2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
                       ` (7 preceding siblings ...)
  2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-05-25  0:37     ` Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
                         ` (11 more replies)
  8 siblings, 12 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk

Note: cover letter updated for v5.

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing without IOVA is available.

Testing revealed that Windows Server 2019 does not allow allocating
hugepage memory at a reserved address, despite the advertised API, so the
allocator has to temporarily free the region to be allocated. This
creates an inherent race condition. The issue is being discussed with
Microsoft privately.

New EAL public functions for memory mapping are introduced to mitigate
OS differences in DPDK libraries and applications: rte_mem_map,
rte_mem_unmap, rte_mem_lock, rte_get_page_size.
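
For illustration, a minimal sketch of how these wrappers replace direct
mmap()/munmap() calls (the helper name is hypothetical; the rte_errno
handling mirrors the mempool change in this series):

#include <rte_errno.h>
#include <rte_memory.h>

/* Sketch: allocate an anonymous mapping and lock it in RAM. */
static void *
scratch_alloc(size_t size)
{
	void *va = rte_mem_map(NULL, size,
			RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_SHARED | RTE_MAP_ANONYMOUS, -1, 0);
	if (va == NULL)
		return NULL; /* rte_errno holds the OS error */
	if (rte_mem_lock(va, size) < 0) {
		int ret = rte_errno; /* rte_mem_unmap() may overwrite it */
		rte_mem_unmap(va, size);
		rte_errno = ret;
		return NULL;
	}
	return va;
}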

To support common MM routines, internal wrappers for low-level memory
reservation and file management are introduced. These changes affect
Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
(suggested by Thomas).

To avoid code duplication between Linux and Windows EAL, common code
for EALs supporting dynamic memory allocation is extracted
(discussed with Anatoly Burakov in v4 thread). This is a separate
patch to ease the review, but it can be merged with the previous one.

EAL tracepoints save size_t values as long, which is invalid on Windows.
New size_t emitter for tracepoints is introduced (suggested by Jerin
Jacob to Fady Bader, see [1]). Also, to avoid a workaround in every file
using the tracepoints, stubs are added to Windows EAL.
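
For example, a trace point recording a size_t value could then be
defined as below (a sketch: the trace point name is made up, and it is
assumed the new emitter is named rte_trace_point_emit_size_t as in the
referenced thread):

RTE_TRACE_POINT(
	rte_eal_trace_example_map,
	RTE_TRACE_POINT_ARGS(size_t len),
	rte_trace_point_emit_size_t(len);
)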

Entire <sys/queue.h> is imported from FreeBSD, replacing existing
partial import. There is already a license exception for this file.
The file is imported as-is, so it causes a bunch of checkpatch warnings.

[1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html

---

v5:
    * Fix allocation and deallocation on Windows Server (Fady Bader).
    * Replace remaining VirtualFree with VirtualFreeEx (Ranjit Menon).
    * Fix errors in eal_get_virtual_area (Anatoly Burakov).
    * Fix error handling and documentation for rte_mem_lock (Anatoly Burakov).
    * Extract common code for EALs w/dynamic allocation (Anatoly Burakov).
    * Use POSIX value for rte_errno after rte_mem_unmap() on Windows.
    * Add stubs to use tracing functions without workarounds.

v4:

    * Rebase on ToT, drop patches merged into master.
    * Rearrange patches to split Windows code (Jerin).
    * Fix Linux and FreeBSD build with make (Ophir).
    * Use int instead of enum to hold a set of flags (Anatoly).
    * Rename eal_mem_reserve items and fix their description (Anatoly).
    * Add eal_mem_set_dump() wrapper around madvise (Anatoly).
    * Don't claim Windows Server 2016 support due to lack of API (Tal).
    * Replace enum rte_page_sizes with a set of #defines (Jerin).
    * Fix documentation, SPDX tags, logging (Thomas).

v3:

    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:

    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.


Dmitry Kozlyuk (11):
  eal: replace rte_page_sizes with a set of constants
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal/mem: extract common code for memseg list initialization
  eal/mem: extract common code for dynamic memory allocation
  trace: add size_t field emitter
  eal/windows: add tracing support stubs
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal/windows: implement basic memory management

 config/meson.build                            |  12 +-
 doc/guides/windows_gsg/build_dpdk.rst         |  20 -
 doc/guides/windows_gsg/index.rst              |   1 +
 doc/guides/windows_gsg/run_apps.rst           |  95 +++
 lib/librte_eal/common/eal_common_dynmem.c     | 498 ++++++++++++
 lib/librte_eal/common/eal_common_fbarray.c    |  58 +-
 lib/librte_eal/common/eal_common_memory.c     | 152 +++-
 lib/librte_eal/common/eal_common_thread.c     |   5 +-
 lib/librte_eal/common/eal_private.h           | 228 +++++-
 lib/librte_eal/common/meson.build             |  16 +
 lib/librte_eal/common/rte_malloc.c            |   9 +
 lib/librte_eal/freebsd/Makefile               |   5 +
 lib/librte_eal/freebsd/eal_memory.c           |  94 +--
 lib/librte_eal/include/rte_eal_trace.h        |   8 +-
 lib/librte_eal/include/rte_memory.h           | 111 ++-
 lib/librte_eal/include/rte_trace_point.h      |   1 +
 lib/librte_eal/linux/Makefile                 |   5 +
 lib/librte_eal/linux/eal_memalloc.c           |   5 +-
 lib/librte_eal/linux/eal_memory.c             | 605 +--------------
 lib/librte_eal/meson.build                    |   4 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/rte_eal_version.map            |   6 +
 lib/librte_eal/unix/eal_unix.c                |  51 ++
 lib/librte_eal/unix/eal_unix_memory.c         | 152 ++++
 lib/librte_eal/unix/meson.build               |   7 +
 lib/librte_eal/windows/eal.c                  | 190 +++++
 lib/librte_eal/windows/eal_hugepages.c        | 108 +++
 lib/librte_eal/windows/eal_lcore.c            | 185 +++--
 lib/librte_eal/windows/eal_memalloc.c         | 442 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  85 +++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |   9 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/sys/queue.h    | 663 ++++++++++++++--
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   6 +
 lib/librte_mempool/rte_mempool_trace.h        |  10 +-
 40 files changed, 3903 insertions(+), 915 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c
 create mode 100644 lib/librte_eal/unix/eal_unix.c
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 01/11] eal: replace rte_page_sizes with a set of constants
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                         ` (10 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, Anatoly Burakov

Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
Enum rte_page_sizes has members valued above this limit, which get
wrapped to zero, resulting in compilation error (duplicate values in
enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.

Remove rte_page_sizes and replace its values with #define's.
This enumeration is not used in public API, so there's no ABI breakage.
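
A standalone sketch of the failure mode (not DPDK code): with enum
values limited to a 32-bit int, both oversized members wrap to the same
value.

/* Under MS ABI, enum values must fit into a 32-bit int. */
enum page_sizes {
	PGSIZE_1G  = 1ULL << 30, /* fits, OK */
	PGSIZE_4G  = 1ULL << 32, /* wraps to 0 */
	PGSIZE_16G = 1ULL << 34  /* also wraps to 0: duplicate value */
};

Plain #define's avoid the problem because they remain 64-bit integer
constants.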

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---

Release notes for 20.08 don't exist yet, so not adding anything.

 lib/librte_eal/include/rte_memory.h | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..65374d53a 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -24,19 +24,16 @@ extern "C" {
 #include <rte_config.h>
 #include <rte_fbarray.h>
 
-__extension__
-enum rte_page_sizes {
-	RTE_PGSIZE_4K    = 1ULL << 12,
-	RTE_PGSIZE_64K   = 1ULL << 16,
-	RTE_PGSIZE_256K  = 1ULL << 18,
-	RTE_PGSIZE_2M    = 1ULL << 21,
-	RTE_PGSIZE_16M   = 1ULL << 24,
-	RTE_PGSIZE_256M  = 1ULL << 28,
-	RTE_PGSIZE_512M  = 1ULL << 29,
-	RTE_PGSIZE_1G    = 1ULL << 30,
-	RTE_PGSIZE_4G    = 1ULL << 32,
-	RTE_PGSIZE_16G   = 1ULL << 34,
-};
+#define RTE_PGSIZE_4K   (1ULL << 12)
+#define RTE_PGSIZE_64K  (1ULL << 16)
+#define RTE_PGSIZE_256K (1ULL << 18)
+#define RTE_PGSIZE_2M   (1ULL << 21)
+#define RTE_PGSIZE_16M  (1ULL << 24)
+#define RTE_PGSIZE_256M (1ULL << 28)
+#define RTE_PGSIZE_512M (1ULL << 29)
+#define RTE_PGSIZE_1G   (1ULL << 30)
+#define RTE_PGSIZE_4G   (1ULL << 32)
+#define RTE_PGSIZE_16G  (1ULL << 34)
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-28  7:59         ` Thomas Monjalon
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
                         ` (9 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson

EAL common code uses file locking and truncation. Introduce
OS-independent wrappers in order to support both Linux/FreeBSD
and Windows:

* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

Wrappers follow POSIX semantics, but the interface is not POSIX, so that
it can be made cleaner, e.g. by not mixing the locking operation with
the behavior on conflict.
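
A sketch of the intended call pattern (the helper is hypothetical, the
calls match the fbarray conversion below):

#include <rte_errno.h>

#include "eal_private.h"

/* Sketch: take an exclusive lock, failing instead of blocking. */
static int
config_try_lock(int fd)
{
	if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0)
		return -1; /* rte_errno is set as per flock(3) */
	return 0;
}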

The implementation for Linux and FreeBSD is placed in the "unix"
subdirectory, which is intended for code common to the two. Files are
named after the files in the OS subdirectory from which the code is
factored.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c | 21 ++++-----
 lib/librte_eal/common/eal_private.h        | 47 ++++++++++++++++++++
 lib/librte_eal/freebsd/Makefile            |  4 ++
 lib/librte_eal/linux/Makefile              |  4 ++
 lib/librte_eal/meson.build                 |  4 ++
 lib/librte_eal/unix/eal_unix.c             | 51 ++++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |  6 +++
 7 files changed, 124 insertions(+), 13 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_unix.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 4f8f1af73..cfcab63e9 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -8,8 +8,8 @@
 #include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -85,10 +85,8 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 
@@ -778,7 +776,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 					__func__, path, strerror(errno));
 			rte_errno = errno;
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
 					__func__, path, strerror(errno));
 			rte_errno = EBUSY;
@@ -789,10 +788,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -895,10 +892,8 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -1025,7 +1020,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,7 +1037,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 869ce183a..cef73d6fe 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -420,4 +420,51 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
 
 void eal_free_no_trace(void *addr);
 
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index af95386d4..4654ca2b3 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 48cc34844..4f39d462c 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index e301f4558..8d492897d 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal_unix.c b/lib/librte_eal/unix/eal_unix.c
new file mode 100644
index 000000000..b9c57ef18
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix.c
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..cfa1b4ef9
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal_unix.c',
+)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-27  6:33         ` Ray Kinsella
                           ` (2 more replies)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
                         ` (8 subsequent siblings)
  11 siblings, 3 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson, Ray Kinsella, Neil Horman

Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:

* rte_mem_map()
* rte_mem_unmap()
* rte_get_page_size()
* rte_mem_lock()

Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive.
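
A sketch of the reserve-then-commit pattern these functions enable
(mirroring eal_get_virtual_area() and resize_and_map() below; fd, size
and error handling are abbreviated):

/* Reserve address space without committing memory. */
void *va = eal_mem_reserve(NULL, size, 0);
if (va == NULL)
	return -1;

/* Later: map file-backed pages exactly at the reserved address. */
if (rte_mem_map(va, size, RTE_PROT_READ | RTE_PROT_WRITE,
		RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0) == NULL) {
	eal_mem_free(va, size);
	return -1;
}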

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c |  37 +++--
 lib/librte_eal/common/eal_common_memory.c  |  60 +++-----
 lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
 lib/librte_eal/freebsd/Makefile            |   1 +
 lib/librte_eal/include/rte_memory.h        |  88 ++++++++++++
 lib/librte_eal/linux/Makefile              |   1 +
 lib/librte_eal/linux/eal_memalloc.c        |   5 +-
 lib/librte_eal/rte_eal_version.map         |   6 +
 lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |   1 +
 10 files changed, 365 insertions(+), 64 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index cfcab63e9..a41e8ce5f 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,15 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
 #include <string.h>
 #include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -90,12 +90,9 @@ resize_and_map(int fd, void *addr, size_t len)
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -733,7 +730,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -754,9 +751,11 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		static const int flags = RTE_MAP_FORCE_ADDRESS |
+			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
+		void *new_data = rte_mem_map(data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
 					__func__, strerror(errno));
 			goto fail;
@@ -821,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -859,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -911,7 +910,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -939,8 +938,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -959,7 +957,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -994,8 +992,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_get_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1044,7 +1041,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4c897a13f..c6243aca1 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,7 +11,6 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
@@ -40,18 +39,10 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
-#ifdef RTE_EXEC_ENV_LINUX
-#define RTE_DONTDUMP MADV_DONTDUMP
-#elif defined RTE_EXEC_ENV_FREEBSD
-#define RTE_DONTDUMP MADV_NOCORE
-#else
-#error "madvise doesn't support this OS"
-#endif
-
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, int reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -59,9 +50,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_get_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -105,24 +94,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
 			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -132,20 +121,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -161,7 +147,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -179,19 +165,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	if (!unmap) {
 		/* Exclude these pages from a core dump. */
-		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
-			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
-				strerror(errno));
+		eal_mem_set_dump(aligned_addr, *size, false);
 	}
 
 	return aligned_addr;
@@ -547,10 +531,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	size_t page_size = rte_get_page_size();
+	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index cef73d6fe..a93850c09 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/**
+	 * Reserve hugepages. May be unsupported by some platforms.
+	 */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/**
+	 * Force reserving memory at the requested address.
+	 * This can be a destructive action depending on the implementation.
+	 *
+	 * @see RTE_MAP_FORCE_ADDRESS for description of possible consequences
+	 *      (although implementations are not required to use it).
+	 */
+	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to rte_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+		size_t page_sz, int flags, int reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -467,4 +486,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
 int
 eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address, which must be page-aligned.
+ *  The system might not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options, a combination of eal_mem_reserve_flags.
+ * @returns
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If *virt* and *size* describe a part of the reserved region,
+ * only this part of the region is freed (accurately up to the system
+ * page size). If *virt* points to allocated memory, *size* must match
+ * the one specified on allocation. The behavior is undefined
+ * if the memory pointed to by *virt* was obtained from a source
+ * other than those listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void
+eal_mem_free(void *virt, size_t size);
+
+/**
+ * Configure memory region inclusion into core dumps.
+ *
+ * @param virt
+ *  Starting address of the region.
+ * @param size
+ *  Size of the region.
+ * @param dump
+ *  True to include memory into core dumps, false to exclude.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index 4654ca2b3..f64a3994c 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 65374d53a..63ff0773d 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -82,6 +82,94 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * Memory protection flags.
+ */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/**
+ * Additional flags for memory mapping.
+ */
+enum rte_map_flags {
+	/** Changes to the mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/**
+	 * Force mapping to the requested address. This flag should be used
+	 * with caution, because to fulfill the request implementation
+	 * may remove all other mappings in the requested region. However,
+	 * it is not required to do so, thus mapping with this flag may fail.
+	 */
+	RTE_MAP_FORCE_ADDRESS = 1 << 3
+};
+
+/**
+ * Map a portion of an opened file or the page file into memory.
+ *
+ * This function is similar to POSIX mmap(3) with common MAP_ANONYMOUS
+ * extension, except for the return value.
+ *
+ * @param requested_addr
+ *  Desired virtual address for mapping. Can be NULL to let OS choose.
+ * @param size
+ *  Size of the mapping in bytes.
+ * @param prot
+ *  Protection flags, a combination of rte_mem_prot values.
+ * @param flags
+ *  Additional mapping flags, a combination of rte_map_flags.
+ * @param fd
+ *  Mapped file descriptor. Can be negative for anonymous mapping.
+ * @param offset
+ *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
+ * @return
+ *  Mapped address or NULL on failure and rte_errno is set to OS error.
+ */
+__rte_experimental
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_experimental
+int
+rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Page size in bytes.
+ */
+__rte_experimental
+size_t
+rte_get_page_size(void);
+
+/**
+ * Lock in physical memory all pages crossed by the address region.
+ *
+ * @param virt
+ *   Base virtual address of the region.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @see rte_get_page_size() to retrieve the page size.
+ * @see rte_mem_lock_page() to lock an entire single page.
+ */
+__rte_experimental
+int
+rte_mem_lock(const void *virt, size_t size);
+
 /**
  * Lock page in physical memory and prevent from swapping.
  *
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 4f39d462c..d314648cb 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index 2c717f8bd..bf29b83c6 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 mapped:
 	munmap(addr, alloc_sz);
 unmapped:
-	flags = MAP_FIXED;
+	flags = EAL_RESERVE_FORCE_ADDRESS;
 	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
 	if (new_addr != addr) {
 		if (new_addr != NULL)
@@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
-		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
+	eal_mem_set_dump(ms->addr, ms->len, false);
 
 	exit_early = false;
 
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d8038749a..dff51b13d 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -386,4 +386,10 @@ EXPERIMENTAL {
 	rte_trace_point_lookup;
 	rte_trace_regexp;
 	rte_trace_save;
+
+	# added in 20.08
+	rte_get_page_size;
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_unmap;
 };
diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
new file mode 100644
index 000000000..658595b6e
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix_memory.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+
+#include "eal_private.h"
+
+#ifdef RTE_EXEC_ENV_LINUX
+#define EAL_DONTDUMP MADV_DONTDUMP
+#define EAL_DODUMP   MADV_DODUMP
+#elif defined RTE_EXEC_ENV_FREEBSD
+#define EAL_DONTDUMP MADV_NOCORE
+#define EAL_DODUMP   MADV_CORE
+#else
+#error "madvise doesn't support this OS"
+#endif
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+#ifdef MAP_HUGETLB
+		sys_flags |= MAP_HUGETLB;
+#else
+		rte_errno = ENOTSUP;
+		return NULL;
+#endif
+	}
+
+	if (flags & EAL_RESERVE_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
+	int ret = madvise(virt, size, flags);
+	if (ret) {
+		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
+				virt, size, flags, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+static int
+mem_rte_to_sys_prot(int prot)
+{
+	int sys_prot = PROT_NONE;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	int sys_flags = 0;
+	int sys_prot;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+size_t
+rte_get_page_size(void)
+{
+	static size_t page_size;
+
+	if (!page_size)
+		page_size = sysconf(_SC_PAGESIZE);
+
+	return page_size;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	int ret = mlock(virt, size);
+	if (ret)
+		rte_errno = errno;
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index cfa1b4ef9..5734f26ad 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal_unix.c',
+	'eal_unix_memory.c',
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (2 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-28  7:31         ` Thomas Monjalon
                           ` (2 more replies)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
                         ` (7 subsequent siblings)
  11 siblings, 3 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson

All supported OSes create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
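
With these helpers, the no-huge setup in each OS collapses to a sketch
like the following (names from this patch, error handling omitted):

struct rte_memseg_list *msl = &mcfg->memsegs[0];

/* Create the backing fbarray, record page size and socket. */
eal_memseg_list_init_named(msl, "nohugemem", page_sz, n_segs, 0, true);

/* Reserve VA space; sets msl->base_va and msl->len. */
eal_memseg_list_alloc(msl, 0);

/* Mark each page-sized segment as used. */
eal_memseg_list_populate(msl, msl->base_va, n_segs);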

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c |  92 +++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  62 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       |  94 ++++--------------
 lib/librte_eal/linux/eal_memory.c         | 115 +++++-----------------
 4 files changed, 195 insertions(+), 168 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index c6243aca1..0ecadd817 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -24,6 +24,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -181,6 +182,97 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+		uint64_t page_sz, int n_segs, int socket_id, bool heap)
+{
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL,
+		"Memseg list allocated at socket %i, page size 0x%zxkB\n",
+		socket_id, (size_t)page_sz >> 10);
+
+	return 0;
+}
+
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+
+	return eal_memseg_list_init_named(
+		msl, name, page_sz, n_segs, socket_id, heap);
+}
+
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
+{
+	uint64_t page_sz;
+	size_t mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(
+		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
+	if (addr == NULL) {
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	RTE_LOG(DEBUG, EAL, "VA reserved for memseg list at %p, size %zx\n",
+			addr, mem_sz);
+
+	return 0;
+}
+
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
+{
+	uint64_t page_sz = msl->page_sz;
+	int i;
+
+	for (i = 0; i < n_segs; i++) {
+		struct rte_fbarray *arr = &msl->memseg_arr;
+		struct rte_memseg *ms = rte_fbarray_get(arr, i);
+
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			ms->iova = (uintptr_t)addr;
+		else
+			ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, i);
+
+		addr = RTE_PTR_ADD(addr, page_sz);
+	}
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index a93850c09..705a60e9c 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -254,6 +254,68 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int reserve_flags);
 
+/**
+ * Initialize a memory segment list and create its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param name
+ *  Name for the backing storage.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+	uint64_t page_sz, int n_segs, int socket_id, bool heap);
+
+/**
+ * Initialize memory segment list and create its backing storage
+ * with a name corresponding to MSL parameters.
+ *
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ *
+ * @see eal_memseg_list_init_named for remaining parameters description.
+ */
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
+/**
+ * Reserve VA space for a memory segment list
+ * previously initialized with eal_memseg_list_init().
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param reserve_flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
+
+/**
+ * Populate MSL, each segment is one page long.
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param addr
+ *  Starting address of list segments.
+ * @param n_segs
+ *  Number of segments to populate.
+ */
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5bc2da160..29c3ed5a9 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -66,53 +66,34 @@ rte_eal_hugepage_init(void)
 		struct rte_memseg_list *msl;
 		struct rte_fbarray *arr;
 		struct rte_memseg *ms;
-		uint64_t page_sz;
+		uint64_t mem_sz, page_sz;
 		int n_segs, cur_seg;
 
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-				sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
-		addr = mmap(NULL, internal_config.memory,
-				PROT_READ | PROT_WRITE,
+		addr = mmap(NULL, mem_sz, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (addr == MAP_FAILED) {
 			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
 					strerror(errno));
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->len = internal_config.memory;
-		msl->socket_id = 0;
-		msl->heap = 1;
-
-		/* populate memsegs. each memseg is 1 page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->len = page_sz;
-			ms->socket_id = 0;
+		msl->base_va = addr;
+		msl->len = mem_sz;
 
-			rte_fbarray_set_used(arr, cur_seg);
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			addr = RTE_PTR_ADD(addr, page_sz);
-		}
 		return 0;
 	}
 
@@ -336,64 +317,25 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
 	int flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
@@ -479,7 +421,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_init(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
@@ -487,7 +429,7 @@ memseg_primary_init(void)
 			total_type_mem = total_segs * hugepage_sz;
 			type_msl_idx++;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				return -1;
 			}
@@ -518,7 +460,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..8b5fe613e 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, 0);
 }
 
 /*
@@ -1009,12 +969,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_init(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_alloc(msl) < 0)
 				return -1;
 		}
 	}
@@ -1323,8 +1283,6 @@ eal_legacy_hugepage_init(void)
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
 	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	struct rte_fbarray *arr;
-	struct rte_memseg *ms;
 
 	uint64_t memory[RTE_MAX_NUMA_NODES];
 
@@ -1343,7 +1301,7 @@ eal_legacy_hugepage_init(void)
 		void *prealloc_addr;
 		size_t mem_sz;
 		struct rte_memseg_list *msl;
-		int n_segs, cur_seg, fd, flags;
+		int n_segs, fd, flags;
 #ifdef MEMFD_SUPPORTED
 		int memfd;
 #endif
@@ -1358,12 +1316,12 @@ eal_legacy_hugepage_init(void)
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-					sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
@@ -1400,16 +1358,10 @@ eal_legacy_hugepage_init(void)
 		/* preallocate address space for the memory, so that it can be
 		 * fit into the DMA mask.
 		 */
-		mem_sz = internal_config.memory;
-		prealloc_addr = eal_get_virtual_area(
-				NULL, &mem_sz, page_sz, 0, 0);
-		if (prealloc_addr == NULL) {
-			RTE_LOG(ERR, EAL,
-					"%s: reserving memory area failed: "
-					"%s\n",
-					__func__, strerror(errno));
+		if (eal_memseg_list_alloc(msl, 0))
 			return -1;
-		}
+
+		prealloc_addr = msl->base_va;
 		addr = mmap(prealloc_addr, mem_sz, PROT_READ | PROT_WRITE,
 				flags | MAP_FIXED, fd, 0);
 		if (addr == MAP_FAILED || addr != prealloc_addr) {
@@ -1418,11 +1370,6 @@ eal_legacy_hugepage_init(void)
 			munmap(prealloc_addr, mem_sz);
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->socket_id = 0;
-		msl->len = mem_sz;
-		msl->heap = 1;
 
 		/* we're in single-file segments mode, so only the segment list
 		 * fd needs to be set up.
@@ -1434,24 +1381,8 @@ eal_legacy_hugepage_init(void)
 			}
 		}
 
-		/* populate memsegs. each memseg is one page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->socket_id = 0;
-			ms->len = page_sz;
-
-			rte_fbarray_set_used(arr, cur_seg);
-
-			addr = RTE_PTR_ADD(addr, (size_t)page_sz);
-		}
 		if (mcfg->dma_maskbits &&
 		    rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
 			RTE_LOG(ERR, EAL,
@@ -2191,7 +2122,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_init(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2131,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_alloc(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2326,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_init(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2364,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (3 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-28  8:34         ` Thomas Monjalon
  2020-05-28 12:21         ` Burakov, Anatoly
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 06/11] trace: add size_t field emitter Dmitry Kozlyuk
                         ` (6 subsequent siblings)
  11 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov

Code in the Linux EAL that supports dynamic memory allocation (as
opposed to the static allocation used by FreeBSD) is not OS-dependent
and can be reused by the Windows EAL. Move such code to a file compiled
only for the OSes that require it.
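
For illustration, a minimal sketch of how an OS-specific EAL is
expected to plug in these helpers, mirroring the Linux changes below
(function names are taken from this patch; the surrounding init flow is
the existing one):

    /* OS-specific eal_memory.c: defer to the common dynamic-memory code. */
    static int
    memseg_primary_init(void)
    {
            /* Carve VA space into memseg lists per memory type. */
            return eal_dynmem_memseg_lists_init();
    }

    int
    rte_eal_hugepage_init(void)
    {
            /* Legacy mode keeps its OS-specific path. */
            return internal_config.legacy_mem ?
                    eal_legacy_hugepage_init() :
                    eal_dynmem_hugepage_init();
    }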

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_dynmem.c | 498 ++++++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  43 +-
 lib/librte_eal/common/meson.build         |   4 +
 lib/librte_eal/linux/eal_memory.c         | 494 +--------------------
 4 files changed, 547 insertions(+), 492 deletions(-)
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c

diff --git a/lib/librte_eal/common/eal_common_dynmem.c b/lib/librte_eal/common/eal_common_dynmem.c
new file mode 100644
index 000000000..5073c1d7d
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_dynmem.c
@@ -0,0 +1,498 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation.
+ * Copyright(c) 2013 6WIND S.A.
+ */
+
+#include <inttypes.h>
+#include <string.h>
+
+#include <rte_log.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+
+/** @file Functions common to EALs that support dynamic memory allocation. */
+
+int
+eal_dynmem_memseg_lists_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
+			/* we can still sort pages by socket in legacy mode */
+			if (!internal_config.legacy_mem && socket_id > 0)
+				break;
+#endif
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how much segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (eal_memseg_list_init(msl, pagesz, n_segs,
+					socket_id, cur_seglist, true))
+				goto out;
+
+			if (eal_memseg_list_alloc(msl, 0)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int __rte_unused
+hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct hugepage_info *hpi = arg;
+
+	if (msl->page_sz != hpi->hugepage_sz)
+		return 0;
+
+	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
+	return 0;
+}
+
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+int
+eal_dynmem_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+#ifndef RTE_ARCH_64
+		struct hugepage_info dummy;
+		unsigned int i;
+#endif
+		/* also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+
+#ifndef RTE_ARCH_64
+		/* for 32-bit, limit number of pages on socket to whatever we've
+		 * preallocated, as we cannot allocate more.
+		 */
+		memset(&dummy, 0, sizeof(dummy));
+		dummy.hugepage_sz = hpi->hugepage_sz;
+		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
+			return -1;
+
+		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
+			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
+					dummy.num_pages[i]);
+		}
+#endif
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
+		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
+
+	/* calculate final number of pages */
+	if (eal_dynmem_calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M "
+				"on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+__rte_unused /* function is unused on 32-bit builds */
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+int
+eal_dynmem_calc_num_pages_per_socket(
+	uint64_t *memory, struct hugepage_info *hp_info,
+	struct hugepage_info *hp_used, unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from cpu mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skips if the memory on specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			strncpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so lets see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket*/
+			}
+		}
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int) (internal_config.memory / 0x100000);
+		available = requested - (unsigned int) (total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 705a60e9c..5f8cd254b 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -13,6 +13,8 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+
 /**
  * Structure storing internal configuration (per-lcore)
  */
@@ -316,6 +318,45 @@ eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
 void
 eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
 
+/**
+ * Distribute available memory between MSLs.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_memseg_lists_init(void);
+
+/**
+ * Preallocate hugepages for dynamic allocation.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_hugepage_init(void);
+
+/**
+ * Given the list of hugepage sizes and the number of pages thereof,
+ * calculate the best number of pages of each size to fulfill the request
+ * for RAM on each NUMA node.
+ *
+ * @param memory
+ *  Amounts of memory requested for each NUMA node of RTE_MAX_NUMA_NODES.
+ * @param hp_info
+ *  Information about hugepages of different size.
+ * @param hp_used
+ *  Receives information about used hugepages of each size.
+ * @param num_hp_info
+ *  Number of elements in hp_info and hp_used.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_calc_num_pages_per_socket(
+		uint64_t *memory, struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used, unsigned int num_hp_info);
+
 /**
  * Get cpu core_id.
  *
@@ -569,7 +610,7 @@ void *
 eal_mem_reserve(void *requested_addr, size_t size, int flags);
 
 /**
- * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ * Free memory obtained by eal_mem_reserve() and possibly allocated.
  *
  * If *virt* and *size* describe a part of the reserved region,
  * only this part of the region is freed (accurately up to the system
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 55aaeb18e..d91c22220 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -56,3 +56,7 @@ sources += files(
 	'rte_reciprocal.c',
 	'rte_service.c',
 )
+
+if is_linux
+	sources += files('eal_common_dynmem.c')
+endif
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 8b5fe613e..9f5005503 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -1045,182 +1045,6 @@ remap_needed_hugepages(struct hugepage_file *hugepages, int n_pages)
 	return 0;
 }
 
-__rte_unused /* function is unused on 32-bit builds */
-static inline uint64_t
-get_socket_mem_size(int socket)
-{
-	uint64_t size = 0;
-	unsigned i;
-
-	for (i = 0; i < internal_config.num_hugepage_sizes; i++){
-		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		size += hpi->hugepage_sz * hpi->num_pages[socket];
-	}
-
-	return size;
-}
-
-/*
- * This function is a NUMA-aware equivalent of calc_num_pages.
- * It takes in the list of hugepage sizes and the
- * number of pages thereof, and calculates the best number of
- * pages of each size to fulfill the request for <memory> ram
- */
-static int
-calc_num_pages_per_socket(uint64_t * memory,
-		struct hugepage_info *hp_info,
-		struct hugepage_info *hp_used,
-		unsigned num_hp_info)
-{
-	unsigned socket, j, i = 0;
-	unsigned requested, available;
-	int total_num_pages = 0;
-	uint64_t remaining_mem, cur_mem;
-	uint64_t total_mem = internal_config.memory;
-
-	if (num_hp_info == 0)
-		return -1;
-
-	/* if specific memory amounts per socket weren't requested */
-	if (internal_config.force_sockets == 0) {
-		size_t total_size;
-#ifdef RTE_ARCH_64
-		int cpu_per_socket[RTE_MAX_NUMA_NODES];
-		size_t default_size;
-		unsigned lcore_id;
-
-		/* Compute number of cores per socket */
-		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
-		RTE_LCORE_FOREACH(lcore_id) {
-			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
-		}
-
-		/*
-		 * Automatically spread requested memory amongst detected sockets according
-		 * to number of cores from cpu mask present on each socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-
-			/* Set memory amount per socket */
-			default_size = (internal_config.memory * cpu_per_socket[socket])
-					/ rte_lcore_count();
-
-			/* Limit to maximum available memory on socket */
-			default_size = RTE_MIN(default_size, get_socket_mem_size(socket));
-
-			/* Update sizes */
-			memory[socket] = default_size;
-			total_size -= default_size;
-		}
-
-		/*
-		 * If some memory is remaining, try to allocate it by getting all
-		 * available memory from sockets, one after the other
-		 */
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-			/* take whatever is available */
-			default_size = RTE_MIN(get_socket_mem_size(socket) - memory[socket],
-					       total_size);
-
-			/* Update sizes */
-			memory[socket] += default_size;
-			total_size -= default_size;
-		}
-#else
-		/* in 32-bit mode, allocate all of the memory only on master
-		 * lcore socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
-				socket++) {
-			struct rte_config *cfg = rte_eal_get_configuration();
-			unsigned int master_lcore_socket;
-
-			master_lcore_socket =
-				rte_lcore_to_socket_id(cfg->master_lcore);
-
-			if (master_lcore_socket != socket)
-				continue;
-
-			/* Update sizes */
-			memory[socket] = total_size;
-			break;
-		}
-#endif
-	}
-
-	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0; socket++) {
-		/* skips if the memory on specific socket wasn't requested */
-		for (i = 0; i < num_hp_info && memory[socket] != 0; i++){
-			strlcpy(hp_used[i].hugedir, hp_info[i].hugedir,
-				sizeof(hp_used[i].hugedir));
-			hp_used[i].num_pages[socket] = RTE_MIN(
-					memory[socket] / hp_info[i].hugepage_sz,
-					hp_info[i].num_pages[socket]);
-
-			cur_mem = hp_used[i].num_pages[socket] *
-					hp_used[i].hugepage_sz;
-
-			memory[socket] -= cur_mem;
-			total_mem -= cur_mem;
-
-			total_num_pages += hp_used[i].num_pages[socket];
-
-			/* check if we have met all memory requests */
-			if (memory[socket] == 0)
-				break;
-
-			/* check if we have any more pages left at this size, if so
-			 * move on to next size */
-			if (hp_used[i].num_pages[socket] == hp_info[i].num_pages[socket])
-				continue;
-			/* At this point we know that there are more pages available that are
-			 * bigger than the memory we want, so lets see if we can get enough
-			 * from other page sizes.
-			 */
-			remaining_mem = 0;
-			for (j = i+1; j < num_hp_info; j++)
-				remaining_mem += hp_info[j].hugepage_sz *
-				hp_info[j].num_pages[socket];
-
-			/* is there enough other memory, if not allocate another page and quit */
-			if (remaining_mem < memory[socket]){
-				cur_mem = RTE_MIN(memory[socket],
-						hp_info[i].hugepage_sz);
-				memory[socket] -= cur_mem;
-				total_mem -= cur_mem;
-				hp_used[i].num_pages[socket]++;
-				total_num_pages++;
-				break; /* we are done with this socket*/
-			}
-		}
-		/* if we didn't satisfy all memory requirements per socket */
-		if (memory[socket] > 0 &&
-				internal_config.socket_mem[socket] != 0) {
-			/* to prevent icc errors */
-			requested = (unsigned) (internal_config.socket_mem[socket] /
-					0x100000);
-			available = requested -
-					((unsigned) (memory[socket] / 0x100000));
-			RTE_LOG(ERR, EAL, "Not enough memory available on socket %u! "
-					"Requested: %uMB, available: %uMB\n", socket,
-					requested, available);
-			return -1;
-		}
-	}
-
-	/* if we didn't satisfy total memory requirements */
-	if (total_mem > 0) {
-		requested = (unsigned) (internal_config.memory / 0x100000);
-		available = requested - (unsigned) (total_mem / 0x100000);
-		RTE_LOG(ERR, EAL, "Not enough memory available! Requested: %uMB,"
-				" available: %uMB\n", requested, available);
-		return -1;
-	}
-	return total_num_pages;
-}
-
 static inline size_t
 eal_get_hugepage_mem_size(void)
 {
@@ -1524,7 +1348,7 @@ eal_legacy_hugepage_init(void)
 		memory[i] = internal_config.socket_mem[i];
 
 	/* calculate final number of pages */
-	nr_hugepages = calc_num_pages_per_socket(memory,
+	nr_hugepages = eal_dynmem_calc_num_pages_per_socket(memory,
 			internal_config.hugepage_info, used_hp,
 			internal_config.num_hugepage_sizes);
 
@@ -1651,140 +1475,6 @@ eal_legacy_hugepage_init(void)
 	return -1;
 }
 
-static int __rte_unused
-hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
-{
-	struct hugepage_info *hpi = arg;
-
-	if (msl->page_sz != hpi->hugepage_sz)
-		return 0;
-
-	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
-	return 0;
-}
-
-static int
-limits_callback(int socket_id, size_t cur_limit, size_t new_len)
-{
-	RTE_SET_USED(socket_id);
-	RTE_SET_USED(cur_limit);
-	RTE_SET_USED(new_len);
-	return -1;
-}
-
-static int
-eal_hugepage_init(void)
-{
-	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	uint64_t memory[RTE_MAX_NUMA_NODES];
-	int hp_sz_idx, socket_id;
-
-	memset(used_hp, 0, sizeof(used_hp));
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-#ifndef RTE_ARCH_64
-		struct hugepage_info dummy;
-		unsigned int i;
-#endif
-		/* also initialize used_hp hugepage sizes in used_hp */
-		struct hugepage_info *hpi;
-		hpi = &internal_config.hugepage_info[hp_sz_idx];
-		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
-
-#ifndef RTE_ARCH_64
-		/* for 32-bit, limit number of pages on socket to whatever we've
-		 * preallocated, as we cannot allocate more.
-		 */
-		memset(&dummy, 0, sizeof(dummy));
-		dummy.hugepage_sz = hpi->hugepage_sz;
-		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
-			return -1;
-
-		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
-			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
-					dummy.num_pages[i]);
-		}
-#endif
-	}
-
-	/* make a copy of socket_mem, needed for balanced allocation. */
-	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
-		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
-
-	/* calculate final number of pages */
-	if (calc_num_pages_per_socket(memory,
-			internal_config.hugepage_info, used_hp,
-			internal_config.num_hugepage_sizes) < 0)
-		return -1;
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
-				socket_id++) {
-			struct rte_memseg **pages;
-			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
-			unsigned int num_pages = hpi->num_pages[socket_id];
-			unsigned int num_pages_alloc;
-
-			if (num_pages == 0)
-				continue;
-
-			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %" PRIu64 "M on socket %i\n",
-				num_pages, hpi->hugepage_sz >> 20, socket_id);
-
-			/* we may not be able to allocate all pages in one go,
-			 * because we break up our memory map into multiple
-			 * memseg lists. therefore, try allocating multiple
-			 * times and see if we can get the desired number of
-			 * pages from multiple allocations.
-			 */
-
-			num_pages_alloc = 0;
-			do {
-				int i, cur_pages, needed;
-
-				needed = num_pages - num_pages_alloc;
-
-				pages = malloc(sizeof(*pages) * needed);
-
-				/* do not request exact number of pages */
-				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
-						needed, hpi->hugepage_sz,
-						socket_id, false);
-				if (cur_pages <= 0) {
-					free(pages);
-					return -1;
-				}
-
-				/* mark preallocated pages as unfreeable */
-				for (i = 0; i < cur_pages; i++) {
-					struct rte_memseg *ms = pages[i];
-					ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
-				}
-				free(pages);
-
-				num_pages_alloc += cur_pages;
-			} while (num_pages_alloc != num_pages);
-		}
-	}
-	/* if socket limits were specified, set them */
-	if (internal_config.force_socket_limits) {
-		unsigned int i;
-		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-			uint64_t limit = internal_config.socket_limit[i];
-			if (limit == 0)
-				continue;
-			if (rte_mem_alloc_validator_register("socket-limit",
-					limits_callback, i, limit))
-				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
-		}
-	}
-	return 0;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -1943,7 +1633,7 @@ rte_eal_hugepage_init(void)
 {
 	return internal_config.legacy_mem ?
 			eal_legacy_hugepage_init() :
-			eal_hugepage_init();
+			eal_dynmem_hugepage_init();
 }
 
 int
@@ -2162,185 +1852,7 @@ memseg_primary_init_32(void)
 static int __rte_unused
 memseg_primary_init(void)
 {
-	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	struct memtype {
-		uint64_t page_sz;
-		int socket_id;
-	} *memtypes = NULL;
-	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
-	struct rte_memseg_list *msl;
-	uint64_t max_mem, max_mem_per_type;
-	unsigned int max_seglists_per_type;
-	unsigned int n_memtypes, cur_type;
-
-	/* no-huge does not need this at all */
-	if (internal_config.no_hugetlbfs)
-		return 0;
-
-	/*
-	 * figuring out amount of memory we're going to have is a long and very
-	 * involved process. the basic element we're operating with is a memory
-	 * type, defined as a combination of NUMA node ID and page size (so that
-	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
-	 *
-	 * deciding amount of memory going towards each memory type is a
-	 * balancing act between maximum segments per type, maximum memory per
-	 * type, and number of detected NUMA nodes. the goal is to make sure
-	 * each memory type gets at least one memseg list.
-	 *
-	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
-	 *
-	 * the total amount of memory per type is limited by either
-	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
-	 * of detected NUMA nodes. additionally, maximum number of segments per
-	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
-	 * smaller page sizes, it can take hundreds of thousands of segments to
-	 * reach the above specified per-type memory limits.
-	 *
-	 * additionally, each type may have multiple memseg lists associated
-	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
-	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
-	 *
-	 * the number of memseg lists per type is decided based on the above
-	 * limits, and also taking number of detected NUMA nodes, to make sure
-	 * that we don't run out of memseg lists before we populate all NUMA
-	 * nodes with memory.
-	 *
-	 * we do this in three stages. first, we collect the number of types.
-	 * then, we figure out memory constraints and populate the list of
-	 * would-be memseg lists. then, we go ahead and allocate the memseg
-	 * lists.
-	 */
-
-	/* create space for mem types */
-	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
-	memtypes = calloc(n_memtypes, sizeof(*memtypes));
-	if (memtypes == NULL) {
-		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
-		return -1;
-	}
-
-	/* populate mem types */
-	cur_type = 0;
-	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
-			hpi_idx++) {
-		struct hugepage_info *hpi;
-		uint64_t hugepage_sz;
-
-		hpi = &internal_config.hugepage_info[hpi_idx];
-		hugepage_sz = hpi->hugepage_sz;
-
-		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
-			int socket_id = rte_socket_id_by_idx(i);
-
-#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
-			/* we can still sort pages by socket in legacy mode */
-			if (!internal_config.legacy_mem && socket_id > 0)
-				break;
-#endif
-			memtypes[cur_type].page_sz = hugepage_sz;
-			memtypes[cur_type].socket_id = socket_id;
-
-			RTE_LOG(DEBUG, EAL, "Detected memory type: "
-				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
-				socket_id, hugepage_sz);
-		}
-	}
-	/* number of memtypes could have been lower due to no NUMA support */
-	n_memtypes = cur_type;
-
-	/* set up limits for types */
-	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
-	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
-			max_mem / n_memtypes);
-	/*
-	 * limit maximum number of segment lists per type to ensure there's
-	 * space for memseg lists for all NUMA nodes with all page sizes
-	 */
-	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
-
-	if (max_seglists_per_type == 0) {
-		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
-			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-		goto out;
-	}
-
-	/* go through all mem types and create segment lists */
-	msl_idx = 0;
-	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
-		unsigned int cur_seglist, n_seglists, n_segs;
-		unsigned int max_segs_per_type, max_segs_per_list;
-		struct memtype *type = &memtypes[cur_type];
-		uint64_t max_mem_per_list, pagesz;
-		int socket_id;
-
-		pagesz = type->page_sz;
-		socket_id = type->socket_id;
-
-		/*
-		 * we need to create segment lists for this type. we must take
-		 * into account the following things:
-		 *
-		 * 1. total amount of memory we can use for this memory type
-		 * 2. total amount of memory per memseg list allowed
-		 * 3. number of segments needed to fit the amount of memory
-		 * 4. number of segments allowed per type
-		 * 5. number of segments allowed per memseg list
-		 * 6. number of memseg lists we are allowed to take up
-		 */
-
-		/* calculate how much segments we will need in total */
-		max_segs_per_type = max_mem_per_type / pagesz;
-		/* limit number of segments to maximum allowed per type */
-		max_segs_per_type = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
-		/* limit number of segments to maximum allowed per list */
-		max_segs_per_list = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
-
-		/* calculate how much memory we can have per segment list */
-		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
-				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
-
-		/* calculate how many segments each segment list will have */
-		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
-
-		/* calculate how many segment lists we can have */
-		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
-				max_mem_per_type / max_mem_per_list);
-
-		/* limit number of segment lists according to our maximum */
-		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
-
-		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
-				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
-			n_seglists, n_segs, socket_id, pagesz);
-
-		/* create all segment lists */
-		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
-			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
-				RTE_LOG(ERR, EAL,
-					"No more space in memseg lists, please increase %s\n",
-					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-				goto out;
-			}
-			msl = &mcfg->memsegs[msl_idx++];
-
-			if (memseg_list_init(msl, pagesz, n_segs,
-					socket_id, cur_seglist))
-				goto out;
-
-			if (memseg_list_alloc(msl)) {
-				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
-				goto out;
-			}
-		}
-	}
-	/* we're successful */
-	ret = 0;
-out:
-	free(memtypes);
-	return ret;
+	return eal_dynmem_memseg_lists_init();
 }
 
 static int
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 06/11] trace: add size_t field emitter
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (4 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-25  5:53         ` Jerin Jacob
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
                         ` (5 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, Sunil Kumar Kori,
	Olivier Matz, Andrew Rybchenko

It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
sizeof(long) == 4 and sizeof(size_t) == 8 for 64-bit programs.
Tracepoints using the "long" field emitter are therefore invalid there.
Add a dedicated field emitter for size_t and use it to store size_t
values in all existing tracepoints.
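
As a minimal illustration (hypothetical tracepoint, not part of this
patch): on an LLP64 target such as 64-bit Windows, emitting a size_t
through the "long" emitter would record only 4 of its 8 bytes.

    #include <rte_trace_point.h>

    /* Hypothetical tracepoint taking a size_t argument. */
    RTE_TRACE_POINT(
            app_trace_buf_alloc,
            RTE_TRACE_POINT_ARGS(size_t size, void *ptr),
            /* rte_trace_point_emit_long(size) would truncate the value
             * to sizeof(long) == 4 bytes on 64-bit Windows.
             */
            rte_trace_point_emit_size_t(size); /* full 8 bytes */
            rte_trace_point_emit_ptr(ptr);
    )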

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/include/rte_eal_trace.h   |  8 ++++----
 lib/librte_eal/include/rte_trace_point.h |  1 +
 lib/librte_mempool/rte_mempool_trace.h   | 10 +++++-----
 3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/include/rte_eal_trace.h b/lib/librte_eal/include/rte_eal_trace.h
index 1ebb2905a..bcfef0cfa 100644
--- a/lib/librte_eal/include/rte_eal_trace.h
+++ b/lib/librte_eal/include/rte_eal_trace.h
@@ -143,7 +143,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -154,7 +154,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -164,7 +164,7 @@ RTE_TRACE_POINT(
 	rte_eal_trace_mem_realloc,
 	RTE_TRACE_POINT_ARGS(size_t size, unsigned int align, int socket,
 		void *ptr),
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -183,7 +183,7 @@ RTE_TRACE_POINT(
 		unsigned int flags, unsigned int align, unsigned int bound,
 		const void *mz),
 	rte_trace_point_emit_string(name);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_int(socket_id);
 	rte_trace_point_emit_u32(flags);
 	rte_trace_point_emit_u32(align);
diff --git a/lib/librte_eal/include/rte_trace_point.h b/lib/librte_eal/include/rte_trace_point.h
index b45171275..3b513bd43 100644
--- a/lib/librte_eal/include/rte_trace_point.h
+++ b/lib/librte_eal/include/rte_trace_point.h
@@ -395,6 +395,7 @@ do { \
 #define rte_trace_point_emit_i8(in) __rte_trace_point_emit(in, int8_t)
 #define rte_trace_point_emit_int(in) __rte_trace_point_emit(in, int32_t)
 #define rte_trace_point_emit_long(in) __rte_trace_point_emit(in, long)
+#define rte_trace_point_emit_size_t(in) __rte_trace_point_emit(in, size_t)
 #define rte_trace_point_emit_float(in) __rte_trace_point_emit(in, float)
 #define rte_trace_point_emit_double(in) __rte_trace_point_emit(in, double)
 #define rte_trace_point_emit_ptr(in) __rte_trace_point_emit(in, uintptr_t)
diff --git a/lib/librte_mempool/rte_mempool_trace.h b/lib/librte_mempool/rte_mempool_trace.h
index e776df0a6..087c913c8 100644
--- a/lib/librte_mempool/rte_mempool_trace.h
+++ b/lib/librte_mempool/rte_mempool_trace.h
@@ -72,7 +72,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -84,8 +84,8 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(addr);
-	rte_trace_point_emit_long(len);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(len);
+	rte_trace_point_emit_size_t(pg_sz);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -126,7 +126,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(struct rte_mempool *mempool, size_t pg_sz),
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(pg_sz);
 )
 
 RTE_TRACE_POINT(
@@ -139,7 +139,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_u32(max_objs);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(obj_cb);
 	rte_trace_point_emit_ptr(obj_cb_arg);
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 07/11] eal/windows: add tracing support stubs
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (5 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 06/11] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                         ` (4 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

EAL common code depends on tracepoint calls, but the generic
implementation cannot be enabled on Windows due to missing standard
library facilities. Add stub functions to support tracepoint
compilation, so that common code does not have to conditionally include
tracepoints until proper support is added.
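
With the stubs in place, a sketch of the effective behavior on Windows
(stub names are from this patch; the caller is hypothetical):

    #include <rte_trace_point.h>

    static rte_trace_point_t dummy_tp;

    static void
    dummy_register(void)
    {
    }

    static void
    check_stubs(void)
    {
            /* Compiles and links, but tracing reports as unsupported. */
            int ret = __rte_trace_point_register(&dummy_tp, "app.dummy",
                            dummy_register);
            /* ret == -ENOTSUP until eal_common_trace.c is ported. */
            (void)ret;

            /* No-op on Windows; called from rte_thread_init(). */
            __rte_trace_mem_per_thread_alloc();
    }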

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_thread.c |  5 +---
 lib/librte_eal/common/meson.build         |  1 +
 lib/librte_eal/windows/eal.c              | 34 ++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_thread.c b/lib/librte_eal/common/eal_common_thread.c
index f9f588c17..370bb1b63 100644
--- a/lib/librte_eal/common/eal_common_thread.c
+++ b/lib/librte_eal/common/eal_common_thread.c
@@ -15,9 +15,7 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 #include <rte_log.h>
-#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_trace_point.h>
-#endif
 
 #include "eal_internal_cfg.h"
 #include "eal_private.h"
@@ -169,9 +167,8 @@ static void *rte_thread_init(void *arg)
 		free(params);
 	}
 
-#ifndef RTE_EXEC_ENV_WINDOWS
 	__rte_trace_mem_per_thread_alloc();
-#endif
+
 	return start_routine(routine_arg);
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index d91c22220..4e9208129 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -14,6 +14,7 @@ if is_windows
 		'eal_common_log.c',
 		'eal_common_options.c',
 		'eal_common_thread.c',
+		'eal_common_trace_points.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index d084606a6..e7461f731 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -17,6 +17,7 @@
 #include <eal_filesystem.h>
 #include <eal_options.h>
 #include <eal_private.h>
+#include <rte_trace_point.h>
 
 #include "eal_windows.h"
 
@@ -221,7 +222,38 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
- /* Launch threads, called at application init(). */
+/* Stubs to enable EAL trace point compilation
+ * until eal_common_trace.c can be compiled.
+ */
+
+RTE_DEFINE_PER_LCORE(volatile int, trace_point_sz);
+RTE_DEFINE_PER_LCORE(void *, trace_mem);
+
+void
+__rte_trace_mem_per_thread_alloc(void)
+{
+}
+
+void
+__rte_trace_point_emit_field(size_t sz, const char *field,
+	const char *type)
+{
+	RTE_SET_USED(sz);
+	RTE_SET_USED(field);
+	RTE_SET_USED(type);
+}
+
+int
+__rte_trace_point_register(rte_trace_point_t *trace, const char *name,
+	void (*register_fn)(void))
+{
+	RTE_SET_USED(trace);
+	RTE_SET_USED(name);
+	RTE_SET_USED(register_fn);
+	return -ENOTSUP;
+}
+
+/* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (6 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                         ` (3 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

The limited version imported previously lacks at least the SLIST
macros. Import the complete file from FreeBSD, since its license
exception is already approved by the Technical Board.
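
For reference, a minimal sketch of the newly available SLIST macros
(standard queue(3) usage, not DPDK-specific):

    #include <sys/queue.h>

    struct entry {
            int value;
            SLIST_ENTRY(entry) next;        /* embedded forward link */
    };

    SLIST_HEAD(entry_list, entry);

    static void
    example(void)
    {
            struct entry_list head = SLIST_HEAD_INITIALIZER(head);
            struct entry e = { .value = 42 };
            struct entry *it;

            SLIST_INSERT_HEAD(&head, &e, next);
            SLIST_FOREACH(it, &head, next)
                    (void)it->value;        /* visit each element */
            SLIST_REMOVE_HEAD(&head, next);
    }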

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread
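
For readers less familiar with the BSD queue macros imported by the patch
above, here is a minimal, self-contained usage sketch (hypothetical element
type, not part of the patch); it exercises only macros defined in the
imported header:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct item {
    	int value;
    	TAILQ_ENTRY(item) next;	/* embeds the tqe_next/tqe_prev links */
    };

    TAILQ_HEAD(item_list, item);

    int main(void)
    {
    	struct item_list list = TAILQ_HEAD_INITIALIZER(list);
    	struct item *it, *tmp;
    	int i;

    	for (i = 0; i < 3; i++) {
    		it = malloc(sizeof(*it));
    		it->value = i;
    		TAILQ_INSERT_TAIL(&list, it, next);
    	}

    	TAILQ_FOREACH(it, &list, next)
    		printf("%d\n", it->value);

    	/* The _SAFE variant permits removal while iterating. */
    	TAILQ_FOREACH_SAFE(it, &list, next, tmp) {
    		TAILQ_REMOVE(&list, it, next);
    		free(it);
    	}
    	return 0;
    }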

* [dpdk-dev] [PATCH v5 09/11] eal/windows: improve CPU and NUMA node detection
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (7 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
                         ` (2 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add an EAL-private function to map a DPDK socket ID to an OS NUMA node
   number.

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
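A note on the Win32 idiom used below: GetLogicalProcessorInformationEx()
follows the common query-size-then-fill pattern, where the first call fails
with ERROR_INSUFFICIENT_BUFFER and reports the required buffer size. A
standalone sketch of that pattern (error handling simplified, not part of
the patch):

    #include <windows.h>
    #include <stdlib.h>

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *
    query_numa_info(DWORD *size)
    {
    	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos;

    	*size = 0;
    	/* First call fails by design and fills *size. */
    	if (!GetLogicalProcessorInformationEx(RelationNumaNode, NULL, size) &&
    			GetLastError() != ERROR_INSUFFICIENT_BUFFER)
    		return NULL;

    	infos = malloc(*size);
    	if (infos == NULL)
    		return NULL;

    	if (!GetLogicalProcessorInformationEx(RelationNumaNode, infos, size)) {
    		free(infos);
    		return NULL;
    	}
    	/* Caller iterates the variable-size records via info->Size,
    	 * exactly as the patch does. */
    	return infos;
    }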
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				GetLastError());
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e. g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited by 32/64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 10/11] eal/windows: initialize hugepage info
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (8 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic

Add hugepage discovery ("large pages" in Windows terminology)
and update the documentation for the required privilege setup.
Only 2MB hugepages are supported, and their number is estimated
roughly, because suitable OS APIs are either missing or unstable.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
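For context, the large-page size query this patch builds on is a single
Win32 call; a standalone sketch (not part of the patch) to check support
on a given machine:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
    	/* Returns 0 if the OS or processor does not support large pages. */
    	SIZE_T sz = GetLargePageMinimum();

    	if (sz == 0)
    		fprintf(stderr, "large pages are not supported\n");
    	else
    		printf("minimum large page size: %zu bytes\n", sz);
    	return 0;
    }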
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 7 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/config/meson.build b/config/meson.build
index 43ab11310..c1e80de4b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -268,6 +268,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminolocy) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. Privilege is applied upon next logon. In particular, if privilege has been
+   granted to current user, a logoff is required before it is available.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e7461f731..7c2fcc860 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -19,8 +19,11 @@
 #include <eal_private.h>
 #include <rte_trace_point.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -276,6 +279,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..61d0dcd3c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available on Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem on Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index adfc8b9b7..52978e9d7 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
 	'eal_thread.c',
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v5 11/11] eal/windows: implement basic memory management
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (9 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-05-25  0:37       ` Dmitry Kozlyuk
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-25  0:37 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

Basic memory management supports core libraries and PMDs operating in
IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated in user mode. Multi-process mode is not
implemented and is forcibly disabled at startup.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
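To illustrate the user-visible effect, a minimal sketch of what an
application can do once this series is applied (assuming EAL runs in
IOVA as PA mode with the virt2phys driver loaded; without the driver
the query yields RTE_BAD_IOVA):

    #include <inttypes.h>
    #include <stdio.h>

    #include <rte_eal.h>
    #include <rte_malloc.h>

    int main(int argc, char **argv)
    {
    	void *buf;

    	if (rte_eal_init(argc, argv) < 0)
    		return 1;

    	buf = rte_malloc(NULL, 4096, 4096);
    	if (buf != NULL) {
    		/* In IOVA as PA mode this is the physical address
    		 * obtained through the virt2phys driver. */
    		printf("va %p iova 0x%" PRIx64 "\n",
    			buf, rte_malloc_virt2iova(buf));
    		rte_free(buf);
    	}
    	return 0;
    }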
 config/meson.build                            |  12 +-
 doc/guides/windows_gsg/run_apps.rst           |  54 +-
 lib/librte_eal/common/meson.build             |  11 +
 lib/librte_eal/common/rte_malloc.c            |   9 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/windows/eal.c                  | 146 +++-
 lib/librte_eal/windows/eal_memalloc.c         | 442 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  75 ++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |   9 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   5 +
 16 files changed, 1728 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/config/meson.build b/config/meson.build
index c1e80de4b..d3f05f878 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -261,15 +261,21 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it by advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..78e5a614f 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to documentation in ``dpdk-kmods`` repository for details on system
+setup, driver build and installation. This driver is not signed, so signature
+checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+Compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from Elevated Command Prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+On Windows Server additional steps are required:
+
+1. From Device Manager, Action menu, select "Add legacy hardware".
+2. It will launch the "Add Hardware Wizard". Click "Next".
+3. Select second option "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, "Kernel bypass" will be shown as a device class.
+5. Select it, and click "Next".
+6. The previously installed drivers will now be installed for the
+   "Virtual to physical address translator" device.
+
+When loaded successfully, the driver is shown in *Device Manager* as *Virtual
+to physical address translator* device under *Kernel bypass* category.
+Installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 4e9208129..310844269 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -8,13 +8,24 @@ if is_windows
 		'eal_common_bus.c',
 		'eal_common_class.c',
 		'eal_common_devargs.c',
+		'eal_common_dynmem.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_string_fns.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
 		'eal_common_trace_points.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index f1b73168b..34b416927 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -20,7 +20,16 @@
 #include <rte_lcore.h>
 #include <rte_common.h>
 #include <rte_spinlock.h>
+
+#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_eal_trace.h>
+#else
+/* Suppress -Wempty-body for tracepoints used as "if" body. */
+#define rte_eal_trace_mem_malloc(...) do {} while (0)
+#define rte_eal_trace_mem_zmalloc(...) do {} while (0)
+#define rte_eal_trace_mem_realloc(...) do {} while (0)
+#define rte_eal_trace_mem_free(...) do {} while (0)
+#endif
 
 #include <rte_malloc.h>
 #include "malloc_elem.h"
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..854b83bcd 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,9 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
 	rte_log
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
 	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_get_page_size
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_unmap
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 7c2fcc860..d7020ffa8 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -94,6 +94,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -225,6 +243,89 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
+
 /* Stubs to enable EAL trace point compilation
  * until eal_common_trace.c can be compiled.
  */
@@ -256,7 +357,7 @@ __rte_trace_point_register(rte_trace_point_t *trace, const char *name,
 	return -ENOTSUP;
 }
 
-/* Launch threads, called at application init(). */
+ /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
@@ -279,6 +380,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.in_memory == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.in_memory = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -290,6 +398,42 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..844910f1f
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,442 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+#include <rte_windows.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+
+		/* During commitment, memory is temporary freed and might
+		 * be allocated by different non-EAL thread. This is a fatal
+		 * error, because it breaks MSL assumptions.
+		 */
+		if ((addr != NULL) && (addr != requested_addr)) {
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				" allocation - MSL is not VA-contiguous!\n",
+				requested_addr);
+			return -1;
+		}
+
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx(%p)", addr);
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	if (eal_mem_decommit(addr, alloc_sz) && (rte_errno == EADDRNOTAVAIL)) {
+		/* During decommitment, memory is temporarily returned
+		 * to the system and the address may become unavailable.
+		 */
+		RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+			" allocation - MSL is not VA-contiguous!\n", addr);
+	}
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len)) {
+		if (rte_errno == EADDRNOTAVAIL) {
+			/* See alloc_seg() for explanation. */
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				" allocation - MSL is not VA-contiguous!\n",
+				ms->addr);
+		}
+		return -1;
+	}
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
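
For readers unfamiliar with the Win32 API used above: the hugepage check in
alloc_seg() boils down to the pattern below. This is an illustrative sketch
only (the helper name is made up and error reporting is omitted):

#include <windows.h>
#include <psapi.h>
#include <stdbool.h>

/* Touch a page to force physical allocation, then ask the OS
 * whether it is backed by a resident large (huge) page.
 */
static bool
is_backed_by_hugepage(HANDLE process, void *addr)
{
	PSAPI_WORKING_SET_EX_INFORMATION info;

	*(volatile int *)addr = *(volatile int *)addr;

	info.VirtualAddress = addr;
	if (!QueryWorkingSetEx(process, &info, sizeof(info)))
		return false;

	return info.VirtualAttributes.Valid &&
		info.VirtualAttributes.LargePage;
}
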
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..ec40bab16
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,710 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <inttypes.h>
+#include <io.h>
+
+#include <rte_errno.h>
+#include <rte_memory.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+#include <rte_virt2phys.h>
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ *
+ * MEM_RESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
+ */
+#ifndef MEM_RESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags,
+	MemExtendedParameterMax
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly as the function, so that calling code does not depend
+ * on whether it is resolved at compile time or loaded dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
+	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
+	 */
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+
+		/* Contrary to the docs, Server 2016 is not supported. */
+		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
+			" is required for memory management\n");
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* defined(MEM_RESERVE_PLACEHOLDER) */
+
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Always using physical addresses under Windows if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	HANDLE process;
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	process = GetCurrentProcess();
+
+	virt = VirtualAlloc2(process, requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFreeEx(process, virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	HANDLE process;
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	process = GetCurrentProcess();
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+
+		if (VirtualQueryEx(process, requested_addr, &info,
+				sizeof(info)) != sizeof(info)) {
+			RTE_LOG_WIN32_ERR("VirtualQuery(%p)", requested_addr);
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) && !VirtualFreeEx(
+				process, requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR(
+				"VirtualFreeEx(%p, %zu, preserve placeholder)",
+				requested_addr, size);
+			return NULL;
+		}
+
+		/* Temporarily release the region to be committed.
+		 *
+		 * There is an inherent race for this memory range
+		 * if another thread allocates memory via OS API.
+		 * However, VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)
+		 * doesn't work with MEM_LARGE_PAGES on Windows Server.
+		 */
+		if (!VirtualFreeEx(process, requested_addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				requested_addr);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAlloc2(process, requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		/* Logging may overwrite GetLastError() result. */
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, commit large pages)",
+			requested_addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((requested_addr != NULL) && (addr != requested_addr)) {
+		/* We lost the race for the requested_addr. */
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, release)", addr);
+
+		rte_errno = EADDRNOTAVAIL;
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	HANDLE process;
+	void *stub;
+	DWORD flags;
+
+	process = GetCurrentProcess();
+
+	/* Hugepages cannot be decommitted on Windows,
+	 * so free them and replace the block with a placeholder.
+	 * There is a race for the VA in this block until the VirtualAlloc2()
+	 * call below.
+	 */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	flags = MEM_RESERVE | MEM_RESERVE_PLACEHOLDER;
+	stub = VirtualAlloc2(
+		process, addr, size, flags, PAGE_NOACCESS, NULL, 0);
+	if (stub == NULL) {
+		/* We lost the race for the VA. */
+		if (!VirtualFreeEx(process, stub, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, release)", stub);
+		rte_errno = EADDRNOTAVAIL;
+		return -1;
+	}
+
+	/* No need to join reserved regions adjacent to the freed one:
+	 * eal_mem_commit() will just pick up the page-size placeholder
+	 * created here.
+	 */
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if the region must be in reserved state but is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	HANDLE process;
+
+	process = GetCurrentProcess();
+
+	if (VirtualQueryEx(
+			process, addr, &info, sizeof(info)) != sizeof(info)) {
+		RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", addr);
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFreeEx(process, addr, size,
+			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR(
+			"VirtualFreeEx(%p, %zu, preserve placeholder)",
+			addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	RTE_SET_USED(virt);
+	RTE_SET_USED(size);
+	RTE_SET_USED(dump);
+
+	/* Windows does not dump reserved memory by default.
+	 *
+	 * There is <werapi.h> to include or exclude regions from the dump,
+	 * but this is not currently required by EAL.
+	 */
+
+	rte_errno = ENOTSUP;
+	return -1;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* There is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not support
+	 * private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (!virt) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!UnmapViewOfFile(virt))
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		virt = NULL;
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* The Windows memory allocation strategy is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless the user provides an address hint.
+	 */
+	return 0;
+}
+
+size_t
+rte_get_page_size(void)
+{
+	static SYSTEM_INFO info;
+
+	if (info.dwPageSize == 0)
+		GetSystemInfo(&info);
+
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes `void*`, work around compiler warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock(%p %#zx)", virt, size);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		EAL_LOG_NOT_IMPLEMENTED();
+		return -1;
+	}
+
+	return eal_dynmem_memseg_lists_init();
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs;
+	uint64_t mem_sz, page_sz;
+	void *addr;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	msl = &mcfg->memsegs[0];
+
+	mem_sz = internal_config.memory;
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = mem_sz / page_sz;
+
+	if (eal_memseg_list_init_named(
+			msl, "nohugemem", page_sz, n_segs, 0, true)) {
+		return -1;
+	}
+
+	addr = VirtualAlloc(
+		NULL, mem_sz, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (addr == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc(size=%#zx)", mem_sz);
+		RTE_LOG(ERR, EAL, "Cannot allocate memory\n");
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	eal_memseg_list_populate(msl, addr, n_segs);
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_dynmem_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
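
Taken together, eal_mem_reserve(), eal_mem_commit() and eal_mem_decommit()
implement a placeholder-based hugepage lifecycle. A condensed sketch of the
intended call sequence (illustrative only; assumes a 2 MB hugepage size and
omits error details):

static int
hugepage_lifecycle_example(void)
{
	const size_t sz = 2 * 1024 * 1024; /* assumed hugepage size */
	void *va, *mem;

	/* Reserve address space only: a placeholder, no physical memory. */
	va = eal_mem_reserve(NULL, sz, 0);
	if (va == NULL)
		return -1;

	/* Back the placeholder with a committed hugepage. */
	mem = eal_mem_commit(va, sz, SOCKET_ID_ANY);
	if (mem == NULL)
		return -1;

	/* Return the range to placeholder (reserved) state. */
	if (eal_mem_decommit(mem, sz) < 0)
		return -1;

	/* Release the reservation entirely. */
	eal_mem_free(va, sz);
	return 0;
}
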
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, the stub must log a warning, and a comment
+ * must document what requires success emulation.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Nothing to register without multi-process support; success is emulated. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* The common memory allocator depends on this function succeeding. */
+	EAL_LOG_STUB();
+	return 0;
+}
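
Because EAL_LOG_NOT_IMPLEMENTED() sets rte_errno to ENOTSUP, callers can
detect the missing functionality and degrade gracefully. A hypothetical
caller-side pattern (the action name is made up):

struct rte_mp_msg msg = { .name = "example_action" };

if (rte_mp_sendmsg(&msg) < 0 && rte_errno == ENOTSUP) {
	/* Multi-process IPC is unavailable: continue single-process. */
}
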
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..caabffedf 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,4 +52,63 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with eal_mem_reserve()
+ * or decommitted from hugepages by eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and rte_errno is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit, must be the size of a page
+ *  (hugepage or regular one).
+ *
+ * The *addr* and *size* must match location and size
+ * of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..fdd4a6f40 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -36,6 +36,15 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+/* MinGW-w64 7.0.0 defines _open() and open() as inlines, 6.0.0 doesn't.
+ * Other pairs of functions are only defined, not declared.
+ */
+#if !defined RTE_TOOLCHAIN_GCC || defined NO_OLDNAMES
+#define open _open
+#endif
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
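
For illustration, a translation through this interface passes the virtual
address in and receives the physical address out, mirroring what
rte_mem_virt2phy() in eal_memory.c above does. A sketch (assumes *device*
is a handle opened as in eal_mem_virt2iova_init()):

static uint64_t
virt2phys_example(HANDLE device, void *virt)
{
	LARGE_INTEGER phys;
	DWORD bytes;

	if (!DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
			&virt, sizeof(virt), &phys, sizeof(phys),
			&bytes, NULL))
		return 0; /* zero also means "not mapped", see note above */

	return phys.QuadPart;
}
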
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 52978e9d7..f2387762a 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -9,7 +9,12 @@ sources += files(
 	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
+	'eal_memalloc.c',
+	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'fnmatch.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/11] trace: add size_t field emitter
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 06/11] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-05-25  5:53         ` Jerin Jacob
  0 siblings, 0 replies; 218+ messages in thread
From: Jerin Jacob @ 2020-05-25  5:53 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Jerin Jacob, Sunil Kumar Kori, Olivier Matz,
	Andrew Rybchenko

On Mon, May 25, 2020 at 6:08 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
> sizeof(long) == 4 and sizeof(size_t) == 8 for 64-bit programs.
> Tracepoints using "long" field emitter are therefore invalid there.
> Add dedicated field emitter for size_t and use it to store size_t values
> in all existing tracepoints.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

The Doxygen comment for the new emit function is missing.
See https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/include/rte_trace_point.h#L138

Other than the above nit, rest looks good to me.
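
For context: emitters in rte_trace_point.h are thin macros over
__rte_trace_point_emit(), so the dedicated size_t emitter is presumably of
the following shape (a sketch inferred from the existing pattern, not the
actual patch text):

#define rte_trace_point_emit_size_t(in) \
	__rte_trace_point_emit(in, size_t)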

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-05-27  6:33         ` Ray Kinsella
  2020-05-27 16:34           ` Dmitry Kozlyuk
  2020-05-28 11:26         ` Burakov, Anatoly
  2020-05-28 11:52         ` Burakov, Anatoly
  2 siblings, 1 reply; 218+ messages in thread
From: Ray Kinsella @ 2020-05-27  6:33 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Neil Horman


Are wrappers 100% required?
Would it be simpler (and less invasive) to have a windows_compat.h that plugged these holes?
I am not sure of the standard approach here - so I will leave this to others.

Outside of that - do these symbols really require experimental status?
Are they really likely to change?

Ray K

On 25/05/2020 01:37, Dmitry Kozlyuk wrote:
> Introduce OS-independent wrappers for memory management operations used
> across DPDK and specifically in common code of EAL:
> 
> * rte_mem_map()
> * rte_mem_unmap()
> * rte_get_page_size()
> * rte_mem_lock()
> 
> Windows uses different APIs for memory mapping and reservation, while
> Unices reserve memory by mapping it. Introduce EAL private functions to
> support memory reservation in common code:
> 
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be more safe and
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  lib/librte_eal/common/eal_common_fbarray.c |  37 +++--
>  lib/librte_eal/common/eal_common_memory.c  |  60 +++-----
>  lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
>  lib/librte_eal/freebsd/Makefile            |   1 +
>  lib/librte_eal/include/rte_memory.h        |  88 ++++++++++++
>  lib/librte_eal/linux/Makefile              |   1 +
>  lib/librte_eal/linux/eal_memalloc.c        |   5 +-
>  lib/librte_eal/rte_eal_version.map         |   6 +
>  lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
>  lib/librte_eal/unix/meson.build            |   1 +
>  10 files changed, 365 insertions(+), 64 deletions(-)
>  create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
> 
> diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
> index cfcab63e9..a41e8ce5f 100644
> --- a/lib/librte_eal/common/eal_common_fbarray.c
> +++ b/lib/librte_eal/common/eal_common_fbarray.c
> @@ -5,15 +5,15 @@
>  #include <fcntl.h>
>  #include <inttypes.h>
>  #include <limits.h>
> -#include <sys/mman.h>
>  #include <stdint.h>
>  #include <errno.h>
>  #include <string.h>
>  #include <unistd.h>
>  
>  #include <rte_common.h>
> -#include <rte_log.h>
>  #include <rte_errno.h>
> +#include <rte_log.h>
> +#include <rte_memory.h>
>  #include <rte_spinlock.h>
>  #include <rte_tailq.h>
>  
> @@ -90,12 +90,9 @@ resize_and_map(int fd, void *addr, size_t len)
>  		return -1;
>  	}
>  
> -	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
> -			MAP_SHARED | MAP_FIXED, fd, 0);
> +	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
> +			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
>  	if (map_addr != addr) {
> -		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
> -		/* pass errno up the chain */
> -		rte_errno = errno;
>  		return -1;
>  	}
>  	return 0;
> @@ -733,7 +730,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  		return -1;
>  	}
>  
> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1) {
>  		free(ma);
>  		return -1;
> @@ -754,9 +751,11 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  
>  	if (internal_config.no_shconf) {
>  		/* remap virtual area as writable */
> -		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
> -				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
> -		if (new_data == MAP_FAILED) {
> +		static const int flags = RTE_MAP_FORCE_ADDRESS |
> +			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
> +		void *new_data = rte_mem_map(data, mmap_len,
> +			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
> +		if (new_data == NULL) {
>  			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
>  					__func__, strerror(errno));
>  			goto fail;
> @@ -821,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  	return 0;
>  fail:
>  	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>  	if (fd >= 0)
>  		close(fd);
>  	free(ma);
> @@ -859,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  		return -1;
>  	}
>  
> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1) {
>  		free(ma);
>  		return -1;
> @@ -911,7 +910,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  	return 0;
>  fail:
>  	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>  	if (fd >= 0)
>  		close(fd);
>  	free(ma);
> @@ -939,8 +938,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
>  	 * really do anything about it, things will blow up either way.
>  	 */
>  
> -	size_t page_sz = sysconf(_SC_PAGESIZE);
> -
> +	size_t page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1)
>  		return -1;
>  
> @@ -959,7 +957,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
>  		goto out;
>  	}
>  
> -	munmap(arr->data, mmap_len);
> +	rte_mem_unmap(arr->data, mmap_len);
>  
>  	/* area is unmapped, close fd and remove the tailq entry */
>  	if (tmp->fd >= 0)
> @@ -994,8 +992,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  	 * really do anything about it, things will blow up either way.
>  	 */
>  
> -	size_t page_sz = sysconf(_SC_PAGESIZE);
> -
> +	size_t page_sz = rte_get_page_size();
>  	if (page_sz == (size_t)-1)
>  		return -1;
>  
> @@ -1044,7 +1041,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  		}
>  		close(fd);
>  	}
> -	munmap(arr->data, mmap_len);
> +	rte_mem_unmap(arr->data, mmap_len);
>  
>  	/* area is unmapped, remove the tailq entry */
>  	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
> diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
> index 4c897a13f..c6243aca1 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -11,7 +11,6 @@
>  #include <string.h>
>  #include <unistd.h>
>  #include <inttypes.h>
> -#include <sys/mman.h>
>  #include <sys/queue.h>
>  
>  #include <rte_fbarray.h>
> @@ -40,18 +39,10 @@
>  static void *next_baseaddr;
>  static uint64_t system_page_sz;
>  
> -#ifdef RTE_EXEC_ENV_LINUX
> -#define RTE_DONTDUMP MADV_DONTDUMP
> -#elif defined RTE_EXEC_ENV_FREEBSD
> -#define RTE_DONTDUMP MADV_NOCORE
> -#else
> -#error "madvise doesn't support this OS"
> -#endif
> -
>  #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
>  void *
>  eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags)
> +	size_t page_sz, int flags, int reserve_flags)
>  {
>  	bool addr_is_hint, allow_shrink, unmap, no_align;
>  	uint64_t map_sz;
> @@ -59,9 +50,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  	uint8_t try = 0;
>  
>  	if (system_page_sz == 0)
> -		system_page_sz = sysconf(_SC_PAGESIZE);
> -
> -	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
> +		system_page_sz = rte_get_page_size();
>  
>  	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
>  
> @@ -105,24 +94,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  			return NULL;
>  		}
>  
> -		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
> -				mmap_flags, -1, 0);
> -		if (mapped_addr == MAP_FAILED && allow_shrink)
> +		mapped_addr = eal_mem_reserve(
> +			requested_addr, (size_t)map_sz, reserve_flags);
> +		if ((mapped_addr == NULL) && allow_shrink)
>  			*size -= page_sz;
>  
> -		if (mapped_addr != MAP_FAILED && addr_is_hint &&
> -		    mapped_addr != requested_addr) {
> +		if ((mapped_addr != NULL) && addr_is_hint &&
> +				(mapped_addr != requested_addr)) {
>  			try++;
>  			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
>  			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
>  				/* hint was not used. Try with another offset */
> -				munmap(mapped_addr, map_sz);
> -				mapped_addr = MAP_FAILED;
> +				eal_mem_free(mapped_addr, map_sz);
> +				mapped_addr = NULL;
>  				requested_addr = next_baseaddr;
>  			}
>  		}
>  	} while ((allow_shrink || addr_is_hint) &&
> -		 mapped_addr == MAP_FAILED && *size > 0);
> +		(mapped_addr == NULL) && (*size > 0));
>  
>  	/* align resulting address - if map failed, we will ignore the value
>  	 * anyway, so no need to add additional checks.
> @@ -132,20 +121,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  
>  	if (*size == 0) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
> -			strerror(errno));
> -		rte_errno = errno;
> +			strerror(rte_errno));
>  		return NULL;
> -	} else if (mapped_addr == MAP_FAILED) {
> +	} else if (mapped_addr == NULL) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
> -			strerror(errno));
> -		/* pass errno up the call chain */
> -		rte_errno = errno;
> +			strerror(rte_errno));
>  		return NULL;
>  	} else if (requested_addr != NULL && !addr_is_hint &&
>  			aligned_addr != requested_addr) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
>  			requested_addr, aligned_addr);
> -		munmap(mapped_addr, map_sz);
> +		eal_mem_free(mapped_addr, map_sz);
>  		rte_errno = EADDRNOTAVAIL;
>  		return NULL;
>  	} else if (requested_addr != NULL && addr_is_hint &&
> @@ -161,7 +147,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  		aligned_addr, *size);
>  
>  	if (unmap) {
> -		munmap(mapped_addr, map_sz);
> +		eal_mem_free(mapped_addr, map_sz);
>  	} else if (!no_align) {
>  		void *map_end, *aligned_end;
>  		size_t before_len, after_len;
> @@ -179,19 +165,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  		/* unmap space before aligned mmap address */
>  		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
>  		if (before_len > 0)
> -			munmap(mapped_addr, before_len);
> +			eal_mem_free(mapped_addr, before_len);
>  
>  		/* unmap space after aligned end mmap address */
>  		after_len = RTE_PTR_DIFF(map_end, aligned_end);
>  		if (after_len > 0)
> -			munmap(aligned_end, after_len);
> +			eal_mem_free(aligned_end, after_len);
>  	}
>  
>  	if (!unmap) {
>  		/* Exclude these pages from a core dump. */
> -		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
> -			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
> -				strerror(errno));
> +		eal_mem_set_dump(aligned_addr, *size, false);
>  	}
>  
>  	return aligned_addr;
> @@ -547,10 +531,10 @@ rte_eal_memdevice_init(void)
>  int
>  rte_mem_lock_page(const void *virt)
>  {
> -	unsigned long virtual = (unsigned long)virt;
> -	int page_size = getpagesize();
> -	unsigned long aligned = (virtual & ~(page_size - 1));
> -	return mlock((void *)aligned, page_size);
> +	uintptr_t virtual = (uintptr_t)virt;
> +	size_t page_size = rte_get_page_size();
> +	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
> +	return rte_mem_lock((void *)aligned, page_size);
>  }
>  
>  int
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index cef73d6fe..a93850c09 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -11,6 +11,7 @@
>  
>  #include <rte_dev.h>
>  #include <rte_lcore.h>
> +#include <rte_memory.h>
>  
>  /**
>   * Structure storing internal configuration (per-lcore)
> @@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
>   */
>  int rte_eal_check_module(const char *module_name);
>  
> +/**
> + * Memory reservation flags.
> + */
> +enum eal_mem_reserve_flags {
> +	/**
> +	 * Reserve hugepages. May be unsupported by some platforms.
> +	 */
> +	EAL_RESERVE_HUGEPAGES = 1 << 0,
> +	/**
> +	 * Force reserving memory at the requested address.
> +	 * This can be a destructive action depending on the implementation.
> +	 *
> +	 * @see RTE_MAP_FORCE_ADDRESS for description of possible consequences
> +	 *      (although implementations are not required to use it).
> +	 */
> +	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
> +};
> +
>  /**
>   * Get virtual area of specified size from the OS.
>   *
> @@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
>   *   Page size on which to align requested virtual area.
>   * @param flags
>   *   EAL_VIRTUAL_AREA_* flags.
> - * @param mmap_flags
> - *   Extra flags passed directly to mmap().
> + * @param reserve_flags
> + *   Extra flags passed directly to rte_mem_reserve().
>   *
>   * @return
>   *   Virtual area address if successful.
> @@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
>  /**< immediately unmap reserved virtual area. */
>  void *
>  eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags);
> +		size_t page_sz, int flags, int reserve_flags);
>  
>  /**
>   * Get cpu core_id.
> @@ -467,4 +486,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
>  int
>  eal_file_truncate(int fd, ssize_t size);
>  
> +/**
> + * Reserve a region of virtual memory.
> + *
> + * Use eal_mem_free() to free reserved memory.
> + *
> + * @param requested_addr
> + *  A desired reservation address, which must be page-aligned.
> + *  The system might not respect it.
> + *  NULL means the address will be chosen by the system.
> + * @param size
> + *  Reservation size. Must be a multiple of system page size.
> + * @param flags
> + *  Reservation options, a combination of eal_mem_reserve_flags.
> + * @returns
> + *  Starting address of the reserved area on success, NULL on failure.
> + *  Callers must not access this memory until remapping it.
> + */
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags);
> +
> +/**
> + * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
> + *
> + * If *virt* and *size* describe a part of the reserved region,
> + * only this part of the region is freed (accurately up to the system
> + * page size). If *virt* points to allocated memory, *size* must match
> + * the one specified on allocation. The behavior is undefined
> + * if the memory pointed by *virt* is obtained from another source
> + * than listed above.
> + *
> + * @param virt
> + *  A virtual address in a region previously reserved.
> + * @param size
> + *  Number of bytes to unreserve.
> + */
> +void
> +eal_mem_free(void *virt, size_t size);
> +
> +/**
> + * Configure memory region inclusion into core dumps.
> + *
> + * @param virt
> + *  Starting address of the region.
> + * @param size
> + *  Size of the region.
> + * @param dump
> + *  True to include memory into core dumps, false to exclude.
> + * @return
> + *  0 on success, (-1) on failure and rte_errno is set.
> + */
> +int
> +eal_mem_set_dump(void *virt, size_t size, bool dump);
> +
>  #endif /* _EAL_PRIVATE_H_ */
> diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
> index 4654ca2b3..f64a3994c 100644
> --- a/lib/librte_eal/freebsd/Makefile
> +++ b/lib/librte_eal/freebsd/Makefile
> @@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
>  
>  # from unix dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix.c
> +SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
>  
>  # from arch dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
> diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
> index 65374d53a..63ff0773d 100644
> --- a/lib/librte_eal/include/rte_memory.h
> +++ b/lib/librte_eal/include/rte_memory.h
> @@ -82,6 +82,94 @@ struct rte_memseg_list {
>  	struct rte_fbarray memseg_arr;
>  };
>  
> +/**
> + * Memory protection flags.
> + */
> +enum rte_mem_prot {
> +	RTE_PROT_READ = 1 << 0,   /**< Read access. */
> +	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
> +	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
> +};
> +
> +/**
> + * Additional flags for memory mapping.
> + */
> +enum rte_map_flags {
> +	/** Changes to the mapped memory are visible to other processes. */
> +	RTE_MAP_SHARED = 1 << 0,
> +	/** Mapping is not backed by a regular file. */
> +	RTE_MAP_ANONYMOUS = 1 << 1,
> +	/** Copy-on-write mapping, changes are invisible to other processes. */
> +	RTE_MAP_PRIVATE = 1 << 2,
> +	/**
> +	 * Force mapping to the requested address. This flag should be used
> +	 * with caution, because to fulfill the request implementation
> +	 * may remove all other mappings in the requested region. However,
> +	 * it is not required to do so, thus mapping with this flag may fail.
> +	 */
> +	RTE_MAP_FORCE_ADDRESS = 1 << 3
> +};
> +
> +/**
> + * Map a portion of an opened file or the page file into memory.
> + *
> + * This function is similar to POSIX mmap(3) with common MAP_ANONYMOUS
> + * extension, except for the return value.
> + *
> + * @param requested_addr
> + *  Desired virtual address for mapping. Can be NULL to let OS choose.
> + * @param size
> + *  Size of the mapping in bytes.
> + * @param prot
> + *  Protection flags, a combination of rte_mem_prot values.
> + * @param flags
> + *  Additional mapping flags, a combination of rte_map_flags.
> + * @param fd
> + *  Mapped file descriptor. Can be negative for anonymous mapping.
> + * @param offset
> + *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
> + * @return
> + *  Mapped address or NULL on failure and rte_errno is set to OS error.
> + */
> +__rte_experimental
> +void *
> +rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset);
> +
> +/**
> + * OS-independent implementation of POSIX munmap(3).
> + */
> +__rte_experimental
> +int
> +rte_mem_unmap(void *virt, size_t size);
> +
> +/**
> + * Get system page size. This function never fails.
> + *
> + * @return
> + *   Page size in bytes.
> + */
> +__rte_experimental
> +size_t
> +rte_get_page_size(void);
> +
> +/**
> + * Lock in physical memory all pages crossed by the address region.
> + *
> + * @param virt
> + *   Base virtual address of the region.
> + * @param size
> + *   Size of the region.
> + * @return
> + *   0 on success, negative on error.
> + *
> + * @see rte_get_page_size() to retrieve the page size.
> + * @see rte_mem_lock_page() to lock an entire single page.
> + */
> +__rte_experimental
> +int
> +rte_mem_lock(const void *virt, size_t size);
> +
>  /**
>   * Lock page in physical memory and prevent from swapping.
>   *
> diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
> index 4f39d462c..d314648cb 100644
> --- a/lib/librte_eal/linux/Makefile
> +++ b/lib/librte_eal/linux/Makefile
> @@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
>  
>  # from unix dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix.c
> +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
>  
>  # from arch dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
> diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
> index 2c717f8bd..bf29b83c6 100644
> --- a/lib/librte_eal/linux/eal_memalloc.c
> +++ b/lib/librte_eal/linux/eal_memalloc.c
> @@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
>  mapped:
>  	munmap(addr, alloc_sz);
>  unmapped:
> -	flags = MAP_FIXED;
> +	flags = EAL_RESERVE_FORCE_ADDRESS;
>  	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
>  	if (new_addr != addr) {
>  		if (new_addr != NULL)
> @@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
>  		return -1;
>  	}
>  
> -	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
> -		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
> +	eal_mem_set_dump(ms->addr, ms->len, false);
>  
>  	exit_early = false;
>  
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index d8038749a..dff51b13d 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -386,4 +386,10 @@ EXPERIMENTAL {
>  	rte_trace_point_lookup;
>  	rte_trace_regexp;
>  	rte_trace_save;
> +
> +	# added in 20.08
> +	rte_get_page_size;
> +	rte_mem_lock;
> +	rte_mem_map;
> +	rte_mem_unmap;
>  };
> diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
> new file mode 100644
> index 000000000..658595b6e
> --- /dev/null
> +++ b/lib/librte_eal/unix/eal_unix_memory.c
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <rte_errno.h>
> +#include <rte_log.h>
> +#include <rte_memory.h>
> +
> +#include "eal_private.h"
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +#define EAL_DONTDUMP MADV_DONTDUMP
> +#define EAL_DODUMP   MADV_DODUMP
> +#elif defined RTE_EXEC_ENV_FREEBSD
> +#define EAL_DONTDUMP MADV_NOCORE
> +#define EAL_DODUMP   MADV_CORE
> +#else
> +#error "madvise is not supported on this OS"
> +#endif
> +
> +static void *
> +mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
> +	if (virt == MAP_FAILED) {
> +		RTE_LOG(DEBUG, EAL,
> +			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
> +			requested_addr, size, prot, flags, fd, offset,
> +			strerror(errno));
> +		rte_errno = errno;
> +		return NULL;
> +	}
> +	return virt;
> +}
> +
> +static int
> +mem_unmap(void *virt, size_t size)
> +{
> +	int ret = munmap(virt, size);
> +	if (ret < 0) {
> +		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
> +			virt, size, strerror(errno));
> +		rte_errno = errno;
> +	}
> +	return ret;
> +}
> +
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags)
> +{
> +	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
> +
> +	if (flags & EAL_RESERVE_HUGEPAGES) {
> +#ifdef MAP_HUGETLB
> +		sys_flags |= MAP_HUGETLB;
> +#else
> +		rte_errno = ENOTSUP;
> +		return NULL;
> +#endif
> +	}
> +
> +	if (flags & EAL_RESERVE_FORCE_ADDRESS)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
> +}
> +
> +void
> +eal_mem_free(void *virt, size_t size)
> +{
> +	mem_unmap(virt, size);
> +}
> +
> +int
> +eal_mem_set_dump(void *virt, size_t size, bool dump)
> +{
> +	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
> +	int ret = madvise(virt, size, flags);
> +	if (ret) {
> +		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
> +				virt, size, flags, strerror(errno));
> +		rte_errno = errno;
> +	}
> +	return ret;
> +}
> +
> +static int
> +mem_rte_to_sys_prot(int prot)
> +{
> +	int sys_prot = PROT_NONE;
> +
> +	if (prot & RTE_PROT_READ)
> +		sys_prot |= PROT_READ;
> +	if (prot & RTE_PROT_WRITE)
> +		sys_prot |= PROT_WRITE;
> +	if (prot & RTE_PROT_EXECUTE)
> +		sys_prot |= PROT_EXEC;
> +
> +	return sys_prot;
> +}
> +
> +void *
> +rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	int sys_flags = 0;
> +	int sys_prot;
> +
> +	sys_prot = mem_rte_to_sys_prot(prot);
> +
> +	if (flags & RTE_MAP_SHARED)
> +		sys_flags |= MAP_SHARED;
> +	if (flags & RTE_MAP_ANONYMOUS)
> +		sys_flags |= MAP_ANONYMOUS;
> +	if (flags & RTE_MAP_PRIVATE)
> +		sys_flags |= MAP_PRIVATE;
> +	if (flags & RTE_MAP_FORCE_ADDRESS)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
> +}
> +
> +int
> +rte_mem_unmap(void *virt, size_t size)
> +{
> +	return mem_unmap(virt, size);
> +}
> +
> +size_t
> +rte_get_page_size(void)
> +{
> +	static size_t page_size;
> +
> +	if (!page_size)
> +		page_size = sysconf(_SC_PAGESIZE);
> +
> +	return page_size;
> +}
> +
> +int
> +rte_mem_lock(const void *virt, size_t size)
> +{
> +	int ret = mlock(virt, size);
> +	if (ret)
> +		rte_errno = errno;
> +	return ret;
> +}
> diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
> index cfa1b4ef9..5734f26ad 100644
> --- a/lib/librte_eal/unix/meson.build
> +++ b/lib/librte_eal/unix/meson.build
> @@ -3,4 +3,5 @@
>  
>  sources += files(
>  	'eal_unix.c',
> +	'eal_unix_memory.c',
>  )
> 
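
[Editor's note] A minimal usage sketch of the API above (assuming the
declarations land in <rte_memory.h> as in this revision; error handling
abbreviated, variable names illustrative):

	size_t sz = rte_get_page_size();
	void *va = rte_mem_map(NULL, sz, RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
	if (va == NULL)
		return -1; /* rte_errno holds the OS error */
	if (rte_mem_lock(va, sz) < 0)
		return -1; /* pages could not be pinned */
	rte_mem_unmap(va, sz);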

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers
  2020-05-27  6:33         ` Ray Kinsella
@ 2020-05-27 16:34           ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-27 16:34 UTC (permalink / raw)
  To: Ray Kinsella
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Neil Horman

The answers below summarize a discussion with Thomas, Ranjit, Tal, et al.

On Wed, 27 May 2020 07:33:32 +0100 Ray Kinsella <mdr@ashroe.eu> wrote:
> Are wrappers 100% required?
> Would it be simpler (and less invasive) to have a windows_compat.h that plugged these holes?
> I am not sure of the standard approach here - so I will leave this to others.

With wrappers, we control the API and semantics, which are deliberately
narrower than those of the underlying syscalls. It is also cleaner not to
export non-RTE symbols from DPDK libraries. Regarding invasiveness: the
change is small, and it factors out some common error logging in the process.

> Outside of that - do these symbols really require experimental status.
> Are they really likely to change?

Indeed, the wrappers should be internal, not experimental. Will fix in v6.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-05-28  7:31         ` Thomas Monjalon
  2020-05-28 10:04           ` Dmitry Kozlyuk
  2020-05-28 11:46         ` Burakov, Anatoly
  2020-05-28 12:19         ` Burakov, Anatoly
  2 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-05-28  7:31 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson,
	david.marchand

25/05/2020 02:37, Dmitry Kozlyuk:
> All supported OSes create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
> +void
> +eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
> +{
> +	uint64_t page_sz = msl->page_sz;
[...]
> +		addr = RTE_PTR_ADD(addr, page_sz);

This is an error in 32-bit compilation:

lib/librte_eal/common/eal_common_memory.c:
In function ‘eal_memseg_list_populate’: rte_common.h:215:30: error:
cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
  215 | #define RTE_PTR_ADD(ptr, x) ((void*)((uintptr_t)(ptr) + (x)))
      |                              ^

The original code was doing a cast to size_t.

> --- a/lib/librte_eal/linux/eal_memory.c
> +++ b/lib/librte_eal/linux/eal_memory.c
> -			addr = RTE_PTR_ADD(addr, (size_t)page_sz);

I believe the correct cast should be uintptr_t.
Maybe it would be even more correct to do this cast inside RTE_PTR_ADD?
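
[Editor's note] For illustration, that variant might read as follows (a
sketch, not a committed definition):

	/* casting the offset too keeps the sum at pointer width on 32-bit */
	#define RTE_PTR_ADD(ptr, x) \
		((void *)((uintptr_t)(ptr) + (uintptr_t)(x)))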



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-05-28  7:59         ` Thomas Monjalon
  2020-05-28 10:09           ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-05-28  7:59 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

25/05/2020 02:37, Dmitry Kozlyuk:
> * eal_file_lock: lock or unlock an open file.
> * eal_file_truncate: enforce a given size for an open file.
[...]
> Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> which is intended for common code between the two. Files should be named
> after the ones from which the code is factored in OS subdirectory.
[...]
>  lib/librte_eal/unix/eal_unix.c             | 51 ++++++++++++++++++++++

Why naming this file eal_unix?
If it's truly global, it should be unix/eal.c
If it's only about file operations, it could be unix/eal_file.c

Please update MAINTAINERS when creating a new file.
All files or directories must be listed in MAINTAINERS,
even if there is no maintainer for them.




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
@ 2020-05-28  8:34         ` Thomas Monjalon
  2020-05-28 12:21         ` Burakov, Anatoly
  1 sibling, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-05-28  8:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov

25/05/2020 02:37, Dmitry Kozlyuk:
> Code in Linux EAL that supports dynamic memory allocation (as opposed to
> static allocation used by FreeBSD) is not OS-dependent and can be reused
> by Windows EAL. Move such code to a file compiled only for the OSes that
> require it.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
> @@ -1943,7 +1633,7 @@ rte_eal_hugepage_init(void)
>  {
>  	return internal_config.legacy_mem ?
>  			eal_legacy_hugepage_init() :
> -			eal_hugepage_init();
> +			eal_dynmem_hugepage_init();

There is a compilation issue, building clang+shared:
	undefined reference to `eal_dynmem_hugepage_init'




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-28  7:31         ` Thomas Monjalon
@ 2020-05-28 10:04           ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-28 10:04 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson,
	david.marchand

On Thu, 28 May 2020 09:31:54 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> 25/05/2020 02:37, Dmitry Kozlyuk:
> > All supported OSes create memory segment lists (MSL) and reserve VA space
> > for them in a nearly identical way. Move common code into EAL private
> > functions to reduce duplication.
> > 
> > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > ---
> > +void
> > +eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
> > +{
> > +	uint64_t page_sz = msl->page_sz;  
> [...]
> > +		addr = RTE_PTR_ADD(addr, page_sz);  
> 
> This is an error in 32-bit compilation:
> 
> lib/librte_eal/common/eal_common_memory.c:
> In function ‘eal_memseg_list_populate’: rte_common.h:215:30: error:
> cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
>   215 | #define RTE_PTR_ADD(ptr, x) ((void*)((uintptr_t)(ptr) + (x)))
>       |                              ^
> 
> The original code was doing a cast to size_t.
> 
> > --- a/lib/librte_eal/linux/eal_memory.c
> > +++ b/lib/librte_eal/linux/eal_memory.c
> > -			addr = RTE_PTR_ADD(addr, (size_t)page_sz);  
> 
> I believe the correct cast should be uintptr_t.
> Maybe it would be even more correct to do this cast inside RTE_PTR_ADD?

Ack, this is the issue I mentioned in the Community Call letter. I think
size_t is a more suitable type for page_*sz* than uintptr_t.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations
  2020-05-28  7:59         ` Thomas Monjalon
@ 2020-05-28 10:09           ` Dmitry Kozlyuk
  2020-05-28 11:29             ` Thomas Monjalon
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-28 10:09 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

On Thu, 28 May 2020 09:59:13 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> 25/05/2020 02:37, Dmitry Kozlyuk:
> > * eal_file_lock: lock or unlock an open file.
> > * eal_file_truncate: enforce a given size for an open file.  
> [...]
> > Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> > which is intended for common code between the two. Files should be named
> > after the ones from which the code is factored in OS subdirectory.  
> [...]
> >  lib/librte_eal/unix/eal_unix.c             | 51 ++++++++++++++++++++++  
> 
> Why naming this file eal_unix?
> If it's truly global, it should be unix/eal.c

We've already discussed this: Makefiles are written in such a way that files
in common and arch directories must have names distinct from OS-specific
ones. While Makefiles still exist, this naming is consistent with the rest
of librte_eal.

> If it's only about file operations, it could be unix/eal_file.c

Good suggestion, given that more file-related functions are expected to be
extracted here in the future (e.g. for tracing).

> Please update MAINTAINERS when creating a new file.
> All files or directories must be listed in MAINTAINERS,
> even if there is no maintainer for them.

Will do.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
  2020-05-27  6:33         ` Ray Kinsella
@ 2020-05-28 11:26         ` Burakov, Anatoly
  2020-06-01 21:08           ` Thomas Monjalon
  2020-05-28 11:52         ` Burakov, Anatoly
  2 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 11:26 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson, Ray Kinsella, Neil Horman

On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> Introduce OS-independent wrappers for memory management operations used
> across DPDK and specifically in common code of EAL:
> 
> * rte_mem_map()
> * rte_mem_unmap()
> * rte_get_page_size()
> * rte_mem_lock()
> 
> Windows uses different APIs for memory mapping and reservation, while
> Unices reserve memory by mapping it. Introduce EAL private functions to
> support memory reservation in common code:
> 
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be safer and more
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_get_page_size();
>   	if (page_sz == (size_t)-1) {
>   		free(ma);
>   		return -1;
> @@ -754,9 +751,11 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>   
>   	if (internal_config.no_shconf) {
>   		/* remap virtual area as writable */
> -		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
> -				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
> -		if (new_data == MAP_FAILED) {
> +		static const int flags = RTE_MAP_FORCE_ADDRESS |
> +			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
> +		void *new_data = rte_mem_map(data, mmap_len,
> +			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
> +		if (new_data == NULL) {
>   			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
>   					__func__, strerror(errno));

I believe this should be rte_strerror(rte_errno) instead of strerror(errno).
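
[Editor's note] For clarity, the corrected call would read roughly as below
(rte_mem_map() sets rte_errno on failure, per the commit message):

		RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
				__func__, rte_strerror(rte_errno));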

>   			goto fail;
> @@ -821,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>   	return 0;
>   fail:
>   	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>   	if (fd >= 0)
>   		close(fd);
>   	free(ma);
> @@ -859,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>   		return -1;

<snip>

>   
> +/**
> + * Memory protection flags.
> + */
> +enum rte_mem_prot {
> +	RTE_PROT_READ = 1 << 0,   /**< Read access. */
> +	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
> +	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
> +};
> +
> +/**
> + * Additional flags for memory mapping.
> + */
> +enum rte_map_flags {
> +	/** Changes to the mapped memory are visible to other processes. */
> +	RTE_MAP_SHARED = 1 << 0,
> +	/** Mapping is not backed by a regular file. */
> +	RTE_MAP_ANONYMOUS = 1 << 1,
> +	/** Copy-on-write mapping, changes are invisible to other processes. */
> +	RTE_MAP_PRIVATE = 1 << 2,
> +	/**
> +	 * Force mapping to the requested address. This flag should be used
> +	 * with caution, because to fulfill the request implementation
> +	 * may remove all other mappings in the requested region. However,
> +	 * it is not required to do so, thus mapping with this flag may fail.
> +	 */
> +	RTE_MAP_FORCE_ADDRESS = 1 << 3
> +};

I have no strong opinion on this, but it feels like the fact that these
are enums is a relic from the times when you used enum everywhere :) I
have a feeling that the DPDK codebase prefers #define's for this usage,
while what you have here is more of a C++ thing.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations
  2020-05-28 10:09           ` Dmitry Kozlyuk
@ 2020-05-28 11:29             ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-05-28 11:29 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

28/05/2020 12:09, Dmitry Kozlyuk:
> On Thu, 28 May 2020 09:59:13 +0200
> Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> > 25/05/2020 02:37, Dmitry Kozlyuk:
> > > * eal_file_lock: lock or unlock an open file.
> > > * eal_file_truncate: enforce a given size for an open file.  
> > [...]
> > > Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> > > which is intended for common code between the two. Files should be named
> > > after the ones from which the code is factored in OS subdirectory.  
> > [...]
> > >  lib/librte_eal/unix/eal_unix.c             | 51 ++++++++++++++++++++++  
> > 
> > Why naming this file eal_unix?
> > If it's truly global, it should be unix/eal.c
> 
> We've already discussed this: Makefiles are written in such a way that files
> in common and arch directories must have names distinct from OS-specific
> ones. While Makefiles still exist, this naming is consistent with the rest
> of librte_eal.

Oh right.

> > If it's only about file operations, it could be unix/eal_file.c
> 
> Good suggestion, given that more file-related functions are expected to be
> extracted here in the future (e.g. for tracing).
> 
> > Please update MAINTAINERS when creating a new file.
> > All files or directories must be listed in MAINTAINERS,
> > even if there is no maintainer for them.
> 
> Will do.

Thanks




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
  2020-05-28  7:31         ` Thomas Monjalon
@ 2020-05-28 11:46         ` Burakov, Anatoly
  2020-05-28 14:41           ` Dmitry Kozlyuk
  2020-05-28 12:19         ` Burakov, Anatoly
  2 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 11:46 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> All supported OSes create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

> +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
> +{
> +	uint64_t page_sz;
> +	size_t mem_sz;
> +	void *addr;
> +
> +	page_sz = msl->page_sz;
> +	mem_sz = page_sz * msl->memseg_arr.len;
> +
> +	addr = eal_get_virtual_area(
> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> +	if (addr == NULL) {
> +		if (rte_errno == EADDRNOTAVAIL)
> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> +				(unsigned long long)mem_sz, msl->base_va);

Do all OS's support this EAL option?

> +		else
> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> +		return -1;
> +	}
> +	msl->base_va = addr;
> +	msl->len = mem_sz;
> +
> +	RTE_LOG(DEBUG, EAL, "VA reserved for memseg list at %p, size %zx\n",
> +			addr, mem_sz);
> +

<snip>

>   
> -#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
>   static int
> -alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
> +memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
>   		int n_segs, int socket_id, int type_msl_idx)
>   {
> -	char name[RTE_FBARRAY_NAME_LEN];
> -
> -	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
> -		 type_msl_idx);
> -	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
> -			sizeof(struct rte_memseg))) {
> -		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
> -			rte_strerror(rte_errno));
> -		return -1;
> -	}
> -
> -	msl->page_sz = page_sz;
> -	msl->socket_id = socket_id;
> -	msl->base_va = NULL;
> -	msl->heap = 1; /* mark it as a heap segment */
> -
> -	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
> -			(size_t)page_sz >> 10, socket_id);
> -
> -	return 0;
> +	return eal_memseg_list_init(
> +		msl, page_sz, n_segs, socket_id, type_msl_idx, true);

Here and in similar places: I wonder if there's value in keeping the
memseg_list_init() function instead of just calling eal_memseg_list_init()
directly from where it is called?
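
[Editor's note] For illustration, each call site would then shrink to
roughly (per the wrapper body quoted above):

	if (eal_memseg_list_init(msl, page_sz, n_segs,
			socket_id, type_msl_idx, true) < 0)
		return -1;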

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
  2020-05-27  6:33         ` Ray Kinsella
  2020-05-28 11:26         ` Burakov, Anatoly
@ 2020-05-28 11:52         ` Burakov, Anatoly
  2 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 11:52 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson, Ray Kinsella, Neil Horman

On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> Introduce OS-independent wrappers for memory management operations used
> across DPDK and specifically in common code of EAL:
> 
> * rte_mem_map()
> * rte_mem_unmap()
> * rte_get_page_size()
> * rte_mem_lock()
> 
> Windows uses different APIs for memory mapping and reservation, while
> Unices reserve memory by mapping it. Introduce EAL private functions to
> support memory reservation in common code:
> 
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be safer and more
> expressive.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

> +	} else if (mapped_addr == NULL) {
>   		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
> -			strerror(errno));
> -		/* pass errno up the call chain */
> -		rte_errno = errno;
> +			strerror(rte_errno));

Also, please check that you're using rte_strerror with rte_errno :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
  2020-05-28  7:31         ` Thomas Monjalon
  2020-05-28 11:46         ` Burakov, Anatoly
@ 2020-05-28 12:19         ` Burakov, Anatoly
  2 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 12:19 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> All supported OSes create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

> +void
> +eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
> +{
> +	uint64_t page_sz = msl->page_sz;
> +	int i;
> +
> +	for (i = 0; i < n_segs; i++) {
> +		struct rte_fbarray *arr = &msl->memseg_arr;
> +		struct rte_memseg *ms = rte_fbarray_get(arr, i);
> +
> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
> +			ms->iova = (uintptr_t)addr;
> +		else
> +			ms->iova = RTE_BAD_IOVA;
> +		ms->addr = addr;
> +		ms->hugepage_sz = page_sz;
> +		ms->socket_id = 0;
> +		ms->len = page_sz;
> +
> +		rte_fbarray_set_used(arr, i);
> +
> +		addr = RTE_PTR_ADD(addr, page_sz);

This breaks the 32-bit build. I believe page_sz should be size_t, not uint64_t.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
  2020-05-28  8:34         ` Thomas Monjalon
@ 2020-05-28 12:21         ` Burakov, Anatoly
  2020-05-28 13:24           ` Dmitry Kozlyuk
  1 sibling, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 12:21 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader, Tal Shnaiderman

On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> Code in Linux EAL that supports dynamic memory allocation (as opposed to
> static allocation used by FreeBSD) is not OS-dependent and can be reused
> by Windows EAL. Move such code to a file compiled only for the OSes that
> require it.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

I believe you forgot to add dynmem to Makefile.

> +eal_dynmem_calc_num_pages_per_socket(
> +	uint64_t *memory, struct hugepage_info *hp_info,
> +	struct hugepage_info *hp_used, unsigned int num_hp_info)
> +{
> +	unsigned int socket, j, i = 0;
> +	unsigned int requested, available;
> +	int total_num_pages = 0;
> +	uint64_t remaining_mem, cur_mem;
> +	uint64_t total_mem = internal_config.memory;
> +
> +	if (num_hp_info == 0)
> +		return -1;
> +
> +	/* if specific memory amounts per socket weren't requested */
> +	if (internal_config.force_sockets == 0) {
> +		size_t total_size;
> +		int cpu_per_socket[RTE_MAX_NUMA_NODES];
> +		size_t default_size;
> +		unsigned int lcore_id;

Comparing code from eal_memory.c and this one, it seems like you've 
dropped all 32-bit code from this function. Is that intentional?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-05-28 12:21         ` Burakov, Anatoly
@ 2020-05-28 13:24           ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-28 13:24 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman

On Thu, 28 May 2020 13:21:06 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> > Code in Linux EAL that supports dynamic memory allocation (as opposed to
> > static allocation used by FreeBSD) is not OS-dependent and can be reused
> > by Windows EAL. Move such code to a file compiled only for the OSes that
> > require it.
> > 
> > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > ---  
> 
> I believe you forgot to add dynmem to Makefile.

Right, thanks.

> 
> > +eal_dynmem_calc_num_pages_per_socket(
> > +	uint64_t *memory, struct hugepage_info *hp_info,
> > +	struct hugepage_info *hp_used, unsigned int num_hp_info)
> > +{
> > +	unsigned int socket, j, i = 0;
> > +	unsigned int requested, available;
> > +	int total_num_pages = 0;
> > +	uint64_t remaining_mem, cur_mem;
> > +	uint64_t total_mem = internal_config.memory;
> > +
> > +	if (num_hp_info == 0)
> > +		return -1;
> > +
> > +	/* if specific memory amounts per socket weren't requested */
> > +	if (internal_config.force_sockets == 0) {
> > +		size_t total_size;
> > +		int cpu_per_socket[RTE_MAX_NUMA_NODES];
> > +		size_t default_size;
> > +		unsigned int lcore_id;  
> 
> Comparing code from eal_memory.c and this one, it seems like you've 
> dropped all 32-bit code from this function. Is that intentional?

No, it's a mistake.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-28 11:46         ` Burakov, Anatoly
@ 2020-05-28 14:41           ` Dmitry Kozlyuk
  2020-05-29  8:49             ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-05-28 14:41 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On Thu, 28 May 2020 12:46:49 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> > All supported OSes create memory segment lists (MSL) and reserve VA space
> > for them in a nearly identical way. Move common code into EAL private
> > functions to reduce duplication.
> > 
> > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > ---  
> 
> <snip>
> 
> > +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
> > +{
> > +	uint64_t page_sz;
> > +	size_t mem_sz;
> > +	void *addr;
> > +
> > +	page_sz = msl->page_sz;
> > +	mem_sz = page_sz * msl->memseg_arr.len;
> > +
> > +	addr = eal_get_virtual_area(
> > +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> > +	if (addr == NULL) {
> > +		if (rte_errno == EADDRNOTAVAIL)
> > +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> > +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> > +				(unsigned long long)mem_sz, msl->base_va);  
> 
> Do all OS's support this EAL option?

Supported, yes; meaningful, not quite: for Windows, we start with address 0
(let the OS choose), so using the option can hardly help. Probably Linux and
FreeBSD EALs should print this hint. For Windows, we can leave the option,
but not print a misleading hint.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization
  2020-05-28 14:41           ` Dmitry Kozlyuk
@ 2020-05-29  8:49             ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-05-29  8:49 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 28-May-20 3:41 PM, Dmitry Kozlyuk wrote:
> On Thu, 28 May 2020 12:46:49 +0100
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
> 
>> On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
>>> All supported OSes create memory segment lists (MSL) and reserve VA space
>>> for them in a nearly identical way. Move common code into EAL private
>>> functions to reduce duplication.
>>>
>>> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
>>> ---
>>
>> <snip>
>>
>>> +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
>>> +{
>>> +	uint64_t page_sz;
>>> +	size_t mem_sz;
>>> +	void *addr;
>>> +
>>> +	page_sz = msl->page_sz;
>>> +	mem_sz = page_sz * msl->memseg_arr.len;
>>> +
>>> +	addr = eal_get_virtual_area(
>>> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
>>> +	if (addr == NULL) {
>>> +		if (rte_errno == EADDRNOTAVAIL)
>>> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
>>> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
>>> +				(unsigned long long)mem_sz, msl->base_va);
>>
>> Do all OS's support this EAL option?
> 
> Supported, yes; meaningful, not quite: for Windows, we start with address 0
> (let the OS choose), so using the option can hardly help. Probably Linux and
> FreeBSD EALs should print this hint. For Windows, we can leave the option,
> but not print a misleading hint.
> 

We keep rte_errno when exiting this function, do we not? How about we
just check it in the caller? It's a bit of a hack, but it will avoid
OS-specific #ifdef-ery in the common code. Not sure if it's worth it
though, over just having an #ifdef :) Up to you!
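
[Editor's note] For illustration, the caller-side variant might look like
the sketch below, with each OS printing only the hints that apply to it:

	if (eal_memseg_list_alloc(msl, reserve_flags) < 0) {
		if (rte_errno == EADDRNOTAVAIL)
			RTE_LOG(ERR, EAL, "Cannot reserve VA space - please use '--"
				OPT_BASE_VIRTADDR "' option\n");
		return -1;
	}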

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers
  2020-05-28 11:26         ` Burakov, Anatoly
@ 2020-06-01 21:08           ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-01 21:08 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson, Ray Kinsella, Neil Horman,
	Burakov, Anatoly

28/05/2020 13:26, Burakov, Anatoly:
> On 25-May-20 1:37 AM, Dmitry Kozlyuk wrote:
> > +/**
> > + * Memory protection flags.
> > + */
> > +enum rte_mem_prot {
> > +	RTE_PROT_READ = 1 << 0,   /**< Read access. */
> > +	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
> > +	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
> > +};
> > +
> > +/**
> > + * Additional flags for memory mapping.
> > + */
> > +enum rte_map_flags {
> > +	/** Changes to the mapped memory are visible to other processes. */
> > +	RTE_MAP_SHARED = 1 << 0,
> > +	/** Mapping is not backed by a regular file. */
> > +	RTE_MAP_ANONYMOUS = 1 << 1,
> > +	/** Copy-on-write mapping, changes are invisible to other processes. */
> > +	RTE_MAP_PRIVATE = 1 << 2,
> > +	/**
> > +	 * Force mapping to the requested address. This flag should be used
> > +	 * with caution, because to fulfill the request implementation
> > +	 * may remove all other mappings in the requested region. However,
> > +	 * it is not required to do so, thus mapping with this flag may fail.
> > +	 */
> > +	RTE_MAP_FORCE_ADDRESS = 1 << 3
> > +};
> 
> I have no strong opinion on this, but it feels like the fact that these
> are enums is a relic from the times when you used enum everywhere :) I
> have a feeling that the DPDK codebase prefers #define's for this usage,
> while what you have here is more of a C++ thing.

The benefit of using an enum is to explicitly name the type
of the variables, serving documentation purpose.

+1 for the enums
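
[Editor's note] For illustration of that documentation value: even though
combined flags are passed around as int, the enum names the intended domain
at a glance:

	/* "a combination of rte_mem_prot values", per the rte_mem_map() docs */
	int prot = RTE_PROT_READ | RTE_PROT_WRITE;
	int flags = RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
	void *va = rte_mem_map(NULL, len, prot, flags, -1, 0);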



^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 00/11] Windows basic memory management
  2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
                         ` (10 preceding siblings ...)
  2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-06-02 23:03       ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
                           ` (11 more replies)
  11 siblings, 12 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing without IOVA is available.

Testing revealed Windows Server 2019 does not allow allocating hugepage
memory at a reserved address, despite the advertised API. So the allocator
has to temporarily free the region to be allocated. This creates an inherent
race condition. The issue is being discussed with Microsoft privately.

New EAL public functions for memory mapping are introduced to mitigate
OS differences in DPDK libraries and applications: rte_mem_map,
rte_mem_unmap, rte_mem_lock, rte_mem_page_size.

To support common MM routines, internal wrappers for low-level memory
reservation and file management are introduced. These changes affect
Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
(suggested by Thomas).

To avoid code duplication between Linux and Windows EAL, common code
for EALs supporting dynamic memory allocation is extracted
(discussed with Anatoly Burakov in v4 thread). This is a separate
patch to ease the review, but it can be merged with the previous one.

EAL tracepoints save size_t values as long, which is invalid on Windows.
A new size_t emitter for tracepoints is introduced (suggested by Jerin
Jacob to Fady Bader, see [1]). Also, to avoid a workaround in every file
using the tracepoints, stubs are added to Windows EAL.
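
[Editor's note] The emitter itself is presumably a one-liner mirroring the
existing integer emitters (exact macro spelling assumed here):

	#define rte_trace_point_emit_size_t(val) \
		__rte_trace_point_emit(val, size_t)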

The entire <sys/queue.h> is imported from FreeBSD, replacing the existing
partial import. There is already a license exception for this file.
The file is imported as-is, so it causes a bunch of checkpatch warnings.

[1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html

---

v6:
    * Fix 32-bit build on x86 (CI).
    * Fix Makefile build (Anatoly Burakov, Thomas Monjalon).
    * Restore 32-bit common code (Anatoly Burakov).
    * Fix error reporting in memory management (Anatoly Burakov).
    * Add Doxygen comment for size_t tracepoint emitter (Jerin Jacob).
    * Update MAINTAINERS for new files and new code (Thomas Monjalon).
    * Rename rte_get_page_size to rte_mem_page_size.
    * Mark DPDK-only wrappers internal, move them to separate file.
    * Get rid of warnings in enabled common code with Clang on Windows.

v5:
    * Fix allocation and deallocation on Windows Server (Fady Bader).
    * Replace remaining VirtualFree with VirtualFreeEx (Ranjit Menon).
    * Fix errors in eal_get_virtual_area (Anatoly Burakov).
    * Fix error handling and documentation for rte_mem_lock (Anatoly Burakov).
    * Extract common code for EALs w/dynamic allocation (Anatoly Burakov).
    * Use POSIX value for rte_errno after rte_mem_unmap() on Windows.
    * Add stubs to use tracing functions without workarounds.

v4:

    * Rebase on ToT, drop patches merged into master.
    * Rearrange patches to split Windows code (Jerin).
    * Fix Linux and FreeBSD build with make (Ophir).
    * Use int instead of enum to hold a set of flags (Anatoly).
    * Rename eal_mem_reserve items and fix their description (Anatoly).
    * Add eal_mem_set_dump() wrapper around madvise (Anatoly).
    * Don't claim Windows Server 2016 support due to lack of API (Tal).
    * Replace enum rte_page_sizes with a set of #defines (Jerin).
    * Fix documentation, SPDX tags, logging (Thomas).

v3:

    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:

    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.

Dmitry Kozlyuk (11):
  eal: replace rte_page_sizes with a set of constants
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal/mem: extract common code for memseg list initialization
  eal/mem: extract common code for dynamic memory allocation
  trace: add size_t field emitter
  eal/windows: add tracing support stubs
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal/windows: implement basic memory management

 MAINTAINERS                                   |   9 +
 config/meson.build                            |  12 +-
 doc/guides/rel_notes/release_20_08.rst        |   2 +
 doc/guides/windows_gsg/build_dpdk.rst         |  20 -
 doc/guides/windows_gsg/index.rst              |   1 +
 doc/guides/windows_gsg/run_apps.rst           |  95 +++
 lib/librte_eal/common/eal_common_dynmem.c     | 521 +++++++++++++
 lib/librte_eal/common/eal_common_fbarray.c    |  71 +-
 lib/librte_eal/common/eal_common_memory.c     | 158 +++-
 lib/librte_eal/common/eal_common_thread.c     |   5 +-
 lib/librte_eal/common/eal_private.h           | 255 ++++++-
 lib/librte_eal/common/meson.build             |  16 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/freebsd/Makefile               |   5 +
 lib/librte_eal/freebsd/eal_memory.c           |  98 +--
 lib/librte_eal/include/rte_eal_memory.h       |  93 +++
 lib/librte_eal/include/rte_eal_trace.h        |   8 +-
 lib/librte_eal/include/rte_memory.h           |  23 +-
 lib/librte_eal/include/rte_trace_point.h      |   3 +
 lib/librte_eal/linux/Makefile                 |   6 +
 lib/librte_eal/linux/eal_memalloc.c           |   5 +-
 lib/librte_eal/linux/eal_memory.c             | 614 +--------------
 lib/librte_eal/meson.build                    |   4 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/rte_eal_version.map            |   9 +
 lib/librte_eal/unix/eal_file.c                |  76 ++
 lib/librte_eal/unix/eal_unix_memory.c         | 152 ++++
 lib/librte_eal/unix/meson.build               |   7 +
 lib/librte_eal/windows/eal.c                  | 107 +++
 lib/librte_eal/windows/eal_file.c             | 123 +++
 lib/librte_eal/windows/eal_hugepages.c        | 108 +++
 lib/librte_eal/windows/eal_lcore.c            | 185 +++--
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  85 +++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/sys/queue.h    | 663 ++++++++++++++--
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   7 +
 lib/librte_mempool/rte_mempool_trace.h        |  10 +-
 44 files changed, 4048 insertions(+), 939 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c
 create mode 100644 lib/librte_eal/include/rte_eal_memory.h
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-03  1:59           ` Stephen Hemminger
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                           ` (10 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, John McNamara,
	Marko Kovacevic, Anatoly Burakov

Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
Enum rte_page_sizes has members valued above this limit, which get
wrapped to zero, resulting in compilation error (duplicate values in
enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.

Remove rte_page_sizes and replace its values with #define's.
This enumeration is not used in public API, so there's no ABI breakage.
Announce API changes for 20.08 in documentation.

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 doc/guides/rel_notes/release_20_08.rst |  2 ++
 lib/librte_eal/include/rte_memory.h    | 23 ++++++++++-------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index 39064afbe..2041a29b9 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -85,6 +85,8 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* ``rte_page_sizes`` enumeration is replaced with ``RTE_PGSIZE_xxx`` defines.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..65374d53a 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -24,19 +24,16 @@ extern "C" {
 #include <rte_config.h>
 #include <rte_fbarray.h>
 
-__extension__
-enum rte_page_sizes {
-	RTE_PGSIZE_4K    = 1ULL << 12,
-	RTE_PGSIZE_64K   = 1ULL << 16,
-	RTE_PGSIZE_256K  = 1ULL << 18,
-	RTE_PGSIZE_2M    = 1ULL << 21,
-	RTE_PGSIZE_16M   = 1ULL << 24,
-	RTE_PGSIZE_256M  = 1ULL << 28,
-	RTE_PGSIZE_512M  = 1ULL << 29,
-	RTE_PGSIZE_1G    = 1ULL << 30,
-	RTE_PGSIZE_4G    = 1ULL << 32,
-	RTE_PGSIZE_16G   = 1ULL << 34,
-};
+#define RTE_PGSIZE_4K   (1ULL << 12)
+#define RTE_PGSIZE_64K  (1ULL << 16)
+#define RTE_PGSIZE_256K (1ULL << 18)
+#define RTE_PGSIZE_2M   (1ULL << 21)
+#define RTE_PGSIZE_16M  (1ULL << 24)
+#define RTE_PGSIZE_256M (1ULL << 28)
+#define RTE_PGSIZE_512M (1ULL << 29)
+#define RTE_PGSIZE_1G   (1ULL << 30)
+#define RTE_PGSIZE_4G   (1ULL << 32)
+#define RTE_PGSIZE_16G  (1ULL << 34)
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
-- 
2.25.4
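
[Editor's note] To illustrate the wrap-around described in the commit
message, a standalone sketch of what fails to compile with Clang on Windows
(not part of the patch):

	/* Under MS ABI the underlying enum type is int (max 2^31 - 1), so: */
	enum demo_sizes {
		DEMO_4G  = 1ULL << 32, /* wraps to 0 */
		DEMO_16G = 1ULL << 34  /* also wraps to 0: duplicate enum value */
	};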


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-03 12:07           ` Neil Horman
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
                           ` (9 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Introduce OS-independent wrappers in order to support common EAL code
on Unix and Windows:

* eal_file_create: create a file or open it if it exists.
* eal_file_open: open an existing file.
* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

Implementation for Linux and FreeBSD is placed in the "unix" subdirectory,
which is intended for common code between the two. These thin wrappers
require no special maintenance.

Common code supporting multi-process doesn't use the new wrappers,
because it is inherently Unix-specific and would impose excessive
requirements on the wrappers.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                                |  3 +
 lib/librte_eal/common/eal_common_fbarray.c | 31 ++++-----
 lib/librte_eal/common/eal_private.h        | 74 +++++++++++++++++++++
 lib/librte_eal/freebsd/Makefile            |  4 ++
 lib/librte_eal/linux/Makefile              |  4 ++
 lib/librte_eal/meson.build                 |  4 ++
 lib/librte_eal/unix/eal_file.c             | 76 ++++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |  6 ++
 8 files changed, 183 insertions(+), 19 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/MAINTAINERS b/MAINTAINERS
index d2b286701..1d9aff26d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -323,6 +323,9 @@ FreeBSD UIO
 M: Bruce Richardson <bruce.richardson@intel.com>
 F: kernel/freebsd/nic_uio/
 
+Unix shared files
+F: lib/librte_eal/unix/
+
 Windows support
 M: Harini Ramakrishnan <harini.ramakrishnan@microsoft.com>
 M: Omar Cardona <ocardona@microsoft.com>
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 4f8f1af73..81ce4bd42 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -8,8 +8,8 @@
 #include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -85,10 +85,8 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 
@@ -772,15 +770,15 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * and see if we succeed. If we don't, someone else is using it
 		 * already.
 		 */
-		fd = open(path, O_CREAT | O_RDWR, 0600);
+		fd = eal_file_create(path);
 		if (fd < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't open %s: %s\n",
-					__func__, path, strerror(errno));
-			rte_errno = errno;
+				__func__, path, rte_strerror(rte_errno));
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
-					__func__, path, strerror(errno));
+				__func__, path, rte_strerror(rte_errno));
 			rte_errno = EBUSY;
 			goto fail;
 		}
@@ -789,10 +787,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -888,17 +884,14 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 
 	eal_get_fbarray_path(path, sizeof(path), arr->name);
 
-	fd = open(path, O_RDWR);
+	fd = eal_file_open(path, true);
 	if (fd < 0) {
-		rte_errno = errno;
 		goto fail;
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -1025,7 +1018,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,7 +1035,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 869ce183a..727f26881 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -420,4 +420,78 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
 
 void eal_free_no_trace(void *addr);
 
+/**
> + * Create a file or open it if it exists.
+ *
+ * Newly created file is only accessible to the owner (0600 equivalent).
+ * Returned descriptor is always read/write.
+ *
+ * @param path
+ *  Path to the file.
+ * @return
+ *  Open file descriptor on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_file_create(const char *path);
+
+/**
+ * Open an existing file.
+ *
+ * @param path
+ *  Path to the file.
+ * @param writable
+ *  Whether to open file read/write or read-only.
+ * @return
+ *  Open file descriptor on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_file_open(const char *path, bool writable);
+
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index af95386d4..0f8741d96 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 48cc34844..331489f99 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index e301f4558..8d492897d 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal_file.c b/lib/librte_eal/unix/eal_file.c
new file mode 100644
index 000000000..7b3ffa629
--- /dev/null
+++ b/lib/librte_eal/unix/eal_file.c
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_create(const char *path)
+{
+	int ret;
+
+	ret = open(path, O_CREAT | O_RDWR, 0600);
+	if (ret < 0)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_open(const char *path, bool writable)
+{
+	int ret, flags;
+
+	flags = writable ? O_RDWR : O_RDONLY;
+	ret = open(path, flags);
+	if (ret < 0)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
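
As a usage note: with EAL_FLOCK_RETURN the Unix implementation passes
LOCK_NB to flock(), so a lock conflict surfaces as EWOULDBLOCK in
rte_errno. A hypothetical non-blocking probe:

	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN) < 0 &&
			rte_errno == EWOULDBLOCK) {
		/* another process holds an exclusive lock */
	}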
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..21029ba1a
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal_file.c',
+)
-- 
2.25.4



* [dpdk-dev] [PATCH v6 03/11] eal: introduce memory management wrappers
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
                           ` (8 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson, Ray Kinsella, Neil Horman

Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:

* rte_mem_map()
* rte_mem_unmap()
* rte_mem_page_size()
* rte_mem_lock()

Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive. New symbols are internal. Being thin wrappers, they require
no special maintenance.
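
For example, common code that previously called mmap()/munmap()
directly can be written through the wrappers (illustrative sketch
only):

	void *va = rte_mem_map(NULL, size, RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_SHARED, fd, 0);
	if (va == NULL)
		return -1; /* rte_errno holds the OS error code */
	/* ... use the mapping ... */
	rte_mem_unmap(va, size);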

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c |  40 +++---
 lib/librte_eal/common/eal_common_memory.c  |  61 ++++-----
 lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
 lib/librte_eal/freebsd/Makefile            |   1 +
 lib/librte_eal/include/rte_eal_memory.h    |  93 +++++++++++++
 lib/librte_eal/linux/Makefile              |   1 +
 lib/librte_eal/linux/eal_memalloc.c        |   5 +-
 lib/librte_eal/rte_eal_version.map         |   9 ++
 lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |   1 +
 10 files changed, 376 insertions(+), 65 deletions(-)
 create mode 100644 lib/librte_eal/include/rte_eal_memory.h
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 81ce4bd42..790d3ccba 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,16 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
 #include <string.h>
 #include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
+#include <rte_eal_memory.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -90,12 +91,9 @@ resize_and_map(int fd, void *addr, size_t len)
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -733,7 +731,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -754,11 +752,13 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		static const int flags = RTE_MAP_FORCE_ADDRESS |
+			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
+		void *new_data = rte_mem_map(data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
-					__func__, strerror(errno));
+					__func__, rte_strerror(rte_errno));
 			goto fail;
 		}
 	} else {
@@ -820,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -858,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -909,7 +909,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -937,8 +937,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -957,7 +956,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -992,8 +991,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1042,7 +1040,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4c897a13f..f9fbd3e4e 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,13 +11,13 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_eal_memconfig.h>
+#include <rte_eal_memory.h>
 #include <rte_errno.h>
 #include <rte_log.h>
 
@@ -40,18 +40,10 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
-#ifdef RTE_EXEC_ENV_LINUX
-#define RTE_DONTDUMP MADV_DONTDUMP
-#elif defined RTE_EXEC_ENV_FREEBSD
-#define RTE_DONTDUMP MADV_NOCORE
-#else
-#error "madvise doesn't support this OS"
-#endif
-
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, int reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -59,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_mem_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -105,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
 			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -132,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -161,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -179,19 +166,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	if (!unmap) {
 		/* Exclude these pages from a core dump. */
-		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
-			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
-				strerror(errno));
+		eal_mem_set_dump(aligned_addr, *size, false);
 	}
 
 	return aligned_addr;
@@ -547,10 +532,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	size_t page_size = rte_mem_page_size();
+	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 727f26881..846236648 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/**
+	 * Reserve hugepages. May be unsupported by some platforms.
+	 */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/**
+	 * Force reserving memory at the requested address.
+	 * This can be a destructive action depending on the implementation.
+	 *
+	 * @see RTE_MAP_FORCE_ADDRESS for a description of possible consequences
+	 *      (although implementations are not required to use it).
+	 */
+	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to eal_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+		size_t page_sz, int flags, int reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -494,4 +513,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
 int
 eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address, which must be page-aligned.
+ *  The system might not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options, a combination of eal_mem_reserve_flags.
+ * @return
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If *virt* and *size* describe a part of the reserved region,
+ * only this part of the region is freed (accurately up to the system
+ * page size). If *virt* points to allocated memory, *size* must match
+ * the one specified on allocation. The behavior is undefined
+ * if the memory pointed by *virt* is obtained from another source
+ * than listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void
+eal_mem_free(void *virt, size_t size);
+
+/**
+ * Configure memory region inclusion into core dumps.
+ *
+ * @param virt
+ *  Starting address of the region.
+ * @param size
+ *  Size of the region.
+ * @param dump
+ *  True to include memory into core dumps, false to exclude.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump);
+
 #endif /* _EAL_PRIVATE_H_ */
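
A hypothetical caller of the reservation API (not part of the patch)
might combine these functions as follows:

	size_t len = 4 * rte_mem_page_size();
	void *va = eal_mem_reserve(NULL, len, 0);

	if (va == NULL)
		return -1; /* rte_errno is set */
	eal_mem_set_dump(va, len, false); /* exclude from core dumps */
	/* ... remap or commit the region here ... */
	eal_mem_free(va, len);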
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index 0f8741d96..2374ba0b7 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
diff --git a/lib/librte_eal/include/rte_eal_memory.h b/lib/librte_eal/include/rte_eal_memory.h
new file mode 100644
index 000000000..0c5ef309d
--- /dev/null
+++ b/lib/librte_eal/include/rte_eal_memory.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+/** @file Memory management wrappers used across DPDK. */
+
+/** Memory protection flags. */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/** Additional flags for memory mapping. */
+enum rte_map_flags {
+	/** Changes to the mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/**
+	 * Force mapping to the requested address. This flag should be used
+	 * with caution, because to fulfill the request the implementation
+	 * may remove all other mappings in the requested region. However,
+	 * it is not required to do so, thus mapping with this flag may fail.
+	 */
+	RTE_MAP_FORCE_ADDRESS = 1 << 3
+};
+
+/**
+ * Map a portion of an opened file or the page file into memory.
+ *
+ * This function is similar to POSIX mmap(3) with the common MAP_ANONYMOUS
+ * extension, except for the return value.
+ *
+ * @param requested_addr
+ *  Desired virtual address for mapping. Can be NULL to let the OS choose.
+ * @param size
+ *  Size of the mapping in bytes.
+ * @param prot
+ *  Protection flags, a combination of rte_mem_prot values.
+ * @param flags
+ *  Additional mapping flags, a combination of rte_map_flags.
+ * @param fd
+ *  Mapped file descriptor. Can be negative for anonymous mapping.
+ * @param offset
+ *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
+ * @return
+ *  Mapped address on success, or NULL on failure,
+ *  in which case rte_errno is set to the OS error.
+ */
+__rte_internal
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_internal
+int
+rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Page size in bytes.
+ */
+__rte_internal
+size_t
+rte_mem_page_size(void);
+
+/**
+ * Lock in physical memory all pages crossed by the address region.
+ *
+ * @param virt
+ *   Base virtual address of the region.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @see rte_mem_page_size() to retrieve the page size.
+ * @see rte_mem_lock_page() to lock an entire single page.
+ */
+__rte_internal
+int
+rte_mem_lock(const void *virt, size_t size);
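
For instance, locking every page crossed by an arbitrary buffer
combines rte_mem_page_size() and rte_mem_lock(); in this sketch,
buf and buf_len are placeholders:

	size_t page_sz = rte_mem_page_size();
	uintptr_t start = RTE_PTR_ALIGN_FLOOR((uintptr_t)buf, page_sz);
	uintptr_t end = RTE_PTR_ALIGN_CEIL((uintptr_t)buf + buf_len,
			page_sz);

	if (rte_mem_lock((void *)start, end - start) < 0)
		return -1; /* rte_errno is set */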
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 331489f99..8febf2212 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index 2c717f8bd..bf29b83c6 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 mapped:
 	munmap(addr, alloc_sz);
 unmapped:
-	flags = MAP_FIXED;
+	flags = EAL_RESERVE_FORCE_ADDRESS;
 	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
 	if (new_addr != addr) {
 		if (new_addr != NULL)
@@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
-		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
+	eal_mem_set_dump(ms->addr, ms->len, false);
 
 	exit_early = false;
 
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d8038749a..196eef5af 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -387,3 +387,12 @@ EXPERIMENTAL {
 	rte_trace_regexp;
 	rte_trace_save;
 };
+
+INTERNAL {
+	global:
+
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_page_size;
+	rte_mem_unmap;
+};
diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
new file mode 100644
index 000000000..4dd891667
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix_memory.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_eal_memory.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+
+#include "eal_private.h"
+
+#ifdef RTE_EXEC_ENV_LINUX
+#define EAL_DONTDUMP MADV_DONTDUMP
+#define EAL_DODUMP   MADV_DODUMP
+#elif defined RTE_EXEC_ENV_FREEBSD
+#define EAL_DONTDUMP MADV_NOCORE
+#define EAL_DODUMP   MADV_CORE
+#else
+#error "madvise doesn't support this OS"
+#endif
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+#ifdef MAP_HUGETLB
+		sys_flags |= MAP_HUGETLB;
+#else
+		rte_errno = ENOTSUP;
+		return NULL;
+#endif
+	}
+
+	if (flags & EAL_RESERVE_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
+	int ret = madvise(virt, size, flags);
+	if (ret) {
+		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
+				virt, size, flags, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+static int
+mem_rte_to_sys_prot(int prot)
+{
+	int sys_prot = PROT_NONE;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	int sys_flags = 0;
+	int sys_prot;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static size_t page_size;
+
+	if (!page_size)
+		page_size = sysconf(_SC_PAGESIZE);
+
+	return page_size;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	int ret = mlock(virt, size);
+	if (ret)
+		rte_errno = errno;
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index 21029ba1a..e733910a1 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal_file.c',
+	'eal_unix_memory.c',
 )
-- 
2.25.4



* [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (2 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-09 13:36           ` Burakov, Anatoly
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
                           ` (7 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson

All supported OS create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
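
The shared sequence these helpers capture is, roughly (sketch assuming
an initialized mcfg, page_sz and n_segs, as in the no-huge path):

	struct rte_memseg_list *msl = &mcfg->memsegs[0];

	if (eal_memseg_list_init_named(msl, "nohugemem", page_sz,
			n_segs, 0, true))
		return -1;
	if (eal_memseg_list_alloc(msl, 0)) /* reserve VA space */
		return -1;
	/* back msl->base_va with actual pages, then: */
	eal_memseg_list_populate(msl, msl->base_va, n_segs);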

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c |  97 ++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  62 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       |  94 ++++--------------
 lib/librte_eal/linux/eal_memory.c         | 115 +++++-----------------
 4 files changed, 200 insertions(+), 168 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index f9fbd3e4e..3325d8c35 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -25,6 +25,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -182,6 +183,102 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+		uint64_t page_sz, int n_segs, int socket_id, bool heap)
+{
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL,
+		"Memseg list allocated at socket %i, page size 0x%zxkB\n",
+		socket_id, (size_t)page_sz >> 10);
+
+	return 0;
+}
+
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+
+	return eal_memseg_list_init_named(
+		msl, name, page_sz, n_segs, socket_id, heap);
+}
+
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
+{
+	size_t page_sz, mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(
+		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
+	if (addr == NULL) {
+#ifndef RTE_EXEC_ENV_WINDOWS
+		/* The hint would be misleading on Windows, but this function
+		 * is called from many places, including common code,
+		 * so don't duplicate the message.
+		 */
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+#endif
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	RTE_LOG(DEBUG, EAL, "VA reserved for memseg list at %p, size %zx\n",
+			addr, mem_sz);
+
+	return 0;
+}
+
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
+{
+	size_t page_sz = msl->page_sz;
+	int i;
+
+	for (i = 0; i < n_segs; i++) {
+		struct rte_fbarray *arr = &msl->memseg_arr;
+		struct rte_memseg *ms = rte_fbarray_get(arr, i);
+
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			ms->iova = (uintptr_t)addr;
+		else
+			ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, i);
+
+		addr = RTE_PTR_ADD(addr, page_sz);
+	}
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 846236648..c698ffbaf 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -254,6 +254,68 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int reserve_flags);
 
+/**
+ * Initialize a memory segment list and create its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param name
+ *  Name for the backing storage.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+	uint64_t page_sz, int n_segs, int socket_id, bool heap);
+
+/**
+ * Initialize memory segment list and create its backing storage
+ * with a name corresponding to MSL parameters.
+ *
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ *
+ * @see eal_memseg_list_init_named for remaining parameters description.
+ */
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
+/**
+ * Reserve VA space for a memory segment list
+ * previously initialized with eal_memseg_list_init().
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param reserve_flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
+
+/**
+ * Populate MSL, each segment is one page long.
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param addr
+ *  Starting address of list segments.
+ * @param n_segs
+ *  Number of segments to populate.
+ */
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5bc2da160..29c3ed5a9 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -66,53 +66,34 @@ rte_eal_hugepage_init(void)
 		struct rte_memseg_list *msl;
 		struct rte_fbarray *arr;
 		struct rte_memseg *ms;
-		uint64_t page_sz;
+		uint64_t mem_sz, page_sz;
 		int n_segs, cur_seg;
 
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-				sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
-		addr = mmap(NULL, internal_config.memory,
-				PROT_READ | PROT_WRITE,
+		addr = mmap(NULL, mem_sz, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (addr == MAP_FAILED) {
 			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
 					strerror(errno));
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->len = internal_config.memory;
-		msl->socket_id = 0;
-		msl->heap = 1;
-
-		/* populate memsegs. each memseg is 1 page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->len = page_sz;
-			ms->socket_id = 0;
+		msl->base_va = addr;
+		msl->len = mem_sz;
 
-			rte_fbarray_set_used(arr, cur_seg);
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			addr = RTE_PTR_ADD(addr, page_sz);
-		}
 		return 0;
 	}
 
@@ -336,64 +317,25 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
 	int flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
@@ -479,7 +421,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_init(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
@@ -487,7 +429,7 @@ memseg_primary_init(void)
 			total_type_mem = total_segs * hugepage_sz;
 			type_msl_idx++;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				return -1;
 			}
@@ -518,7 +460,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..8b5fe613e 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, 0);
 }
 
 /*
@@ -1009,12 +969,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_init(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_alloc(msl) < 0)
 				return -1;
 		}
 	}
@@ -1323,8 +1283,6 @@ eal_legacy_hugepage_init(void)
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
 	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	struct rte_fbarray *arr;
-	struct rte_memseg *ms;
 
 	uint64_t memory[RTE_MAX_NUMA_NODES];
 
@@ -1343,7 +1301,7 @@ eal_legacy_hugepage_init(void)
 		void *prealloc_addr;
 		size_t mem_sz;
 		struct rte_memseg_list *msl;
-		int n_segs, cur_seg, fd, flags;
+		int n_segs, fd, flags;
 #ifdef MEMFD_SUPPORTED
 		int memfd;
 #endif
@@ -1358,12 +1316,12 @@ eal_legacy_hugepage_init(void)
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-					sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
@@ -1400,16 +1358,10 @@ eal_legacy_hugepage_init(void)
 		/* preallocate address space for the memory, so that it can be
 		 * fit into the DMA mask.
 		 */
-		mem_sz = internal_config.memory;
-		prealloc_addr = eal_get_virtual_area(
-				NULL, &mem_sz, page_sz, 0, 0);
-		if (prealloc_addr == NULL) {
-			RTE_LOG(ERR, EAL,
-					"%s: reserving memory area failed: "
-					"%s\n",
-					__func__, strerror(errno));
+		if (eal_memseg_list_alloc(msl, 0))
 			return -1;
-		}
+
+		prealloc_addr = msl->base_va;
 		addr = mmap(prealloc_addr, mem_sz, PROT_READ | PROT_WRITE,
 				flags | MAP_FIXED, fd, 0);
 		if (addr == MAP_FAILED || addr != prealloc_addr) {
@@ -1418,11 +1370,6 @@ eal_legacy_hugepage_init(void)
 			munmap(prealloc_addr, mem_sz);
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->socket_id = 0;
-		msl->len = mem_sz;
-		msl->heap = 1;
 
 		/* we're in single-file segments mode, so only the segment list
 		 * fd needs to be set up.
@@ -1434,24 +1381,8 @@ eal_legacy_hugepage_init(void)
 			}
 		}
 
-		/* populate memsegs. each memseg is one page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->socket_id = 0;
-			ms->len = page_sz;
-
-			rte_fbarray_set_used(arr, cur_seg);
-
-			addr = RTE_PTR_ADD(addr, (size_t)page_sz);
-		}
 		if (mcfg->dma_maskbits &&
 		    rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
 			RTE_LOG(ERR, EAL,
@@ -2191,7 +2122,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_init(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2131,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_alloc(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2326,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_init(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2364,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4



* [dpdk-dev] [PATCH v6 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (3 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 06/11] trace: add size_t field emitter Dmitry Kozlyuk
                           ` (6 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Code in Linux EAL that supports dynamic memory allocation (as opposed to
static allocation used by FreeBSD) is not OS-dependent and can be reused
by Windows EAL. Move such code to a file compiled only for the OSes
that require it. Keep Anatoly Burakov as maintainer of the extracted
code.
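
A hypothetical Windows-side caller (names illustrative, not part of
this patch) would then reduce to thin forwarders:

	int
	rte_eal_memseg_init(void)
	{
		return eal_dynmem_memseg_lists_init();
	}

	int
	rte_eal_hugepage_init(void)
	{
		/* dynamic allocation path shared with Linux */
		return eal_dynmem_hugepage_init();
	}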

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                               |   1 +
 lib/librte_eal/common/eal_common_dynmem.c | 521 +++++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  43 +-
 lib/librte_eal/common/meson.build         |   4 +
 lib/librte_eal/freebsd/eal_memory.c       |  12 +-
 lib/librte_eal/linux/Makefile             |   1 +
 lib/librte_eal/linux/eal_memory.c         | 523 +---------------------
 7 files changed, 582 insertions(+), 523 deletions(-)
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d9aff26d..a1722ca73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -208,6 +208,7 @@ F: lib/librte_eal/include/rte_fbarray.h
 F: lib/librte_eal/include/rte_mem*
 F: lib/librte_eal/include/rte_malloc.h
 F: lib/librte_eal/common/*malloc*
+F: lib/librte_eal/common/eal_common_dynmem.c
 F: lib/librte_eal/common/eal_common_fbarray.c
 F: lib/librte_eal/common/eal_common_mem*
 F: lib/librte_eal/common/eal_hugepages.h
diff --git a/lib/librte_eal/common/eal_common_dynmem.c b/lib/librte_eal/common/eal_common_dynmem.c
new file mode 100644
index 000000000..6b07672d0
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_dynmem.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation.
+ * Copyright(c) 2013 6WIND S.A.
+ */
+
+#include <inttypes.h>
+#include <string.h>
+
+#include <rte_log.h>
+#include <rte_string_fns.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+
+/** @file Functions common to EALs that support dynamic memory allocation. */
+
+int
+eal_dynmem_memseg_lists_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
+			/* we can still sort pages by socket in legacy mode */
+			if (!internal_config.legacy_mem && socket_id > 0)
+				break;
+#endif
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how much segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (eal_memseg_list_init(msl, pagesz, n_segs,
+					socket_id, cur_seglist, true))
+				goto out;
+
+			if (eal_memseg_list_alloc(msl, 0)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int __rte_unused
+hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct hugepage_info *hpi = arg;
+
+	if (msl->page_sz != hpi->hugepage_sz)
+		return 0;
+
+	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
+	return 0;
+}
+
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+int
+eal_dynmem_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+#ifndef RTE_ARCH_64
+		struct hugepage_info dummy;
+		unsigned int i;
+#endif
+		/* also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+
+#ifndef RTE_ARCH_64
+		/* for 32-bit, limit number of pages on socket to whatever we've
+		 * preallocated, as we cannot allocate more.
+		 */
+		memset(&dummy, 0, sizeof(dummy));
+		dummy.hugepage_sz = hpi->hugepage_sz;
+		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
+			return -1;
+
+		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
+			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
+					dummy.num_pages[i]);
+		}
+#endif
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
+		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
+
+	/* calculate final number of pages */
+	if (eal_dynmem_calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M "
+				"on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+__rte_unused /* function is unused on 32-bit builds */
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+int
+eal_dynmem_calc_num_pages_per_socket(
+	uint64_t *memory, struct hugepage_info *hp_info,
+	struct hugepage_info *hp_used, unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+#ifdef RTE_ARCH_64
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from CPU mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+#else
+		/* in 32-bit mode, allocate all of the memory only on master
+		 * lcore socket
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			struct rte_config *cfg = rte_eal_get_configuration();
+			unsigned int master_lcore_socket;
+
+			master_lcore_socket =
+				rte_lcore_to_socket_id(cfg->master_lcore);
+
+			if (master_lcore_socket != socket)
+				continue;
+
+			/* Update sizes */
+			memory[socket] = total_size;
+			break;
+		}
+#endif
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skip if memory on this socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			rte_strscpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* If all pages of this size have been used,
+			 * move on to the next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+			/* At this point we know that pages of this size
+			 * remain but each is bigger than the memory still
+			 * wanted, so let's see if we can get enough from
+			 * other page sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int)(internal_config.memory / 0x100000);
+		available = requested - (unsigned int)(total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index c698ffbaf..290262ff5 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -13,6 +13,8 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+
 /**
  * Structure storing internal configuration (per-lcore)
  */
@@ -316,6 +318,45 @@ eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
 void
 eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
 
+/**
+ * Distribute available memory among memseg lists (MSLs).
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_memseg_lists_init(void);
+
+/**
+ * Preallocate hugepages for dynamic allocation.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_hugepage_init(void);
+
+/**
+ * Given the list of hugepage sizes and the number of pages thereof,
+ * calculate the best number of pages of each size to fulfill the request
+ * for RAM on each NUMA node.
+ *
+ * @param memory
+ *  Amounts of memory requested for each of the RTE_MAX_NUMA_NODES nodes.
+ * @param hp_info
+ *  Information about hugepages of different sizes.
+ * @param hp_used
+ *  Receives information about used hugepages of each size.
+ * @param num_hp_info
+ *  Number of elements in hp_info and hp_used.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_calc_num_pages_per_socket(
+		uint64_t *memory, struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used, unsigned int num_hp_info);
+
 /**
  * Get cpu core_id.
  *
@@ -596,7 +637,7 @@ void *
 eal_mem_reserve(void *requested_addr, size_t size, int flags);
 
 /**
- * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ * Free memory obtained by eal_mem_reserve(), whether or not it has
+ * since been allocated.
  *
  * If *virt* and *size* describe a part of the reserved region,
  * only this part of the region is freed (accurately up to the system
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 55aaeb18e..d91c22220 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -56,3 +56,7 @@ sources += files(
 	'rte_reciprocal.c',
 	'rte_service.c',
 )
+
+if is_linux
+	sources += files('eal_common_dynmem.c')
+endif
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 29c3ed5a9..7106b8b84 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -317,14 +317,6 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
-}
-
 static int
 memseg_list_alloc(struct rte_memseg_list *msl)
 {
@@ -421,8 +413,8 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (memseg_list_init(msl, hugepage_sz, n_segs,
-					0, type_msl_idx))
+			if (eal_memseg_list_init(msl, hugepage_sz, n_segs,
+					0, type_msl_idx, false))
 				return -1;
 
 			total_segs += msl->memseg_arr.len;
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 8febf2212..07ce643ba 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -50,6 +50,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_timer.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memzone.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_launch.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_dynmem.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_mcfg.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memalloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memory.c
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 8b5fe613e..12d72f726 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -812,20 +812,6 @@ memseg_list_free(struct rte_memseg_list *msl)
 	return 0;
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
-}
-
-static int
-memseg_list_alloc(struct rte_memseg_list *msl)
-{
-	return eal_memseg_list_alloc(msl, 0);
-}
-
 /*
  * Our VA space is not preallocated yet, so preallocate it here. We need to know
  * how many segments there are in order to map all pages into one address space,
@@ -969,12 +955,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (memseg_list_init(msl, page_sz, n_segs, socket,
-						msl_idx) < 0)
+			if (eal_memseg_list_init(msl, page_sz, n_segs,
+					socket, msl_idx, true) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (memseg_list_alloc(msl) < 0)
+			if (eal_memseg_list_alloc(msl, 0) < 0)
 				return -1;
 		}
 	}
@@ -1045,182 +1031,6 @@ remap_needed_hugepages(struct hugepage_file *hugepages, int n_pages)
 	return 0;
 }
 
-__rte_unused /* function is unused on 32-bit builds */
-static inline uint64_t
-get_socket_mem_size(int socket)
-{
-	uint64_t size = 0;
-	unsigned i;
-
-	for (i = 0; i < internal_config.num_hugepage_sizes; i++){
-		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		size += hpi->hugepage_sz * hpi->num_pages[socket];
-	}
-
-	return size;
-}
-
-/*
- * This function is a NUMA-aware equivalent of calc_num_pages.
- * It takes in the list of hugepage sizes and the
- * number of pages thereof, and calculates the best number of
- * pages of each size to fulfill the request for <memory> ram
- */
-static int
-calc_num_pages_per_socket(uint64_t * memory,
-		struct hugepage_info *hp_info,
-		struct hugepage_info *hp_used,
-		unsigned num_hp_info)
-{
-	unsigned socket, j, i = 0;
-	unsigned requested, available;
-	int total_num_pages = 0;
-	uint64_t remaining_mem, cur_mem;
-	uint64_t total_mem = internal_config.memory;
-
-	if (num_hp_info == 0)
-		return -1;
-
-	/* if specific memory amounts per socket weren't requested */
-	if (internal_config.force_sockets == 0) {
-		size_t total_size;
-#ifdef RTE_ARCH_64
-		int cpu_per_socket[RTE_MAX_NUMA_NODES];
-		size_t default_size;
-		unsigned lcore_id;
-
-		/* Compute number of cores per socket */
-		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
-		RTE_LCORE_FOREACH(lcore_id) {
-			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
-		}
-
-		/*
-		 * Automatically spread requested memory amongst detected sockets according
-		 * to number of cores from cpu mask present on each socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-
-			/* Set memory amount per socket */
-			default_size = (internal_config.memory * cpu_per_socket[socket])
-					/ rte_lcore_count();
-
-			/* Limit to maximum available memory on socket */
-			default_size = RTE_MIN(default_size, get_socket_mem_size(socket));
-
-			/* Update sizes */
-			memory[socket] = default_size;
-			total_size -= default_size;
-		}
-
-		/*
-		 * If some memory is remaining, try to allocate it by getting all
-		 * available memory from sockets, one after the other
-		 */
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-			/* take whatever is available */
-			default_size = RTE_MIN(get_socket_mem_size(socket) - memory[socket],
-					       total_size);
-
-			/* Update sizes */
-			memory[socket] += default_size;
-			total_size -= default_size;
-		}
-#else
-		/* in 32-bit mode, allocate all of the memory only on master
-		 * lcore socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
-				socket++) {
-			struct rte_config *cfg = rte_eal_get_configuration();
-			unsigned int master_lcore_socket;
-
-			master_lcore_socket =
-				rte_lcore_to_socket_id(cfg->master_lcore);
-
-			if (master_lcore_socket != socket)
-				continue;
-
-			/* Update sizes */
-			memory[socket] = total_size;
-			break;
-		}
-#endif
-	}
-
-	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0; socket++) {
-		/* skips if the memory on specific socket wasn't requested */
-		for (i = 0; i < num_hp_info && memory[socket] != 0; i++){
-			strlcpy(hp_used[i].hugedir, hp_info[i].hugedir,
-				sizeof(hp_used[i].hugedir));
-			hp_used[i].num_pages[socket] = RTE_MIN(
-					memory[socket] / hp_info[i].hugepage_sz,
-					hp_info[i].num_pages[socket]);
-
-			cur_mem = hp_used[i].num_pages[socket] *
-					hp_used[i].hugepage_sz;
-
-			memory[socket] -= cur_mem;
-			total_mem -= cur_mem;
-
-			total_num_pages += hp_used[i].num_pages[socket];
-
-			/* check if we have met all memory requests */
-			if (memory[socket] == 0)
-				break;
-
-			/* check if we have any more pages left at this size, if so
-			 * move on to next size */
-			if (hp_used[i].num_pages[socket] == hp_info[i].num_pages[socket])
-				continue;
-			/* At this point we know that there are more pages available that are
-			 * bigger than the memory we want, so lets see if we can get enough
-			 * from other page sizes.
-			 */
-			remaining_mem = 0;
-			for (j = i+1; j < num_hp_info; j++)
-				remaining_mem += hp_info[j].hugepage_sz *
-				hp_info[j].num_pages[socket];
-
-			/* is there enough other memory, if not allocate another page and quit */
-			if (remaining_mem < memory[socket]){
-				cur_mem = RTE_MIN(memory[socket],
-						hp_info[i].hugepage_sz);
-				memory[socket] -= cur_mem;
-				total_mem -= cur_mem;
-				hp_used[i].num_pages[socket]++;
-				total_num_pages++;
-				break; /* we are done with this socket*/
-			}
-		}
-		/* if we didn't satisfy all memory requirements per socket */
-		if (memory[socket] > 0 &&
-				internal_config.socket_mem[socket] != 0) {
-			/* to prevent icc errors */
-			requested = (unsigned) (internal_config.socket_mem[socket] /
-					0x100000);
-			available = requested -
-					((unsigned) (memory[socket] / 0x100000));
-			RTE_LOG(ERR, EAL, "Not enough memory available on socket %u! "
-					"Requested: %uMB, available: %uMB\n", socket,
-					requested, available);
-			return -1;
-		}
-	}
-
-	/* if we didn't satisfy total memory requirements */
-	if (total_mem > 0) {
-		requested = (unsigned) (internal_config.memory / 0x100000);
-		available = requested - (unsigned) (total_mem / 0x100000);
-		RTE_LOG(ERR, EAL, "Not enough memory available! Requested: %uMB,"
-				" available: %uMB\n", requested, available);
-		return -1;
-	}
-	return total_num_pages;
-}
-
 static inline size_t
 eal_get_hugepage_mem_size(void)
 {
@@ -1524,7 +1334,7 @@ eal_legacy_hugepage_init(void)
 		memory[i] = internal_config.socket_mem[i];
 
 	/* calculate final number of pages */
-	nr_hugepages = calc_num_pages_per_socket(memory,
+	nr_hugepages = eal_dynmem_calc_num_pages_per_socket(memory,
 			internal_config.hugepage_info, used_hp,
 			internal_config.num_hugepage_sizes);
 
@@ -1651,140 +1461,6 @@ eal_legacy_hugepage_init(void)
 	return -1;
 }
 
-static int __rte_unused
-hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
-{
-	struct hugepage_info *hpi = arg;
-
-	if (msl->page_sz != hpi->hugepage_sz)
-		return 0;
-
-	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
-	return 0;
-}
-
-static int
-limits_callback(int socket_id, size_t cur_limit, size_t new_len)
-{
-	RTE_SET_USED(socket_id);
-	RTE_SET_USED(cur_limit);
-	RTE_SET_USED(new_len);
-	return -1;
-}
-
-static int
-eal_hugepage_init(void)
-{
-	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	uint64_t memory[RTE_MAX_NUMA_NODES];
-	int hp_sz_idx, socket_id;
-
-	memset(used_hp, 0, sizeof(used_hp));
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-#ifndef RTE_ARCH_64
-		struct hugepage_info dummy;
-		unsigned int i;
-#endif
-		/* also initialize used_hp hugepage sizes in used_hp */
-		struct hugepage_info *hpi;
-		hpi = &internal_config.hugepage_info[hp_sz_idx];
-		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
-
-#ifndef RTE_ARCH_64
-		/* for 32-bit, limit number of pages on socket to whatever we've
-		 * preallocated, as we cannot allocate more.
-		 */
-		memset(&dummy, 0, sizeof(dummy));
-		dummy.hugepage_sz = hpi->hugepage_sz;
-		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
-			return -1;
-
-		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
-			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
-					dummy.num_pages[i]);
-		}
-#endif
-	}
-
-	/* make a copy of socket_mem, needed for balanced allocation. */
-	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
-		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
-
-	/* calculate final number of pages */
-	if (calc_num_pages_per_socket(memory,
-			internal_config.hugepage_info, used_hp,
-			internal_config.num_hugepage_sizes) < 0)
-		return -1;
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
-				socket_id++) {
-			struct rte_memseg **pages;
-			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
-			unsigned int num_pages = hpi->num_pages[socket_id];
-			unsigned int num_pages_alloc;
-
-			if (num_pages == 0)
-				continue;
-
-			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %" PRIu64 "M on socket %i\n",
-				num_pages, hpi->hugepage_sz >> 20, socket_id);
-
-			/* we may not be able to allocate all pages in one go,
-			 * because we break up our memory map into multiple
-			 * memseg lists. therefore, try allocating multiple
-			 * times and see if we can get the desired number of
-			 * pages from multiple allocations.
-			 */
-
-			num_pages_alloc = 0;
-			do {
-				int i, cur_pages, needed;
-
-				needed = num_pages - num_pages_alloc;
-
-				pages = malloc(sizeof(*pages) * needed);
-
-				/* do not request exact number of pages */
-				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
-						needed, hpi->hugepage_sz,
-						socket_id, false);
-				if (cur_pages <= 0) {
-					free(pages);
-					return -1;
-				}
-
-				/* mark preallocated pages as unfreeable */
-				for (i = 0; i < cur_pages; i++) {
-					struct rte_memseg *ms = pages[i];
-					ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
-				}
-				free(pages);
-
-				num_pages_alloc += cur_pages;
-			} while (num_pages_alloc != num_pages);
-		}
-	}
-	/* if socket limits were specified, set them */
-	if (internal_config.force_socket_limits) {
-		unsigned int i;
-		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-			uint64_t limit = internal_config.socket_limit[i];
-			if (limit == 0)
-				continue;
-			if (rte_mem_alloc_validator_register("socket-limit",
-					limits_callback, i, limit))
-				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
-		}
-	}
-	return 0;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -1943,7 +1619,7 @@ rte_eal_hugepage_init(void)
 {
 	return internal_config.legacy_mem ?
 			eal_legacy_hugepage_init() :
-			eal_hugepage_init();
+			eal_dynmem_hugepage_init();
 }
 
 int
@@ -2122,8 +1798,9 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (memseg_list_init(msl, hugepage_sz, n_segs,
-						socket_id, type_msl_idx)) {
+				if (eal_memseg_list_init(msl, hugepage_sz,
+						n_segs, socket_id, type_msl_idx,
+						true)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
 					 */
@@ -2131,7 +1808,7 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (memseg_list_alloc(msl)) {
+				if (eal_memseg_list_alloc(msl, 0)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
@@ -2162,185 +1839,7 @@ memseg_primary_init_32(void)
 static int __rte_unused
 memseg_primary_init(void)
 {
-	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	struct memtype {
-		uint64_t page_sz;
-		int socket_id;
-	} *memtypes = NULL;
-	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
-	struct rte_memseg_list *msl;
-	uint64_t max_mem, max_mem_per_type;
-	unsigned int max_seglists_per_type;
-	unsigned int n_memtypes, cur_type;
-
-	/* no-huge does not need this at all */
-	if (internal_config.no_hugetlbfs)
-		return 0;
-
-	/*
-	 * figuring out amount of memory we're going to have is a long and very
-	 * involved process. the basic element we're operating with is a memory
-	 * type, defined as a combination of NUMA node ID and page size (so that
-	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
-	 *
-	 * deciding amount of memory going towards each memory type is a
-	 * balancing act between maximum segments per type, maximum memory per
-	 * type, and number of detected NUMA nodes. the goal is to make sure
-	 * each memory type gets at least one memseg list.
-	 *
-	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
-	 *
-	 * the total amount of memory per type is limited by either
-	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
-	 * of detected NUMA nodes. additionally, maximum number of segments per
-	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
-	 * smaller page sizes, it can take hundreds of thousands of segments to
-	 * reach the above specified per-type memory limits.
-	 *
-	 * additionally, each type may have multiple memseg lists associated
-	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
-	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
-	 *
-	 * the number of memseg lists per type is decided based on the above
-	 * limits, and also taking number of detected NUMA nodes, to make sure
-	 * that we don't run out of memseg lists before we populate all NUMA
-	 * nodes with memory.
-	 *
-	 * we do this in three stages. first, we collect the number of types.
-	 * then, we figure out memory constraints and populate the list of
-	 * would-be memseg lists. then, we go ahead and allocate the memseg
-	 * lists.
-	 */
-
-	/* create space for mem types */
-	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
-	memtypes = calloc(n_memtypes, sizeof(*memtypes));
-	if (memtypes == NULL) {
-		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
-		return -1;
-	}
-
-	/* populate mem types */
-	cur_type = 0;
-	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
-			hpi_idx++) {
-		struct hugepage_info *hpi;
-		uint64_t hugepage_sz;
-
-		hpi = &internal_config.hugepage_info[hpi_idx];
-		hugepage_sz = hpi->hugepage_sz;
-
-		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
-			int socket_id = rte_socket_id_by_idx(i);
-
-#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
-			/* we can still sort pages by socket in legacy mode */
-			if (!internal_config.legacy_mem && socket_id > 0)
-				break;
-#endif
-			memtypes[cur_type].page_sz = hugepage_sz;
-			memtypes[cur_type].socket_id = socket_id;
-
-			RTE_LOG(DEBUG, EAL, "Detected memory type: "
-				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
-				socket_id, hugepage_sz);
-		}
-	}
-	/* number of memtypes could have been lower due to no NUMA support */
-	n_memtypes = cur_type;
-
-	/* set up limits for types */
-	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
-	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
-			max_mem / n_memtypes);
-	/*
-	 * limit maximum number of segment lists per type to ensure there's
-	 * space for memseg lists for all NUMA nodes with all page sizes
-	 */
-	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
-
-	if (max_seglists_per_type == 0) {
-		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
-			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-		goto out;
-	}
-
-	/* go through all mem types and create segment lists */
-	msl_idx = 0;
-	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
-		unsigned int cur_seglist, n_seglists, n_segs;
-		unsigned int max_segs_per_type, max_segs_per_list;
-		struct memtype *type = &memtypes[cur_type];
-		uint64_t max_mem_per_list, pagesz;
-		int socket_id;
-
-		pagesz = type->page_sz;
-		socket_id = type->socket_id;
-
-		/*
-		 * we need to create segment lists for this type. we must take
-		 * into account the following things:
-		 *
-		 * 1. total amount of memory we can use for this memory type
-		 * 2. total amount of memory per memseg list allowed
-		 * 3. number of segments needed to fit the amount of memory
-		 * 4. number of segments allowed per type
-		 * 5. number of segments allowed per memseg list
-		 * 6. number of memseg lists we are allowed to take up
-		 */
-
-		/* calculate how much segments we will need in total */
-		max_segs_per_type = max_mem_per_type / pagesz;
-		/* limit number of segments to maximum allowed per type */
-		max_segs_per_type = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
-		/* limit number of segments to maximum allowed per list */
-		max_segs_per_list = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
-
-		/* calculate how much memory we can have per segment list */
-		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
-				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
-
-		/* calculate how many segments each segment list will have */
-		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
-
-		/* calculate how many segment lists we can have */
-		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
-				max_mem_per_type / max_mem_per_list);
-
-		/* limit number of segment lists according to our maximum */
-		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
-
-		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
-				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
-			n_seglists, n_segs, socket_id, pagesz);
-
-		/* create all segment lists */
-		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
-			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
-				RTE_LOG(ERR, EAL,
-					"No more space in memseg lists, please increase %s\n",
-					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-				goto out;
-			}
-			msl = &mcfg->memsegs[msl_idx++];
-
-			if (memseg_list_init(msl, pagesz, n_segs,
-					socket_id, cur_seglist))
-				goto out;
-
-			if (memseg_list_alloc(msl)) {
-				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
-				goto out;
-			}
-		}
-	}
-	/* we're successful */
-	ret = 0;
-out:
-	free(memtypes);
-	return ret;
+	return eal_dynmem_memseg_lists_init();
 }
 
 static int
@@ -2364,7 +1863,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (memseg_list_alloc(msl)) {
+		if (eal_memseg_list_alloc(msl, 0)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 06/11] trace: add size_t field emitter
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (4 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-03  3:29           ` Jerin Jacob
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
                           ` (5 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, Sunil Kumar Kori,
	Olivier Matz, Andrew Rybchenko

It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
sizeof(long) == 4 and sizeof(size_t) == 8 for 64-bit programs.
Tracepoints using "long" field emitter are therefore invalid there.
Add dedicated field emitter for size_t and use it to store size_t values
in all existing tracepoints.
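
As a minimal illustration (not part of the patch), the size mismatch
between 64-bit Windows (LLP64) and 64-bit Linux (LP64) can be seen with:

  #include <stdio.h>
  #include <stddef.h>

  int main(void)
  {
  	/* Win64 prints 4 then 8; a size_t stored through a "long"
  	 * field emitter would be truncated. LP64 Linux prints 8 and 8,
  	 * which is why the bug goes unnoticed there.
  	 */
  	printf("sizeof(long)   = %zu\n", sizeof(long));
  	printf("sizeof(size_t) = %zu\n", sizeof(size_t));
  	return 0;
  }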

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/include/rte_eal_trace.h   |  8 ++++----
 lib/librte_eal/include/rte_trace_point.h |  3 +++
 lib/librte_mempool/rte_mempool_trace.h   | 10 +++++-----
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/include/rte_eal_trace.h b/lib/librte_eal/include/rte_eal_trace.h
index 1ebb2905a..bcfef0cfa 100644
--- a/lib/librte_eal/include/rte_eal_trace.h
+++ b/lib/librte_eal/include/rte_eal_trace.h
@@ -143,7 +143,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -154,7 +154,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -164,7 +164,7 @@ RTE_TRACE_POINT(
 	rte_eal_trace_mem_realloc,
 	RTE_TRACE_POINT_ARGS(size_t size, unsigned int align, int socket,
 		void *ptr),
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -183,7 +183,7 @@ RTE_TRACE_POINT(
 		unsigned int flags, unsigned int align, unsigned int bound,
 		const void *mz),
 	rte_trace_point_emit_string(name);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_int(socket_id);
 	rte_trace_point_emit_u32(flags);
 	rte_trace_point_emit_u32(align);
diff --git a/lib/librte_eal/include/rte_trace_point.h b/lib/librte_eal/include/rte_trace_point.h
index b45171275..377c2414a 100644
--- a/lib/librte_eal/include/rte_trace_point.h
+++ b/lib/librte_eal/include/rte_trace_point.h
@@ -138,6 +138,8 @@ _tp _args \
 #define rte_trace_point_emit_int(val)
 /** Tracepoint function payload for long datatype */
 #define rte_trace_point_emit_long(val)
+/** Tracepoint function payload for size_t datatype */
+#define rte_trace_point_emit_size_t(val)
 /** Tracepoint function payload for float datatype */
 #define rte_trace_point_emit_float(val)
 /** Tracepoint function payload for double datatype */
@@ -395,6 +397,7 @@ do { \
 #define rte_trace_point_emit_i8(in) __rte_trace_point_emit(in, int8_t)
 #define rte_trace_point_emit_int(in) __rte_trace_point_emit(in, int32_t)
 #define rte_trace_point_emit_long(in) __rte_trace_point_emit(in, long)
+#define rte_trace_point_emit_size_t(in) __rte_trace_point_emit(in, size_t)
 #define rte_trace_point_emit_float(in) __rte_trace_point_emit(in, float)
 #define rte_trace_point_emit_double(in) __rte_trace_point_emit(in, double)
 #define rte_trace_point_emit_ptr(in) __rte_trace_point_emit(in, uintptr_t)
diff --git a/lib/librte_mempool/rte_mempool_trace.h b/lib/librte_mempool/rte_mempool_trace.h
index e776df0a6..087c913c8 100644
--- a/lib/librte_mempool/rte_mempool_trace.h
+++ b/lib/librte_mempool/rte_mempool_trace.h
@@ -72,7 +72,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -84,8 +84,8 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(addr);
-	rte_trace_point_emit_long(len);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(len);
+	rte_trace_point_emit_size_t(pg_sz);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -126,7 +126,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(struct rte_mempool *mempool, size_t pg_sz),
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(pg_sz);
 )
 
 RTE_TRACE_POINT(
@@ -139,7 +139,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_u32(max_objs);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(obj_cb);
 	rte_trace_point_emit_ptr(obj_cb_arg);
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 07/11] eal/windows: add tracing support stubs
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (5 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 06/11] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                           ` (4 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon

EAL common code depends on tracepoint calls, but the generic
implementation cannot be enabled on Windows due to missing standard
library facilities. Add stub functions to support tracepoint
compilation, so that common code does not have to guard tracepoint
calls with conditional compilation until proper support is added.
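
As a sketch of the effect (hypothetical caller, not part of the patch),
common code can now invoke the trace API unconditionally; on Windows the
call resolves to the stub below and becomes a no-op:

  #include <rte_trace_point.h>

  static void
  worker_setup(void)
  {
  	/* Allocates per-thread trace memory on Linux/FreeBSD;
  	 * does nothing on Windows until proper support is added.
  	 */
  	__rte_trace_mem_per_thread_alloc();
  }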

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_thread.c |  5 +---
 lib/librte_eal/common/meson.build         |  1 +
 lib/librte_eal/windows/eal.c              | 34 ++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_thread.c b/lib/librte_eal/common/eal_common_thread.c
index f9f588c17..370bb1b63 100644
--- a/lib/librte_eal/common/eal_common_thread.c
+++ b/lib/librte_eal/common/eal_common_thread.c
@@ -15,9 +15,7 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 #include <rte_log.h>
-#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_trace_point.h>
-#endif
 
 #include "eal_internal_cfg.h"
 #include "eal_private.h"
@@ -169,9 +167,8 @@ static void *rte_thread_init(void *arg)
 		free(params);
 	}
 
-#ifndef RTE_EXEC_ENV_WINDOWS
 	__rte_trace_mem_per_thread_alloc();
-#endif
+
 	return start_routine(routine_arg);
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index d91c22220..4e9208129 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -14,6 +14,7 @@ if is_windows
 		'eal_common_log.c',
 		'eal_common_options.c',
 		'eal_common_thread.c',
+		'eal_common_trace_points.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index d084606a6..e7461f731 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -17,6 +17,7 @@
 #include <eal_filesystem.h>
 #include <eal_options.h>
 #include <eal_private.h>
+#include <rte_trace_point.h>
 
 #include "eal_windows.h"
 
@@ -221,7 +222,38 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
- /* Launch threads, called at application init(). */
+/* Stubs to enable EAL trace point compilation
+ * until eal_common_trace.c can be compiled.
+ */
+
+RTE_DEFINE_PER_LCORE(volatile int, trace_point_sz);
+RTE_DEFINE_PER_LCORE(void *, trace_mem);
+
+void
+__rte_trace_mem_per_thread_alloc(void)
+{
+}
+
+void
+__rte_trace_point_emit_field(size_t sz, const char *field,
+	const char *type)
+{
+	RTE_SET_USED(sz);
+	RTE_SET_USED(field);
+	RTE_SET_USED(type);
+}
+
+int
+__rte_trace_point_register(rte_trace_point_t *trace, const char *name,
+	void (*register_fn)(void))
+{
+	RTE_SET_USED(trace);
+	RTE_SET_USED(name);
+	RTE_SET_USED(register_fn);
+	return -ENOTSUP;
+}
+
+/* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (6 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                           ` (3 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

The limited version imported previously lacks at least the SLIST macros.
Import the complete file from FreeBSD, since its license exception is
already approved by the Technical Board.
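
For reference, a minimal sketch of the SLIST usage that the previous
limited import could not support (type and field names are illustrative):

  #include <sys/queue.h>

  struct entry {
  	int value;
  	SLIST_ENTRY(entry) link;	/* embedded forward pointer */
  };

  SLIST_HEAD(entry_list, entry);

  static void
  example(void)
  {
  	struct entry_list head = SLIST_HEAD_INITIALIZER(head);
  	struct entry e = { .value = 42 };
  	struct entry *it;

  	SLIST_INSERT_HEAD(&head, &e, link);
  	SLIST_FOREACH(it, &head, link)
  		(void)it->value;	/* visit each element */
  	SLIST_REMOVE_HEAD(&head, link);
  }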

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread
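
For reference, a minimal usage sketch of the list macros above (illustrative
only, not part of the patch; it assumes this FreeBSD-derived header is the one
found on the include path, since LIST_FOREACH_SAFE is absent from many
non-BSD sys/queue.h versions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct item {
        int value;
        LIST_ENTRY(item) entries; /* embedded linkage */
    };

    LIST_HEAD(item_list, item);

    int
    main(void)
    {
        struct item_list head = LIST_HEAD_INITIALIZER(head);
        struct item *it, *tmp;
        int i;

        for (i = 0; i < 3; i++) {
            it = malloc(sizeof(*it));
            if (it == NULL)
                break;
            it->value = i;
            LIST_INSERT_HEAD(&head, it, entries);
        }

        /* The safe variant allows freeing the current element. */
        LIST_FOREACH_SAFE(it, &head, entries, tmp) {
            printf("%d\n", it->value);
            LIST_REMOVE(it, entries);
            free(it);
        }
        return 0;
    }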

* [dpdk-dev] [PATCH v6 09/11] eal/windows: improve CPU and NUMA node detection
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (7 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
                           ` (2 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups);
   see the sketch after this list.
3. Fix magic constants, styling issues, and compiler warnings.
4. Add an EAL private function to map a DPDK socket ID to a NUMA node number.
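
For illustration, a standalone sketch of the enumeration approach used by the
new code (not part of the patch; error handling trimmed, requires
_WIN32_WINNT >= 0x0601 at compile time):

    #include <windows.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Logical processors per group: 32 or 64, depending on bitness. */
    #define GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)

    int
    main(void)
    {
        SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
        DWORD size = 0;

        /* First call fails but reports the required buffer size. */
        GetLogicalProcessorInformationEx(RelationNumaNode, NULL, &size);
        infos = malloc(size);
        if (infos == NULL || !GetLogicalProcessorInformationEx(
                RelationNumaNode, infos, &size))
            return 1;

        info = infos;
        while ((DWORD)((BYTE *)info - (BYTE *)infos) < size) {
            GROUP_AFFINITY *mask = &info->NumaNode.GroupMask;
            DWORD i;

            /* Each record covers one (node, group) pair; a node spanning
             * two groups is simply reported twice.
             */
            for (i = 0; i < GROUP_SIZE; i++)
                if (mask->Mask & ((KAFFINITY)1 << i))
                    printf("node %u: global core %u\n",
                        (unsigned)info->NumaNode.NodeNumber,
                        (unsigned)(mask->Group * GROUP_SIZE + i));

            info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)
                ((BYTE *)info + info->Size);
        }
        free(infos);
        return 0;
    }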

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				error);
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* A NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups: e.g., 80 cores
+		 * of a physical processor comprise one NUMA node but two
+		 * processor groups, because a group holds at most 32 or 64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 10/11] eal/windows: initialize hugepage info
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (8 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic

Add hugepage discovery ("large pages" in Windows terminology)
and update documentation for the required privilege setup. Only 2MB
hugepages are supported, and their number is only estimated roughly,
because suitable OS APIs are either missing or unstable.
Assign myself as maintainer for the implementation file.
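
For illustration, a minimal sketch of the Win32 sequence behind this patch
(not part of the patch; error handling trimmed, link with -ladvapi32 as done
in config/meson.build below):

    #include <windows.h>
    #include <stdio.h>

    int
    main(void)
    {
        HANDLE token;
        TOKEN_PRIVILEGES tp;
        SIZE_T page_sz;
        void *mem;

        /* Enable SeLockMemoryPrivilege for the current process;
         * the user must already hold "Lock pages in memory".
         */
        if (!OpenProcessToken(GetCurrentProcess(),
                TOKEN_ADJUST_PRIVILEGES, &token))
            return 1;
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
        LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME,
            &tp.Privileges[0].Luid);
        AdjustTokenPrivileges(token, FALSE, &tp, sizeof(tp), NULL, NULL);
        CloseHandle(token);

        /* Query the large page size and allocate one page. */
        page_sz = GetLargePageMinimum(); /* typically 2MB on x86-64 */
        mem = VirtualAlloc(NULL, page_sz,
            MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);

        printf("large page: %llu bytes, allocation %s\n",
            (unsigned long long)page_sz, mem != NULL ? "ok" : "failed");
        return 0;
    }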

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                            |   4 +
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 8 files changed, 177 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a1722ca73..19b818f69 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -336,6 +336,10 @@ F: lib/librte_eal/windows/
 F: lib/librte_eal/rte_eal_exports.def
 F: doc/guides/windows_gsg/
 
+Windows memory allocation
+M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
+F: lib/librte_eal/eal_hugepages.c
+
 
 Core Libraries
 --------------
diff --git a/config/meson.build b/config/meson.build
index 43ab11310..c1e80de4b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -268,6 +268,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminolocy) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. The privilege takes effect upon next logon. In particular, if it has been
+   granted to the current user, a logoff is required before it is available.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e7461f731..7c2fcc860 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -19,8 +19,11 @@
 #include <eal_private.h>
 #include <rte_trace_point.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -276,6 +279,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..61d0dcd3c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available on Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node is available for hugepages,
+	 * because Windows neither advertises additional limits
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem on Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index adfc8b9b7..52978e9d7 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
 	'eal_thread.c',
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v6 11/11] eal/windows: implement basic memory management
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (9 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-06-02 23:03         ` Dmitry Kozlyuk
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-02 23:03 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

Basic memory management supports core libraries and PMDs operating in
IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated from user mode. Multi-process mode is not
implemented and is forcefully disabled at startup. Assign myself as a
maintainer for the Windows file and memory management implementation.
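
For illustration, a hedged sketch of a translation request through the driver
interface (not part of the patch; the device handle is assumed to be opened
via SetupAPI by interface GUID, as eal_mem_virt2iova_init() does below):

    #include <rte_windows.h>
    #include <rte_virt2phys.h>

    /* Translate one virtual address; returns 0 on failure.
     * The input to the IOCTL is the pointer value itself.
     */
    static unsigned long long
    virt2phys(HANDLE device, void *virt)
    {
        LARGE_INTEGER phys;
        DWORD bytes_returned;

        if (!DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
                &virt, sizeof(virt), &phys, sizeof(phys),
                &bytes_returned, NULL))
            return 0;
        return (unsigned long long)phys.QuadPart;
    }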

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                                   |   1 +
 config/meson.build                            |  12 +-
 doc/guides/windows_gsg/run_apps.rst           |  54 +-
 lib/librte_eal/common/meson.build             |  11 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/windows/eal.c                  |  63 +-
 lib/librte_eal/windows/eal_file.c             | 123 +++
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  75 ++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   6 +
 18 files changed, 1769 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 19b818f69..5140756b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -339,6 +339,7 @@ F: doc/guides/windows_gsg/
 Windows memory allocation
 M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
 F: lib/librte_eal/eal_hugepages.c
+F: lib/librte_eal/eal_mem*
 
 
 Core Libraries
diff --git a/config/meson.build b/config/meson.build
index c1e80de4b..d3f05f878 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -261,15 +261,21 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in the Windows SDK, while MinGW exports it from libadvapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..78e5a614f 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to the documentation in the ``dpdk-kmods`` repository for details
+on system setup, driver build, and installation. This driver is not signed,
+so signature checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+The compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from an elevated command prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+On Windows Server additional steps are required:
+
+1. In Device Manager, open the Action menu and select "Add legacy hardware".
+2. This launches the "Add Hardware Wizard"; click "Next".
+3. Select the second option, "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, "Kernel bypass" will be shown as a device class.
+5. Select it, and click "Next".
+6. The previously installed drivers will now be installed for the
+   "Virtual to physical address translator" device.
+
+When loaded successfully, the driver is shown in *Device Manager* as the
+*Virtual to physical address translator* device under the *Kernel bypass*
+category. The installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 4e9208129..310844269 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -8,13 +8,24 @@ if is_windows
 		'eal_common_bus.c',
 		'eal_common_class.c',
 		'eal_common_devargs.c',
+		'eal_common_dynmem.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_string_fns.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
 		'eal_common_trace_points.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index f1b73168b..9d39e58c0 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -20,6 +20,7 @@
 #include <rte_lcore.h>
 #include <rte_common.h>
 #include <rte_spinlock.h>
+
 #include <rte_eal_trace.h>
 
 #include <rte_malloc.h>
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..e2eb24f01 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,9 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
 	rte_log
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
 	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_page_size
+	rte_mem_unmap
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 7c2fcc860..a43649abc 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -94,6 +94,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -256,7 +274,7 @@ __rte_trace_point_register(rte_trace_point_t *trace, const char *name,
 	return -ENOTSUP;
 }
 
-/* Launch threads, called at application init(). */
+ /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
@@ -279,6 +297,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.in_memory == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.in_memory = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -290,6 +315,42 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_file.c b/lib/librte_eal/windows/eal_file.c
new file mode 100644
index 000000000..dbb08456c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_file.c
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <fcntl.h>
+#include <io.h>
+#include <share.h>
+#include <sys/stat.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_file_create(const char *path)
+{
+	int fd, ret;
+
+	ret = _sopen_s(&fd, path, _O_CREAT | _O_RDWR, _SH_DENYNO, _S_IWRITE);
+	if (ret) {
+		rte_errno = ret;
+		return -1;
+	}
+
+	return fd;
+}
+
+int
+eal_file_open(const char *path, bool writable)
+{
+	int fd, ret, flags;
+
+	flags = writable ? _O_RDWR : _O_RDONLY;
+	ret = _sopen_s(&fd, path, flags, _SH_DENYNO, 0);
+	if (ret) {
+		rte_errno = ret;
+		return -1;
+	}
+
+	return fd;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER || !SetEndOfFile(handle)) {
+		RTE_LOG_WIN32_ERR("SetFilePointer(), SetEndOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..a7452b6e1
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,441 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+
+		/* During commitment, memory is temporarily freed and might
+		 * be allocated by a different non-EAL thread. This is a fatal
+		 * error, because it breaks MSL assumptions.
+		 */
+		if ((addr != NULL) && (addr != requested_addr)) {
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				" allocation - MSL is not VA-contiguous!\n",
+				requested_addr);
+			return -1;
+		}
+
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx(%p)", addr);
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	if (eal_mem_decommit(addr, alloc_sz) && (rte_errno == EADDRNOTAVAIL)) {
+		/* During decommitment, memory is temporarily returned
+		 * to the system and the address may become unavailable.
+		 */
+		RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+			" allocation - MSL is not VA-contiguous!\n", addr);
+	}
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len)) {
+		if (rte_errno == EADDRNOTAVAIL) {
+			/* See alloc_seg() for explanation. */
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				" allocation - MSL is not VA-contiguous!\n",
+				ms->addr);
+		}
+		return -1;
+	}
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..2739da346
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,710 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <inttypes.h>
+#include <io.h>
+
+#include <rte_eal_memory.h>
+#include <rte_errno.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+#include <rte_virt2phys.h>
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ *
+ * MEM_PRESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags,
+	MemExtendedParameterMax
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly as the function, so that user code does not depend
+ * on it being found at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
+	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
+	 */
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+
+		/* Contrary to the docs, Server 2016 is not supported. */
+		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
+			" is required for memory management\n");
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* defined(MEM_PRESERVE_PLACEHOLDER) */
+
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
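+	/* The device handle stays open for the process lifetime:
+	 * translations may be needed at any time, and Windows reclaims
+	 * the handle on process exit.
+	 */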
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Windows always uses physical addresses if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	HANDLE process;
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	process = GetCurrentProcess();
+
+	virt = VirtualAlloc2(process, requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFreeEx(process, virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
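+
+/* Note: a "placeholder" is a reserved address range that can later be
+ * split (MEM_PRESERVE_PLACEHOLDER) or atomically replaced with a real
+ * allocation (MEM_REPLACE_PLACEHOLDER). Hugepages cannot use
+ * MEM_REPLACE_PLACEHOLDER (see eal_mem_commit() below), hence the
+ * release-and-reallocate sequence there, with its inherent race.
+ */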
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	HANDLE process;
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	process = GetCurrentProcess();
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+
+		if (VirtualQueryEx(process, requested_addr, &info,
+				sizeof(info)) != sizeof(info)) {
+			RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", requested_addr);
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) && !VirtualFreeEx(
+				process, requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR(
+				"VirtualFreeEx(%p, %zu, preserve placeholder)",
+				requested_addr, size);
+			return NULL;
+		}
+
+		/* Temporarily release the region to be committed.
+		 *
+		 * There is an inherent race for this memory range
+		 * if another thread allocates memory via OS API.
+		 * However, VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)
+		 * doesn't work with MEM_LARGE_PAGES on Windows Server.
+		 */
+		if (!VirtualFreeEx(process, requested_addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				requested_addr);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAlloc2(process, requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		/* Logging may overwrite GetLastError() result. */
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, commit large pages)",
+			requested_addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((requested_addr != NULL) && (addr != requested_addr)) {
+		/* We lost the race for the requested_addr. */
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, release)", addr);
+
+		rte_errno = EADDRNOTAVAIL;
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	HANDLE process;
+	void *stub;
+	DWORD flags;
+
+	process = GetCurrentProcess();
+
+	/* Hugepages cannot be decommitted on Windows,
+	 * so free them and replace the block with a placeholder.
+	 * There is a race for VA in this block until VirtualAlloc2 call.
+	 */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	flags = MEM_RESERVE | MEM_RESERVE_PLACEHOLDER;
+	stub = VirtualAlloc2(
+		process, addr, size, flags, PAGE_NOACCESS, NULL, 0);
+	if (stub == NULL) {
+		/* We lost the race for the VA: the range was allocated
+		 * by another thread after the release above, so it is
+		 * no longer ours to free.
+		 */
+		rte_errno = EADDRNOTAVAIL;
+		return -1;
+	}
+
+	/* No need to join reserved regions adjacent to the freed one:
+	 * eal_mem_commit() will just pick up the page-size placeholder
+	 * created here.
+	 */
+	return 0;
+}
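+
+/* Sketch of the allocation lifecycle these wrappers are designed for
+ * (illustrative only, error handling omitted):
+ *
+ *	va = eal_mem_reserve(NULL, len, 0);          placeholder
+ *	seg = eal_mem_commit(va, page_sz, socket);   hugepage at va
+ *	eal_mem_decommit(seg, page_sz);              placeholder again
+ *	eal_mem_free(va, len);                       range released
+ */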
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if the region must be in reserved state but it is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	HANDLE process;
+
+	process = GetCurrentProcess();
+
+	if (VirtualQueryEx(
+			process, addr, &info, sizeof(info)) != sizeof(info)) {
+		RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", addr);
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free the complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				addr);
+			return -1;
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFreeEx(process, addr, size,
+			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR(
+			"VirtualFreeEx(%p, %zu, preserve placeholder)",
+			addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	RTE_SET_USED(virt);
+	RTE_SET_USED(size);
+	RTE_SET_USED(dump);
+
+	/* Windows does not dump reserved memory by default.
+	 *
+	 * There is <werapi.h> to include or exclude regions from the dump,
+	 * but this is not currently required by EAL.
+	 */
+
+	rte_errno = ENOTSUP;
+	return -1;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = NULL;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	/* Note: CreateFileMapping() returns NULL on failure,
+	 * not INVALID_HANDLE_VALUE like CreateFile() does.
+	 */
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* There is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() could replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			/* Don't leak the mapping object on failure. */
+			if (!CloseHandle(mapping_handle))
+				RTE_LOG_WIN32_ERR("CloseHandle()");
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (virt == NULL) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		if (!CloseHandle(mapping_handle))
+			RTE_LOG_WIN32_ERR("CloseHandle()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!UnmapViewOfFile(virt))
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		virt = NULL;
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
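+
+/* Example (illustrative): an anonymous mapping forced into an address
+ * range previously created with eal_mem_reserve(), similar to Unix
+ * mmap(MAP_ANONYMOUS | MAP_FIXED):
+ *
+ *	void *va = eal_mem_reserve(NULL, size, 0);
+ *	void *mem = rte_mem_map(va, size,
+ *		RTE_PROT_READ | RTE_PROT_WRITE,
+ *		RTE_MAP_ANONYMOUS | RTE_MAP_PRIVATE | RTE_MAP_FORCE_ADDRESS,
+ *		-1, 0);
+ */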
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
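+	/* Windows can only unmap a complete view; the size argument
+	 * exists for symmetry with the POSIX-style API.
+	 */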
+
+	if (!UnmapViewOfFile(virt)) {
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* Windows strategy for memory allocation is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless the user provides an address hint.
+	 */
+	return 0;
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static SYSTEM_INFO info;
+
+	if (info.dwPageSize == 0)
+		GetSystemInfo(&info);
+
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes `void*`, work around compiler warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock(%p %#zx)", virt, size);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		EAL_LOG_NOT_IMPLEMENTED();
+		return -1;
+	}
+
+	return eal_dynmem_memseg_lists_init();
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs;
+	uint64_t mem_sz, page_sz;
+	void *addr;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	msl = &mcfg->memsegs[0];
+
+	mem_sz = internal_config.memory;
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = mem_sz / page_sz;
+
+	if (eal_memseg_list_init_named(
+			msl, "nohugemem", page_sz, n_segs, 0, true)) {
+		return -1;
+	}
+
+	addr = VirtualAlloc(
+		NULL, mem_sz, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (addr == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc(size=%#zx)", mem_sz);
+		RTE_LOG(ERR, EAL, "Cannot allocate memory\n");
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	eal_memseg_list_populate(msl, addr, n_segs);
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_dynmem_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, a stub must log a warning instead,
+ * and a comment must document which code relies on the emulated success.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* A real implementation would also succeed when multi-process
+	 * is not supported, so the stub returns success.
+	 */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* The common memory allocator depends on this function succeeding. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..caabffedf 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,4 +52,63 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with eal_mem_reserve()
+ * or decommitted from hugepages by eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and rte_errno is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit, must be the size of a page
+ *  (hugepage or regular one).
+ *
+ * The *addr* and *size* must match location and size
+ * of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..cb10d6494 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -14,6 +14,7 @@
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -36,6 +37,9 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
@@ -46,6 +50,7 @@ extern "C" {
 typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
+
 static inline int
 asprintf(char **buffer, const char *format, ...)
 {
@@ -72,6 +77,18 @@ asprintf(char **buffer, const char *format, ...)
 	}
 	return ret;
 }
+
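+/* Note: the static buffer makes eal_strerror() non-reentrant, which
+ * matches the traditional strerror() contract callers already assume.
+ */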
+static inline const char *
+eal_strerror(int code)
+{
+	static char buffer[128];
+
+	strerror_s(buffer, sizeof(buffer), code);
+	return buffer;
+}
+
+#define strerror eal_strerror
+
 #endif /* RTE_TOOLCHAIN_GCC */
 
 #ifdef __cplusplus
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
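+
+/* Example (illustrative; this is how rte_mem_virt2phy() in eal_memory.c
+ * drives the interface):
+ *
+ *	LARGE_INTEGER phys;
+ *	DWORD bytes_returned;
+ *	DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
+ *		&virt, sizeof(virt), &phys, sizeof(phys),
+ *		&bytes_returned, NULL);
+ */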
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 52978e9d7..ded5a2b80 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,10 +6,16 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_file.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
+	'eal_memalloc.c',
+	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'fnmatch.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-06-03  1:59           ` Stephen Hemminger
  0 siblings, 0 replies; 218+ messages in thread
From: Stephen Hemminger @ 2020-06-03  1:59 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Jerin Jacob, John McNamara, Marko Kovacevic,
	Anatoly Burakov

On Wed,  3 Jun 2020 02:03:19 +0300
Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:

> Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
> Enum rte_page_sizes has members valued above this limit, which get
> wrapped to zero, resulting in compilation error (duplicate values in
> enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.
> 
> Remove rte_page_sizes and replace its values with #define's.
> This enumeration is not used in public API, so there's no ABI breakage.
> Announce API changes for 20.08 in documentation.
> 
> Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>

In this case #define makes more sense.

Acked-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 06/11] trace: add size_t field emitter
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 06/11] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-06-03  3:29           ` Jerin Jacob
  0 siblings, 0 replies; 218+ messages in thread
From: Jerin Jacob @ 2020-06-03  3:29 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dpdk-dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Jerin Jacob, Sunil Kumar Kori, Olivier Matz,
	Andrew Rybchenko

On Wed, Jun 3, 2020 at 4:35 AM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
>
> It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
> sizeof(long) == 4 and sizeof(size_t) == 8 for 64-bit programs.
> Tracepoints using "long" field emitter are therefore invalid there.
> Add dedicated field emitter for size_t and use it to store size_t values
> in all existing tracepoints.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>

Reviewed-by: Jerin Jacob <jerinj@marvell.com>



> ---
>  lib/librte_eal/include/rte_eal_trace.h   |  8 ++++----
>  lib/librte_eal/include/rte_trace_point.h |  3 +++
>  lib/librte_mempool/rte_mempool_trace.h   | 10 +++++-----
>  3 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/lib/librte_eal/include/rte_eal_trace.h b/lib/librte_eal/include/rte_eal_trace.h
> index 1ebb2905a..bcfef0cfa 100644
> --- a/lib/librte_eal/include/rte_eal_trace.h
> +++ b/lib/librte_eal/include/rte_eal_trace.h
> @@ -143,7 +143,7 @@ RTE_TRACE_POINT(
>         RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
>                 int socket, void *ptr),
>         rte_trace_point_emit_string(type);
> -       rte_trace_point_emit_long(size);
> +       rte_trace_point_emit_size_t(size);
>         rte_trace_point_emit_u32(align);
>         rte_trace_point_emit_int(socket);
>         rte_trace_point_emit_ptr(ptr);
> @@ -154,7 +154,7 @@ RTE_TRACE_POINT(
>         RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
>                 int socket, void *ptr),
>         rte_trace_point_emit_string(type);
> -       rte_trace_point_emit_long(size);
> +       rte_trace_point_emit_size_t(size);
>         rte_trace_point_emit_u32(align);
>         rte_trace_point_emit_int(socket);
>         rte_trace_point_emit_ptr(ptr);
> @@ -164,7 +164,7 @@ RTE_TRACE_POINT(
>         rte_eal_trace_mem_realloc,
>         RTE_TRACE_POINT_ARGS(size_t size, unsigned int align, int socket,
>                 void *ptr),
> -       rte_trace_point_emit_long(size);
> +       rte_trace_point_emit_size_t(size);
>         rte_trace_point_emit_u32(align);
>         rte_trace_point_emit_int(socket);
>         rte_trace_point_emit_ptr(ptr);
> @@ -183,7 +183,7 @@ RTE_TRACE_POINT(
>                 unsigned int flags, unsigned int align, unsigned int bound,
>                 const void *mz),
>         rte_trace_point_emit_string(name);
> -       rte_trace_point_emit_long(len);
> +       rte_trace_point_emit_size_t(len);
>         rte_trace_point_emit_int(socket_id);
>         rte_trace_point_emit_u32(flags);
>         rte_trace_point_emit_u32(align);
> diff --git a/lib/librte_eal/include/rte_trace_point.h b/lib/librte_eal/include/rte_trace_point.h
> index b45171275..377c2414a 100644
> --- a/lib/librte_eal/include/rte_trace_point.h
> +++ b/lib/librte_eal/include/rte_trace_point.h
> @@ -138,6 +138,8 @@ _tp _args \
>  #define rte_trace_point_emit_int(val)
>  /** Tracepoint function payload for long datatype */
>  #define rte_trace_point_emit_long(val)
> +/** Tracepoint function payload for size_t datatype */
> +#define rte_trace_point_emit_size_t(val)
>  /** Tracepoint function payload for float datatype */
>  #define rte_trace_point_emit_float(val)
>  /** Tracepoint function payload for double datatype */
> @@ -395,6 +397,7 @@ do { \
>  #define rte_trace_point_emit_i8(in) __rte_trace_point_emit(in, int8_t)
>  #define rte_trace_point_emit_int(in) __rte_trace_point_emit(in, int32_t)
>  #define rte_trace_point_emit_long(in) __rte_trace_point_emit(in, long)
> +#define rte_trace_point_emit_size_t(in) __rte_trace_point_emit(in, size_t)
>  #define rte_trace_point_emit_float(in) __rte_trace_point_emit(in, float)
>  #define rte_trace_point_emit_double(in) __rte_trace_point_emit(in, double)
>  #define rte_trace_point_emit_ptr(in) __rte_trace_point_emit(in, uintptr_t)
> diff --git a/lib/librte_mempool/rte_mempool_trace.h b/lib/librte_mempool/rte_mempool_trace.h
> index e776df0a6..087c913c8 100644
> --- a/lib/librte_mempool/rte_mempool_trace.h
> +++ b/lib/librte_mempool/rte_mempool_trace.h
> @@ -72,7 +72,7 @@ RTE_TRACE_POINT(
>         rte_trace_point_emit_string(mempool->name);
>         rte_trace_point_emit_ptr(vaddr);
>         rte_trace_point_emit_u64(iova);
> -       rte_trace_point_emit_long(len);
> +       rte_trace_point_emit_size_t(len);
>         rte_trace_point_emit_ptr(free_cb);
>         rte_trace_point_emit_ptr(opaque);
>  )
> @@ -84,8 +84,8 @@ RTE_TRACE_POINT(
>         rte_trace_point_emit_ptr(mempool);
>         rte_trace_point_emit_string(mempool->name);
>         rte_trace_point_emit_ptr(addr);
> -       rte_trace_point_emit_long(len);
> -       rte_trace_point_emit_long(pg_sz);
> +       rte_trace_point_emit_size_t(len);
> +       rte_trace_point_emit_size_t(pg_sz);
>         rte_trace_point_emit_ptr(free_cb);
>         rte_trace_point_emit_ptr(opaque);
>  )
> @@ -126,7 +126,7 @@ RTE_TRACE_POINT(
>         RTE_TRACE_POINT_ARGS(struct rte_mempool *mempool, size_t pg_sz),
>         rte_trace_point_emit_ptr(mempool);
>         rte_trace_point_emit_string(mempool->name);
> -       rte_trace_point_emit_long(pg_sz);
> +       rte_trace_point_emit_size_t(pg_sz);
>  )
>
>  RTE_TRACE_POINT(
> @@ -139,7 +139,7 @@ RTE_TRACE_POINT(
>         rte_trace_point_emit_u32(max_objs);
>         rte_trace_point_emit_ptr(vaddr);
>         rte_trace_point_emit_u64(iova);
> -       rte_trace_point_emit_long(len);
> +       rte_trace_point_emit_size_t(len);
>         rte_trace_point_emit_ptr(obj_cb);
>         rte_trace_point_emit_ptr(obj_cb_arg);
>  )
> --
> 2.25.4
>

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-06-03 12:07           ` Neil Horman
  2020-06-03 12:34             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Neil Horman @ 2020-06-03 12:07 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson

On Wed, Jun 03, 2020 at 02:03:20AM +0300, Dmitry Kozlyuk wrote:
> Introduce OS-independent wrappers in order to support common EAL code
> on Unix and Windows:
> 
> * eal_file_create: open an existing file.
> * eal_file_open: create a file or open it if exists.
> * eal_file_lock: lock or unlock an open file.
> * eal_file_truncate: enforce a given size for an open file.
> 
> Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> which is intended for common code between the two. These thin wrappers
> require no special maintenance.
> 
> Common code supporting multi-process doesn't use the new wrappers,
> because it is inherently Unix-specific and would impose excessive
> requirements on the wrappers.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  MAINTAINERS                                |  3 +
>  lib/librte_eal/common/eal_common_fbarray.c | 31 ++++-----
>  lib/librte_eal/common/eal_private.h        | 74 +++++++++++++++++++++
>  lib/librte_eal/freebsd/Makefile            |  4 ++
>  lib/librte_eal/linux/Makefile              |  4 ++
>  lib/librte_eal/meson.build                 |  4 ++
>  lib/librte_eal/unix/eal_file.c             | 76 ++++++++++++++++++++++
>  lib/librte_eal/unix/meson.build            |  6 ++
>  8 files changed, 183 insertions(+), 19 deletions(-)
>  create mode 100644 lib/librte_eal/unix/eal_file.c
>  create mode 100644 lib/librte_eal/unix/meson.build
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d2b286701..1d9aff26d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -323,6 +323,9 @@ FreeBSD UIO
>  M: Bruce Richardson <bruce.richardson@intel.com>
>  F: kernel/freebsd/nic_uio/
>  
> +Unix shared files
> +F: lib/librte_eal/unix/
> +
>  Windows support
>  M: Harini Ramakrishnan <harini.ramakrishnan@microsoft.com>
>  M: Omar Cardona <ocardona@microsoft.com>
> diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
> index 4f8f1af73..81ce4bd42 100644
> --- a/lib/librte_eal/common/eal_common_fbarray.c
> +++ b/lib/librte_eal/common/eal_common_fbarray.c
> @@ -8,8 +8,8 @@
>  #include <sys/mman.h>
>  #include <stdint.h>
>  #include <errno.h>
> -#include <sys/file.h>
>  #include <string.h>
> +#include <unistd.h>
>  
>  #include <rte_common.h>
>  #include <rte_log.h>
> @@ -85,10 +85,8 @@ resize_and_map(int fd, void *addr, size_t len)
>  	char path[PATH_MAX];
>  	void *map_addr;
>  
> -	if (ftruncate(fd, len)) {
> +	if (eal_file_truncate(fd, len)) {
>  		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
> -		/* pass errno up the chain */
> -		rte_errno = errno;
>  		return -1;
>  	}
>  
> @@ -772,15 +770,15 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  		 * and see if we succeed. If we don't, someone else is using it
>  		 * already.
>  		 */
> -		fd = open(path, O_CREAT | O_RDWR, 0600);
> +		fd = eal_file_create(path);
>  		if (fd < 0) {
>  			RTE_LOG(DEBUG, EAL, "%s(): couldn't open %s: %s\n",
> -					__func__, path, strerror(errno));
> -			rte_errno = errno;
> +				__func__, path, rte_strerror(rte_errno));
>  			goto fail;
> -		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
> +		} else if (eal_file_lock(
> +				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
>  			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
> -					__func__, path, strerror(errno));
> +				__func__, path, rte_strerror(rte_errno));
>  			rte_errno = EBUSY;
>  			goto fail;
>  		}
> @@ -789,10 +787,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  		 * still attach to it, but no other process could reinitialize
>  		 * it.
>  		 */
> -		if (flock(fd, LOCK_SH | LOCK_NB)) {
> -			rte_errno = errno;
> +		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
>  			goto fail;
> -		}
>  
>  		if (resize_and_map(fd, data, mmap_len))
>  			goto fail;
> @@ -888,17 +884,14 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  
>  	eal_get_fbarray_path(path, sizeof(path), arr->name);
>  
> -	fd = open(path, O_RDWR);
> +	fd = eal_file_open(path, true);
>  	if (fd < 0) {
> -		rte_errno = errno;
>  		goto fail;
>  	}
>  
>  	/* lock the file, to let others know we're using it */
> -	if (flock(fd, LOCK_SH | LOCK_NB)) {
> -		rte_errno = errno;
> +	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
>  		goto fail;
> -	}
>  
>  	if (resize_and_map(fd, data, mmap_len))
>  		goto fail;
> @@ -1025,7 +1018,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  		 * has been detached by all other processes
>  		 */
>  		fd = tmp->fd;
> -		if (flock(fd, LOCK_EX | LOCK_NB)) {
> +		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
>  			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
>  			rte_errno = EBUSY;
>  			ret = -1;
> @@ -1042,7 +1035,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  			 * we're still holding an exclusive lock, so drop it to
>  			 * shared.
>  			 */
> -			flock(fd, LOCK_SH | LOCK_NB);
> +			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
>  
>  			ret = -1;
>  			goto out;
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index 869ce183a..727f26881 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -420,4 +420,78 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
>  
>  void eal_free_no_trace(void *addr);
>  
> +/**
> + * Create a file or open it if exits.
> + *
> + * Newly created file is only accessible to the owner (0600 equivalent).
> + * Returned descriptor is always read/write.
> + *
> + * @param path
> + *  Path to the file.
> + * @return
> + *  Open file descriptor on success, (-1) on failure and rte_errno is set.
> + */
> +int
> +eal_file_create(const char *path);
> +
> +/**
> + * Open an existing file.
> + *
> + * @param path
> + *  Path to the file.
> + * @param writable
> + *  Whether to open file read/write or read-only.
> + * @return
> + *  Open file descriptor on success, (-1) on failure and rte_errno is set.
> + */
> +int
> +eal_file_open(const char *path, bool writable);
> +
> +/** File locking operation. */
> +enum eal_flock_op {
> +	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
> +	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
> +	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
> +};
> +
> +/** Behavior on file locking conflict. */
> +enum eal_flock_mode {
> +	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
> +	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
> +};
> +
> +/**
> + * Lock or unlock the file.
> + *
> + * On failure @code rte_errno @endcode is set to the error code
> + * specified by POSIX flock(3) description.
> + *
> + * @param fd
> + *  Opened file descriptor.
> + * @param op
> + *  Operation to perform.
> + * @param mode
> + *  Behavior on conflict.
> + * @return
> + *  0 on success, (-1) on failure.
> + */
> +int
> +eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
> +
> +/**
> + * Truncate or extend the file to the specified size.
> + *
> + * On failure @code rte_errno @endcode is set to the error code
> + * specified by POSIX ftruncate(3) description.
> + *
> + * @param fd
> + *  Opened file descriptor.
> + * @param size
> + *  Desired file size.
> + * @return
> + *  0 on success, (-1) on failure.
> + */
> +int
> +eal_file_truncate(int fd, ssize_t size);
> +
>  #endif /* _EAL_PRIVATE_H_ */
> diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
> index af95386d4..0f8741d96 100644
> --- a/lib/librte_eal/freebsd/Makefile
> +++ b/lib/librte_eal/freebsd/Makefile
> @@ -7,6 +7,7 @@ LIB = librte_eal.a
>  
>  ARCH_DIR ?= $(RTE_ARCH)
>  VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
> +VPATH += $(RTE_SDK)/lib/librte_eal/unix
>  VPATH += $(RTE_SDK)/lib/librte_eal/common
>  
>  CFLAGS += -I$(SRCDIR)/include
> @@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
>  
> +# from unix dir
> +SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
> +
>  # from arch dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
> diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
> index 48cc34844..331489f99 100644
> --- a/lib/librte_eal/linux/Makefile
> +++ b/lib/librte_eal/linux/Makefile
> @@ -7,6 +7,7 @@ LIB = librte_eal.a
>  
>  ARCH_DIR ?= $(RTE_ARCH)
>  VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
> +VPATH += $(RTE_SDK)/lib/librte_eal/unix
>  VPATH += $(RTE_SDK)/lib/librte_eal/common
>  
>  CFLAGS += -I$(SRCDIR)/include
> @@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
>  
> +# from unix dir
> +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
> +
>  # from arch dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
> diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
> index e301f4558..8d492897d 100644
> --- a/lib/librte_eal/meson.build
> +++ b/lib/librte_eal/meson.build
> @@ -6,6 +6,10 @@ subdir('include')
>  
>  subdir('common')
>  
> +if not is_windows
> +	subdir('unix')
> +endif
> +
>  dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
>  subdir(exec_env)
>  
> diff --git a/lib/librte_eal/unix/eal_file.c b/lib/librte_eal/unix/eal_file.c
> new file mode 100644
> index 000000000..7b3ffa629
> --- /dev/null
> +++ b/lib/librte_eal/unix/eal_file.c
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <sys/file.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <rte_errno.h>
> +
> +#include "eal_private.h"
> +
> +int
> +eal_file_create(const char *path)
> +{
> +	int ret;
> +
> +	ret = open(path, O_CREAT | O_RDWR, 0600);
> +	if (ret < 0)
> +		rte_errno = errno;
> +
> +	return ret;
> +}
> +
You don't need this call if you support the oflags option in the open call
below.

> +int
> +eal_file_open(const char *path, bool writable)
> +{
> +	int ret, flags;
> +
> +	flags = writable ? O_RDWR : O_RDONLY;
> +	ret = open(path, flags);
> +	if (ret < 0)
> +		rte_errno = errno;
> +
> +	return ret;
> +}
> +
why are you changing this api from the posix file format (with oflags
specified).  As far as I can see both unix and windows platforms support that

> +int
> +eal_file_truncate(int fd, ssize_t size)
> +{
> +	int ret;
> +
> +	ret = ftruncate(fd, size);
> +	if (ret)
> +		rte_errno = errno;
> +
> +	return ret;
> +}
> +
> +int
> +eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
> +{
> +	int sys_flags = 0;
> +	int ret;
> +
> +	if (mode == EAL_FLOCK_RETURN)
> +		sys_flags |= LOCK_NB;
> +
> +	switch (op) {
> +	case EAL_FLOCK_EXCLUSIVE:
> +		sys_flags |= LOCK_EX;
> +		break;
> +	case EAL_FLOCK_SHARED:
> +		sys_flags |= LOCK_SH;
> +		break;
> +	case EAL_FLOCK_UNLOCK:
> +		sys_flags |= LOCK_UN;
> +		break;
> +	}
> +
> +	ret = flock(fd, sys_flags);
> +	if (ret)
> +		rte_errno = errno;
> +
> +	return ret;
> +}
> diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
> new file mode 100644
> index 000000000..21029ba1a
> --- /dev/null
> +++ b/lib/librte_eal/unix/meson.build
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2020 Dmitry Kozlyuk
> +
> +sources += files(
> +	'eal_file.c',
> +)
> -- 
> 2.25.4
> 
> 

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations
  2020-06-03 12:07           ` Neil Horman
@ 2020-06-03 12:34             ` Dmitry Kozlyuk
  2020-06-04 21:07               ` Neil Horman
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-03 12:34 UTC (permalink / raw)
  To: Neil Horman
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson

On Wed, 3 Jun 2020 08:07:59 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

[snip]
> > +int
> > +eal_file_create(const char *path)
> > +{
> > +	int ret;
> > +
> > +	ret = open(path, O_CREAT | O_RDWR, 0600);
> > +	if (ret < 0)
> > +		rte_errno = errno;
> > +
> > +	return ret;
> > +}
> > +  
> You don't need this call if you support the oflags option in the open call
> below.

See below.

> > +int
> > +eal_file_open(const char *path, bool writable)
> > +{
> > +	int ret, flags;
> > +
> > +	flags = writable ? O_RDWR : O_RDONLY;
> > +	ret = open(path, flags);
> > +	if (ret < 0)
> > +		rte_errno = errno;
> > +
> > +	return ret;
> > +}
> > +  
> why are you changing this api from the posix file format (with oflags
> specified).  As far as I can see both unix and windows platforms support that

There is a number of caveats, which IMO make this approach better:

1. Filesystem permissions on Windows are complicated. Supporting anything
other than 0600 would add a lot of code, while EAL doesn't really need it.
Microsoft's open() takes not permission bits, but a set of flags.

2. Restricted interface prevents EAL developers from accidentally using
features not supported on all platforms via a seemingly rich API.

3. Microsoft CRT (the one Clang is using) deprecates open() in favor of
_sopen_s() and issues a warning, and we're targeting -Werror. Disabling all
such warnings (_CRT_SECURE_NO_DEPRECATE) doesn't seem right when CRT vendor
encourages using alternatives. This is the primary reason for open()
wrappers in v6.
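
For illustration only, an _sopen_s() equivalent of eal_file_create()
would look roughly like this (a sketch, assuming Microsoft CRT headers):

	int fd = -1;
	errno_t err = _sopen_s(&fd, path, _O_CREAT | _O_RDWR,
		_SH_DENYNO, _S_IREAD | _S_IWRITE);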

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations
  2020-06-03 12:34             ` Dmitry Kozlyuk
@ 2020-06-04 21:07               ` Neil Horman
  2020-06-05  0:16                 ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Neil Horman @ 2020-06-04 21:07 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson

On Wed, Jun 03, 2020 at 03:34:03PM +0300, Dmitry Kozlyuk wrote:
> On Wed, 3 Jun 2020 08:07:59 -0400
> Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> [snip]
> > > +int
> > > +eal_file_create(const char *path)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = open(path, O_CREAT | O_RDWR, 0600);
> > > +	if (ret < 0)
> > > +		rte_errno = errno;
> > > +
> > > +	return ret;
> > > +}
> > > +  
> > You don't need this call if you support the oflags option in the open call
> > below.
> 
> See below.
> 
> > > +int
> > > +eal_file_open(const char *path, bool writable)
> > > +{
> > > +	int ret, flags;
> > > +
> > > +	flags = writable ? O_RDWR : O_RDONLY;
> > > +	ret = open(path, flags);
> > > +	if (ret < 0)
> > > +		rte_errno = errno;
> > > +
> > > +	return ret;
> > > +}
> > > +  
> > why are you changing this api from the posix file format (with oflags
> > specified).  As far as I can see both unix and windows platforms support that
> 
> There is a number of caveats, which IMO make this approach better:
> 
> 1. Filesystem permissions on Windows are complicated. Supporting anything
> other than 0600 would add a lot of code, while EAL doesn't really need it.
> Microsoft's open() takes not permission bits, but a set of flags.
> 
> 2. Restricted interface prevents EAL developers from accidentally using
> features not supported on all platforms via a seemingly rich API.
> 
> 3. Microsoft CRT (the one Clang is using) deprecates open() in favor of
> _sopen_s() and issues a warning, and we're targeting -Werror. Disabling all
> such warnings (_CRT_SECURE_NO_DEPRECATE) doesn't seem right when CRT vendor
> encourages using alternatives. This is the primary reason for open()
> wrappers in v6.
> 

that seems a bit shortsighted to me.  Creating wrappers that restrict
functionality to the least common denominator of supported platforms restricts
what all platforms are capable of.  For example, there's no reason that the eal
library shouldn't be able to open a file O_TRUNC or O_SYNC just because it's
complex to do it on a single platform.

The API should be written to support the full range of functionality on all
platforms, and the individual implementations should write the code to make that
happen, or return an error that it's unsupported on this particular platform.

I'm not saying that you have to implement everything now, but you shouldn't
restrict the API from being able to do so in the future.  Otherwise, in the
future, if someone wants to implement O_TRUNC support (just to cite an example),
they're going to have to make a change to the API above, and alter the
implementation for all the platforms anyway.  You may as well make the API
robust enough to support that now.

Neil
> -- 
> Dmitry Kozlyuk
> 

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations
  2020-06-04 21:07               ` Neil Horman
@ 2020-06-05  0:16                 ` Dmitry Kozlyuk
  2020-06-05 11:19                   ` Neil Horman
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-05  0:16 UTC (permalink / raw)
  To: Neil Horman
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson

On Thu, 4 Jun 2020 17:07:07 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Jun 03, 2020 at 03:34:03PM +0300, Dmitry Kozlyuk wrote:
> > On Wed, 3 Jun 2020 08:07:59 -0400
> > Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > [snip]  
> > > > +int
> > > > +eal_file_create(const char *path)
> > > > +{
> > > > +	int ret;
> > > > +
> > > > +	ret = open(path, O_CREAT | O_RDWR, 0600);
> > > > +	if (ret < 0)
> > > > +		rte_errno = errno;
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +    
> > > You don't need this call if you support the oflags option in the open call
> > > below.  
> > 
> > See below.
> >   
> > > > +int
> > > > +eal_file_open(const char *path, bool writable)
> > > > +{
> > > > +	int ret, flags;
> > > > +
> > > > +	flags = writable ? O_RDWR : O_RDONLY;
> > > > +	ret = open(path, flags);
> > > > +	if (ret < 0)
> > > > +		rte_errno = errno;
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +    
> > > why are you changing this api from the posix file format (with oflags
> > > specified).  As far as I can see both unix and windows platforms support that  
> > 
> > There is a number of caveats, which IMO make this approach better:
> > 
> > 1. Filesystem permissions on Windows are complicated. Supporting anything
> > other than 0600 would add a lot of code, while EAL doesn't really need it.
> > Microsoft's open() takes not permission bits, but a set of flags.
> > 
> > 2. Restricted interface prevents EAL developers from accidentally using
> > features not supported on all platforms via a seemingly rich API.
> > 
> > 3. Microsoft CRT (the one Clang is using) deprecates open() in favor of
> > _sopen_s() and issues a warning, and we're targeting -Werror. Disabling all
> > such warnings (_CRT_SECURE_NO_DEPRECATE) doesn't seem right when CRT vendor
> > encourages using alternatives. This is the primary reason for open()
> > wrappers in v6.
> >   
> 
> that seems a bit shortsighted to me.  Creating wrappers that restrict
> functionality to the least common denominator of supported platforms restricts
> what all platforms are capable of.  For example, there's no reason that the eal
> library shouldn't be able to open a file O_TRUNC or O_SYNC just because it's
> complex to do it on a single platform.

The purpose of these wrappers is to maximize reuse of common code. It doesn't
require POSIX per se, it's just implemented in terms of an API that had been
available on all supported OSes until the Windows target was introduced. The
wrapper interface is derived from common code requirements.

> The API should be written to support the full range of functionality on all
> platforms, and the individual implementations should write the code to make that
> happen, or return an error that it's unsupported on this particular platform.

IMO, common code, by definition, should avoid partial support of anything.

> I'm not saying that you have to implement everything now, but you shouldn't
> restrict the API from being able to do so in the future.  Otherwise, in the
> future, if someone wants to implement O_TRUNC support (just to cite an example),
> they're going to have to make a change to the API above, and alter the
> implementation for all the platforms anyway.  You may as well make the API
> robust enough to support that now.

I agree that these particular wrappers can have a lot more options, so
probably flags would be better. However, I wouldn't add parameters that
have partial support, namely, permissions. What do you think of the following
(names shortened)?

enum mode {
	RO = 0,	/* write-only is not portable */
	RW = 1,
	CREATE = 2	/* always 0600 equivalent */
};

eal_file_open(const char *path, int mode);
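
A call creating a file for read/write access would then be, e.g.:

	fd = eal_file_open(path, RW | CREATE);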

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations
  2020-06-05  0:16                 ` Dmitry Kozlyuk
@ 2020-06-05 11:19                   ` Neil Horman
  0 siblings, 0 replies; 218+ messages in thread
From: Neil Horman @ 2020-06-05 11:19 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Thomas Monjalon, Anatoly Burakov,
	Bruce Richardson

On Fri, Jun 05, 2020 at 03:16:03AM +0300, Dmitry Kozlyuk wrote:
> On Thu, 4 Jun 2020 17:07:07 -0400
> Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Wed, Jun 03, 2020 at 03:34:03PM +0300, Dmitry Kozlyuk wrote:
> > > On Wed, 3 Jun 2020 08:07:59 -0400
> > > Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > [snip]  
> > > > > +int
> > > > > +eal_file_create(const char *path)
> > > > > +{
> > > > > +	int ret;
> > > > > +
> > > > > +	ret = open(path, O_CREAT | O_RDWR, 0600);
> > > > > +	if (ret < 0)
> > > > > +		rte_errno = errno;
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +    
> > > > You don't need this call if you support the oflags option in the open call
> > > > below.  
> > > 
> > > See below.
> > >   
> > > > > +int
> > > > > +eal_file_open(const char *path, bool writable)
> > > > > +{
> > > > > +	int ret, flags;
> > > > > +
> > > > > +	flags = writable ? O_RDWR : O_RDONLY;
> > > > > +	ret = open(path, flags);
> > > > > +	if (ret < 0)
> > > > > +		rte_errno = errno;
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +    
> > > > why are you changing this api from the posix file format (with oflags
> > > > specified).  As far as I can see both unix and windows platforms support that  
> > > 
> > > There is a number of caveats, which IMO make this approach better:
> > > 
> > > 1. Filesystem permissions on Windows are complicated. Supporting anything
> > > other than 0600 would add a lot of code, while EAL doesn't really need it.
> > > Microsoft's open() takes not permission bits, but a set of flags.
> > > 
> > > 2. Restricted interface prevents EAL developers from accidentally using
> > > features not supported on all platforms via a seemingly rich API.
> > > 
> > > 3. Microsoft CRT (the one Clang is using) deprecates open() in favor of
> > > _sopen_s() and issues a warning, and we're targeting -Werror. Disabling all
> > > such warnings (_CRT_SECURE_NO_DEPRECATE) doesn't seem right when CRT vendor
> > > encourages using alternatives. This is the primary reason for open()
> > > wrappers in v6.
> > >   
> > 
> > that seems a bit shortsighted to me.  Creating wrappers that restrict
> > functionality to the least common denominator of supported platforms restricts
> > what all platforms are capable of.  For example, there's no reason that the eal
> > library shouldn't be able to open a file O_TRUNC or O_SYNC just because it's
> > complex to do it on a single platform.
> 
> The purpose of these wrappers is to maximize reuse of common code. It doesn't
> require POSIX per se, it's just implemented in terms of an API that had been
> available on all supported OSes until the Windows target was introduced. The
> wrapper interface is derived from common code requirements.
> 
Sure, and I'm fine with that.  What I'm concerned about is implementing wrappers
that define their APIs in terms of what's currently implemented.  There's no
reason that the existing in-use feature set won't be built upon in the future,
and the API should be able to handle that.

> > The API should be written to support the full range of functionality on all
> > platforms, and the individual implementations should write the code to make that
> > happen, or return an error that it's unsupported on this particular platform.
> 
> IMO, common code, by definition, should avoid partial support of anything.
> 
I disagree.  Anytime you abstract an implementation to a more generic api, you
have the possibility that a given implementation won't offer full support for
all of the API's features, and that's ok, as long as there are no users of the
features, and a given implementation properly returns an error when their usage
is attempted.  The expectation, then, is that the user of the feature will add
the feature to all implementations, so that the code can remain portable.  What
you shouldn't do is define the API such that those features can't be implemented
without having to change the API, as that runs the potential risk of having to
modify the ABI.  That's probably not the case here, but the notion stands.  If
you write the API to encompass the superset of supported platforms' features, the
rest is implementation details.

> > I'm not saying that you have to implement everything now, but you shouldn't
> > restrict the API from being able to do so in the future.  Otherwise, in the
> > future, if someone wants to implement O_TRUNC support (just to cite an example),
> > they're going to have to make a change to the API above, and alter the
> > implementation for all the platforms anyway.  You may as well make the API
> > robust enough to support that now.
> 
> I agree that these particular wrappers can have a lot more options, so
> probably flags would be better. However, I wouldn't add parameters that
> have partial support, namely, permissions.
But Windows does offer file and folder permissions:
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/bb727008(v=technet.10)?redirectedfrom=MSDN

As you said, they are complicated, and the security model is different, but
those are details that can be worked out/emulated when the need arises.

> What do you think of the following
> (names shortened)?
> 
> enum mode {
> 	RO = 0,	/* write-only is not portable */
> 	RW = 1,
> 	CREATE = 2	/* always 0600 equivalent */
> };
> 
> eal_file_open(const char *path, int mode);
> 
Yeah, that makes sense to me. I'd be good with that.

Neil

> -- 
> Dmitry Kozlyuk
> 

^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 00/11] Windows basic memory management
  2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                           ` (10 preceding siblings ...)
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-06-08  7:41         ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
                             ` (11 more replies)
  11 siblings, 12 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing without IOVA is available.

Testing revealed that Windows Server 2019 does not allow allocating hugepage
memory at a reserved address, despite the advertised API, so the allocator has
to temporarily free the region to be allocated. This creates an inherent race
condition. The issue is being discussed with Microsoft privately.
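
For illustration, a rough sketch of the workaround (not the literal patch
code; requested_addr and size describe the previously reserved region):

    void *addr;

    /* Windows Server 2019 rejects MEM_LARGE_PAGES at a reserved address,
     * so the reservation must be released first; between the two calls
     * another thread or process may claim the region.
     */
    VirtualFreeEx(GetCurrentProcess(), requested_addr, 0, MEM_RELEASE);
    addr = VirtualAllocEx(GetCurrentProcess(), requested_addr, size,
        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
    if (addr == NULL) {
        /* Lost the race: the caller has to retry or fail. */
    }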

New EAL public functions for memory mapping are introduced to mitigate
OS differences in DPDK libraries and applications: rte_mem_map,
rte_mem_unmap, rte_mem_lock, rte_mem_page_size.

To support common MM routines, internal wrappers for low-level memory
reservation and file management are introduced. These changes affect
Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
(suggested by Thomas).

To avoid code duplication between Linux and Windows EAL, common code
for EALs supporting dynamic memory allocation is extracted
(discussed with Anatoly Burakov in v4 thread). This is a separate
patch to ease the review, but it can be merged with the previous one.

EAL tracepoints save size_t values as long, which is lossy on 64-bit Windows,
where long is only 32 bits wide. A new size_t emitter for tracepoints is
introduced (suggested by Jerin Jacob to Fady Bader, see [1]). Also, to avoid
a workaround in every file using the tracepoints, stubs are added to Windows EAL.
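
For illustration, a tracepoint using the new emitter could look like the
sketch below (the tracepoint itself is made up; only the size_t emitter is
from this series):

    #include <rte_trace_point.h>

    /* Hypothetical tracepoint; rte_trace_point_emit_size_t() keeps the
     * field width correct on every OS, unlike emitting it as long.
     */
    RTE_TRACE_POINT(
        rte_eal_trace_example,
        RTE_TRACE_POINT_ARGS(size_t len),
        rte_trace_point_emit_size_t(len);
    )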

The entire <sys/queue.h> is imported from FreeBSD, replacing the existing
partial import. There is already a license exception for this file.
The file is imported as-is, so it causes a bunch of checkpatch warnings.

[1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html

---

v7:
    * Change EAL internal file management API (Neil Horman).

v6:
    * Fix 32-bit build on x86 (CI).
    * Fix Makefile build (Anatoly Burakov, Thomas Monjalon).
    * Restore 32-bit common code (Anatoly Burakov).
    * Fix error reporting in memory management (Anatoly Burakov).
    * Add Doxygen comment for size_t tracepoint emitter (Jerin Jacob).
    * Update MAINTAINERS for new files and new code (Thomas Monjalon).
    * Rename rte_get_page_size to rte_mem_page_size.
    * Mark DPDK-only wrappers internal, move them to separate file.
    * Get rid of warnings in enabled common code with Clang on Windows.

v5:
    * Fix allocation and deallocation on Windows Server (Fady Bader).
    * Replace remaining VirtualFree with VirtualFreeEx (Ranjit Menon).
    * Fix errors in eal_get_virtual_area (Anatoly Burakov).
    * Fix error handling and documentation for rte_mem_lock (Anatoly Burakov).
    * Extract common code for EALs w/dynamic allocation (Anatoly Burakov).
    * Use POSIX value for rte_errno after rte_mem_unmap() on Windows.
    * Add stubs to use tracing functions without workarounds.

v4:
    * Rebase on ToT, drop patches merged into master.
    * Rearrange patches to split Windows code (Jerin).
    * Fix Linux and FreeBSD build with make (Ophir).
    * Use int instead of enum to hold a set of flags (Anatoly).
    * Rename eal_mem_reserve items and fix their description (Anatoly).
    * Add eal_mem_set_dump() wrapper around madvise (Anatoly).
    * Don't claim Windows Server 2016 support due to lack of API (Tal).
    * Replace enum rte_page_sizes with a set of #defines (Jerin).
    * Fix documentation, SPDX tags, logging (Thomas).

v3:
    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:
    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.

Dmitry Kozlyuk (11):
  eal: replace rte_page_sizes with a set of constants
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal/mem: extract common code for memseg list initialization
  eal/mem: extract common code for dynamic memory allocation
  trace: add size_t field emitter
  eal/windows: add tracing support stubs
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal/windows: implement basic memory management

 MAINTAINERS                                   |   9 +
 config/meson.build                            |  12 +-
 doc/guides/rel_notes/release_20_08.rst        |   2 +
 doc/guides/windows_gsg/build_dpdk.rst         |  20 -
 doc/guides/windows_gsg/index.rst              |   1 +
 doc/guides/windows_gsg/run_apps.rst           |  95 +++
 lib/librte_eal/common/eal_common_dynmem.c     | 521 +++++++++++++
 lib/librte_eal/common/eal_common_fbarray.c    |  71 +-
 lib/librte_eal/common/eal_common_memory.c     | 158 +++-
 lib/librte_eal/common/eal_common_thread.c     |   5 +-
 lib/librte_eal/common/eal_private.h           | 254 ++++++-
 lib/librte_eal/common/meson.build             |  16 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/freebsd/Makefile               |   5 +
 lib/librte_eal/freebsd/eal_memory.c           |  98 +--
 lib/librte_eal/include/rte_eal_memory.h       |  93 +++
 lib/librte_eal/include/rte_eal_trace.h        |   8 +-
 lib/librte_eal/include/rte_memory.h           |  23 +-
 lib/librte_eal/include/rte_trace_point.h      |   3 +
 lib/librte_eal/linux/Makefile                 |   6 +
 lib/librte_eal/linux/eal_memalloc.c           |   5 +-
 lib/librte_eal/linux/eal_memory.c             | 614 +--------------
 lib/librte_eal/meson.build                    |   4 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/rte_eal_version.map            |   9 +
 lib/librte_eal/unix/eal_file.c                |  80 ++
 lib/librte_eal/unix/eal_unix_memory.c         | 152 ++++
 lib/librte_eal/unix/meson.build               |   7 +
 lib/librte_eal/windows/eal.c                  | 107 +++
 lib/librte_eal/windows/eal_file.c             | 125 +++
 lib/librte_eal/windows/eal_hugepages.c        | 108 +++
 lib/librte_eal/windows/eal_lcore.c            | 185 +++--
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  85 +++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/sys/queue.h    | 663 ++++++++++++++--
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   7 +
 lib/librte_mempool/rte_mempool_trace.h        |  10 +-
 44 files changed, 4053 insertions(+), 939 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c
 create mode 100644 lib/librte_eal/include/rte_eal_memory.h
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 01/11] eal: replace rte_page_sizes with a set of constants
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v7 00/11] Windows " Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                             ` (10 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, John McNamara,
	Marko Kovacevic, Anatoly Burakov

Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
Enum rte_page_sizes has members valued above this limit, which get
wrapped to zero, resulting in a compilation error (duplicate values in
enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.

Remove rte_page_sizes and replace its values with #define's.
This enumeration is not used in public API, so there's no ABI breakage.
Announce API changes for 20.08 in documentation.
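
For illustration (example names, not patch code):

    /* Under MS ABI an enumerator must fit into a 32-bit int, so with
     * Clang targeting Windows both constants below wrap to 0, producing
     * the duplicate-value compilation error described above.
     */
    enum example_page_sizes {
        EX_PGSIZE_4G  = 1ULL << 32,
        EX_PGSIZE_16G = 1ULL << 34
    };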

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 doc/guides/rel_notes/release_20_08.rst |  2 ++
 lib/librte_eal/include/rte_memory.h    | 23 ++++++++++-------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index 39064afbe..2041a29b9 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -85,6 +85,8 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* ``rte_page_sizes`` enumeration is replaced with ``RTE_PGSIZE_xxx`` defines.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..65374d53a 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -24,19 +24,16 @@ extern "C" {
 #include <rte_config.h>
 #include <rte_fbarray.h>
 
-__extension__
-enum rte_page_sizes {
-	RTE_PGSIZE_4K    = 1ULL << 12,
-	RTE_PGSIZE_64K   = 1ULL << 16,
-	RTE_PGSIZE_256K  = 1ULL << 18,
-	RTE_PGSIZE_2M    = 1ULL << 21,
-	RTE_PGSIZE_16M   = 1ULL << 24,
-	RTE_PGSIZE_256M  = 1ULL << 28,
-	RTE_PGSIZE_512M  = 1ULL << 29,
-	RTE_PGSIZE_1G    = 1ULL << 30,
-	RTE_PGSIZE_4G    = 1ULL << 32,
-	RTE_PGSIZE_16G   = 1ULL << 34,
-};
+#define RTE_PGSIZE_4K   (1ULL << 12)
+#define RTE_PGSIZE_64K  (1ULL << 16)
+#define RTE_PGSIZE_256K (1ULL << 18)
+#define RTE_PGSIZE_2M   (1ULL << 21)
+#define RTE_PGSIZE_16M  (1ULL << 24)
+#define RTE_PGSIZE_256M (1ULL << 28)
+#define RTE_PGSIZE_512M (1ULL << 29)
+#define RTE_PGSIZE_1G   (1ULL << 30)
+#define RTE_PGSIZE_4G   (1ULL << 32)
+#define RTE_PGSIZE_16G  (1ULL << 34)
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 02/11] eal: introduce internal wrappers for file operations
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v7 00/11] Windows " Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
                             ` (9 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Neil Horman, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Introduce OS-independent wrappers in order to support common EAL code
on Unix and Windows:

* eal_file_open: open or create a file.
* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

The implementation for Linux and FreeBSD is placed in the "unix" subdirectory,
which is intended for code common to the two. These thin wrappers
require no special maintenance.

Common code supporting multi-process doesn't use the new wrappers,
because it is inherently Unix-specific and would impose excessive
requirements on the wrappers.
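
For illustration, typical use of the three wrappers (error handling
shortened; the names are those defined in this patch):

    fd = eal_file_open(path, EAL_OPEN_CREATE | EAL_OPEN_READWRITE);
    if (fd < 0)
        return -1; /* rte_errno is already set */

    /* Fail immediately instead of waiting if another process holds it. */
    if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0)
        return -1;

    if (eal_file_truncate(fd, mem_size) < 0)
        return -1;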

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                                |  3 +
 lib/librte_eal/common/eal_common_fbarray.c | 31 ++++-----
 lib/librte_eal/common/eal_private.h        | 73 ++++++++++++++++++++
 lib/librte_eal/freebsd/Makefile            |  4 ++
 lib/librte_eal/linux/Makefile              |  4 ++
 lib/librte_eal/meson.build                 |  4 ++
 lib/librte_eal/unix/eal_file.c             | 80 ++++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |  6 ++
 8 files changed, 186 insertions(+), 19 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/MAINTAINERS b/MAINTAINERS
index d2b286701..1d9aff26d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -323,6 +323,9 @@ FreeBSD UIO
 M: Bruce Richardson <bruce.richardson@intel.com>
 F: kernel/freebsd/nic_uio/
 
+Unix shared files
+F: lib/librte_eal/unix/
+
 Windows support
 M: Harini Ramakrishnan <harini.ramakrishnan@microsoft.com>
 M: Omar Cardona <ocardona@microsoft.com>
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 4f8f1af73..c52ddb967 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -8,8 +8,8 @@
 #include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -85,10 +85,8 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 
@@ -772,15 +770,15 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * and see if we succeed. If we don't, someone else is using it
 		 * already.
 		 */
-		fd = open(path, O_CREAT | O_RDWR, 0600);
+		fd = eal_file_open(path, EAL_OPEN_CREATE | EAL_OPEN_READWRITE);
 		if (fd < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't open %s: %s\n",
-					__func__, path, strerror(errno));
-			rte_errno = errno;
+				__func__, path, rte_strerror(rte_errno));
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
-					__func__, path, strerror(errno));
+				__func__, path, rte_strerror(rte_errno));
 			rte_errno = EBUSY;
 			goto fail;
 		}
@@ -789,10 +787,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -888,17 +884,14 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 
 	eal_get_fbarray_path(path, sizeof(path), arr->name);
 
-	fd = open(path, O_RDWR);
+	fd = eal_file_open(path, EAL_OPEN_READWRITE);
 	if (fd < 0) {
-		rte_errno = errno;
 		goto fail;
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -1025,7 +1018,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,7 +1035,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 869ce183a..6733a2321 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -420,4 +420,77 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
 
 void eal_free_no_trace(void *addr);
 
+/** Options for eal_file_open(). */
+enum eal_open_flags {
+	/** Open file for reading. */
+	EAL_OPEN_READONLY = 0x00,
+	/** Open file for reading and writing. */
+	EAL_OPEN_READWRITE = 0x02,
+	/**
+	 * Create the file if it doesn't exist.
+	 * New files are only accessible to the owner (0600 equivalent).
+	 */
+	EAL_OPEN_CREATE = 0x04
+};
+
+/**
+ * Open or create a file.
+ *
+ * @param path
+ *  Path to the file.
+ * @param flags
+ *  A combination of eal_open_flags controlling operation and FD behavior.
+ * @return
+ *  Open file descriptor on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_file_open(const char *path, int flags);
+
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index af95386d4..0f8741d96 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 48cc34844..331489f99 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index e301f4558..8d492897d 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal_file.c b/lib/librte_eal/unix/eal_file.c
new file mode 100644
index 000000000..1b26475ba
--- /dev/null
+++ b/lib/librte_eal/unix/eal_file.c
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_open(const char *path, int flags)
+{
+	static const int MODE_MASK = EAL_OPEN_READONLY | EAL_OPEN_READWRITE;
+
+	int ret, sys_flags;
+
+	switch (flags & MODE_MASK) {
+	case EAL_OPEN_READONLY:
+		sys_flags = O_RDONLY;
+		break;
+	case EAL_OPEN_READWRITE:
+		sys_flags = O_RDWR;
+		break;
+	default:
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (flags & EAL_OPEN_CREATE)
+		sys_flags |= O_CREAT;
+
+	ret = open(path, sys_flags, 0600);
+	if (ret < 0)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..21029ba1a
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal_file.c',
+)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 03/11] eal: introduce memory management wrappers
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v7 00/11] Windows " Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
                             ` (8 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson, Ray Kinsella, Neil Horman

Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:

* rte_mem_map()
* rte_mem_unmap()
* rte_mem_page_size()
* rte_mem_lock()

Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive. New symbols are internal. Being thin wrappers, they require
no special maintenance.
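
For illustration, a rough sketch of using the new wrappers (not code from
the series itself):

    size_t page_sz = rte_mem_page_size();
    void *va = rte_mem_map(NULL, page_sz,
        RTE_PROT_READ | RTE_PROT_WRITE,
        RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
    if (va == NULL)
        return -1; /* rte_errno holds the OS error */
    if (rte_mem_lock(va, page_sz) < 0)
        RTE_LOG(DEBUG, EAL, "cannot lock page\n");
    rte_mem_unmap(va, page_sz);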

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c |  40 +++---
 lib/librte_eal/common/eal_common_memory.c  |  61 ++++-----
 lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
 lib/librte_eal/freebsd/Makefile            |   1 +
 lib/librte_eal/include/rte_eal_memory.h    |  93 +++++++++++++
 lib/librte_eal/linux/Makefile              |   1 +
 lib/librte_eal/linux/eal_memalloc.c        |   5 +-
 lib/librte_eal/rte_eal_version.map         |   9 ++
 lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |   1 +
 10 files changed, 376 insertions(+), 65 deletions(-)
 create mode 100644 lib/librte_eal/include/rte_eal_memory.h
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index c52ddb967..98fd6e1f6 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,16 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
 #include <string.h>
 #include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
+#include <rte_eal_memory.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -90,12 +91,9 @@ resize_and_map(int fd, void *addr, size_t len)
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -733,7 +731,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -754,11 +752,13 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		static const int flags = RTE_MAP_FORCE_ADDRESS |
+			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
+		void *new_data = rte_mem_map(data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
-					__func__, strerror(errno));
+					__func__, rte_strerror(rte_errno));
 			goto fail;
 		}
 	} else {
@@ -820,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -858,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -909,7 +909,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -937,8 +937,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -957,7 +956,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -992,8 +991,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1042,7 +1040,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4c897a13f..f9fbd3e4e 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,13 +11,13 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_eal_memconfig.h>
+#include <rte_eal_memory.h>
 #include <rte_errno.h>
 #include <rte_log.h>
 
@@ -40,18 +40,10 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
-#ifdef RTE_EXEC_ENV_LINUX
-#define RTE_DONTDUMP MADV_DONTDUMP
-#elif defined RTE_EXEC_ENV_FREEBSD
-#define RTE_DONTDUMP MADV_NOCORE
-#else
-#error "madvise doesn't support this OS"
-#endif
-
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, int reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -59,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_mem_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -105,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
 			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -132,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -161,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -179,19 +166,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	if (!unmap) {
 		/* Exclude these pages from a core dump. */
-		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
-			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
-				strerror(errno));
+		eal_mem_set_dump(aligned_addr, *size, false);
 	}
 
 	return aligned_addr;
@@ -547,10 +532,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	size_t page_size = rte_mem_page_size();
+	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 6733a2321..3173f1d67 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/**
+	 * Reserve hugepages. May be unsupported by some platforms.
+	 */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/**
+	 * Force reserving memory at the requested address.
+	 * This can be a destructive action depending on the implementation.
+	 *
+	 * @see RTE_MAP_FORCE_ADDRESS for description of possible consequences
+	 *      (although implementations are not required to use it).
+	 */
+	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to eal_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+		size_t page_sz, int flags, int reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -493,4 +512,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
 int
 eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address, which must be page-aligned.
+ *  The system might not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options, a combination of eal_mem_reserve_flags.
+ * @returns
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If *virt* and *size* describe a part of the reserved region,
+ * only this part of the region is freed (accurately up to the system
+ * page size). If *virt* points to allocated memory, *size* must match
+ * the one specified on allocation. The behavior is undefined
+ * if the memory pointed to by *virt* was obtained from a source
+ * other than those listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void
+eal_mem_free(void *virt, size_t size);
+
+/**
+ * Configure memory region inclusion into core dumps.
+ *
+ * @param virt
+ *  Starting address of the region.
+ * @param size
+ *  Size of the region.
+ * @param dump
+ *  True to include memory into core dumps, false to exclude.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index 0f8741d96..2374ba0b7 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
diff --git a/lib/librte_eal/include/rte_eal_memory.h b/lib/librte_eal/include/rte_eal_memory.h
new file mode 100644
index 000000000..0c5ef309d
--- /dev/null
+++ b/lib/librte_eal/include/rte_eal_memory.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+/** @file Memory management wrappers used across DPDK. */
+
+/** Memory protection flags. */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/** Additional flags for memory mapping. */
+enum rte_map_flags {
+	/** Changes to the mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/**
+	 * Force mapping to the requested address. This flag should be used
+	 * with caution, because to fulfill the request implementation
+	 * may remove all other mappings in the requested region. However,
+	 * it is not required to do so, thus mapping with this flag may fail.
+	 */
+	RTE_MAP_FORCE_ADDRESS = 1 << 3
+};
+
+/**
+ * Map a portion of an opened file or the page file into memory.
+ *
+ * This function is similar to POSIX mmap(3) with common MAP_ANONYMOUS
+ * extension, except for the return value.
+ *
+ * @param requested_addr
+ *  Desired virtual address for mapping. Can be NULL to let OS choose.
+ * @param size
+ *  Size of the mapping in bytes.
+ * @param prot
+ *  Protection flags, a combination of rte_mem_prot values.
+ * @param flags
+ *  Additional mapping flags, a combination of rte_map_flags.
+ * @param fd
+ *  Mapped file descriptor. Can be negative for anonymous mapping.
+ * @param offset
+ *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
+ * @return
+ *  Mapped address or NULL on failure and rte_errno is set to OS error.
+ */
+__rte_internal
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_internal
+int
+rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Page size in bytes.
+ */
+__rte_internal
+size_t
+rte_mem_page_size(void);
+
+/**
+ * Lock in physical memory all pages crossed by the address region.
+ *
+ * @param virt
+ *   Base virtual address of the region.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @see rte_mem_page_size() to retrieve the page size.
+ * @see rte_mem_lock_page() to lock an entire single page.
+ */
+__rte_internal
+int
+rte_mem_lock(const void *virt, size_t size);
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 331489f99..8febf2212 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index 2c717f8bd..bf29b83c6 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 mapped:
 	munmap(addr, alloc_sz);
 unmapped:
-	flags = MAP_FIXED;
+	flags = EAL_RESERVE_FORCE_ADDRESS;
 	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
 	if (new_addr != addr) {
 		if (new_addr != NULL)
@@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
-		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
+	eal_mem_set_dump(ms->addr, ms->len, false);
 
 	exit_early = false;
 
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d8038749a..196eef5af 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -387,3 +387,12 @@ EXPERIMENTAL {
 	rte_trace_regexp;
 	rte_trace_save;
 };
+
+INTERNAL {
+	global:
+
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_page_size;
+	rte_mem_unmap;
+};
diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
new file mode 100644
index 000000000..4dd891667
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix_memory.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_eal_memory.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+
+#include "eal_private.h"
+
+#ifdef RTE_EXEC_ENV_LINUX
+#define EAL_DONTDUMP MADV_DONTDUMP
+#define EAL_DODUMP   MADV_DODUMP
+#elif defined RTE_EXEC_ENV_FREEBSD
+#define EAL_DONTDUMP MADV_NOCORE
+#define EAL_DODUMP   MADV_CORE
+#else
+#error "madvise doesn't support this OS"
+#endif
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+#ifdef MAP_HUGETLB
+		sys_flags |= MAP_HUGETLB;
+#else
+		rte_errno = ENOTSUP;
+		return NULL;
+#endif
+	}
+
+	if (flags & EAL_RESERVE_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
+	int ret = madvise(virt, size, flags);
+	if (ret) {
+		rte_errno = errno;
+		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
+				virt, size, flags, strerror(errno));
+	}
+	return ret;
+}
+
+static int
+mem_rte_to_sys_prot(int prot)
+{
+	int sys_prot = PROT_NONE;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	int sys_flags = 0;
+	int sys_prot;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static size_t page_size;
+
+	if (!page_size)
+		page_size = sysconf(_SC_PAGESIZE);
+
+	return page_size;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	int ret = mlock(virt, size);
+	if (ret)
+		rte_errno = errno;
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index 21029ba1a..e733910a1 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal_file.c',
+	'eal_unix_memory.c',
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v7 00/11] Windows " Dmitry Kozlyuk
                             ` (2 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-09 11:14             ` Tal Shnaiderman
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
                             ` (7 subsequent siblings)
  11 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson

All supported OS create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
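
For illustration, the resulting call sequence in an EAL (a sketch; msl,
page_sz, n_segs and the rest stand for the caller's values):

    /* Initialize bookkeeping, reserve VA space, then mark pages as used. */
    if (eal_memseg_list_init(msl, page_sz, n_segs, socket_id,
            type_msl_idx, heap) < 0)
        return -1;
    if (eal_memseg_list_alloc(msl, reserve_flags) < 0)
        return -1;
    eal_memseg_list_populate(msl, msl->base_va, n_segs);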

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c |  97 ++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  62 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       |  94 ++++--------------
 lib/librte_eal/linux/eal_memory.c         | 115 +++++-----------------
 4 files changed, 200 insertions(+), 168 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index f9fbd3e4e..3325d8c35 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -25,6 +25,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -182,6 +183,102 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+		uint64_t page_sz, int n_segs, int socket_id, bool heap)
+{
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL,
+		"Memseg list allocated at socket %i, page size 0x%zxkB\n",
+		socket_id, (size_t)page_sz >> 10);
+
+	return 0;
+}
+
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+
+	return eal_memseg_list_init_named(
+		msl, name, page_sz, n_segs, socket_id, heap);
+}
+
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
+{
+	size_t page_sz, mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(
+		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
+	if (addr == NULL) {
+#ifndef RTE_EXEC_ENV_WINDOWS
+		/* The hint would be misleading on Windows, but this function
+		 * is called from many places, including common code,
+		 * so don't duplicate the message.
+		 */
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+		else
+			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+#endif
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	RTE_LOG(DEBUG, EAL, "VA reserved for memseg list at %p, size %zx\n",
+			addr, mem_sz);
+
+	return 0;
+}
+
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
+{
+	size_t page_sz = msl->page_sz;
+	int i;
+
+	for (i = 0; i < n_segs; i++) {
+		struct rte_fbarray *arr = &msl->memseg_arr;
+		struct rte_memseg *ms = rte_fbarray_get(arr, i);
+
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			ms->iova = (uintptr_t)addr;
+		else
+			ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, i);
+
+		addr = RTE_PTR_ADD(addr, page_sz);
+	}
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 3173f1d67..1ec51b2eb 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -254,6 +254,68 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int reserve_flags);
 
+/**
+ * Initialize a memory segment list and create its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param name
+ *  Name for the backing storage.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+	uint64_t page_sz, int n_segs, int socket_id, bool heap);
+
+/**
+ * Initialize memory segment list and create its backing storage
+ * with a name corresponding to MSL parameters.
+ *
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ *
+ * @see eal_memseg_list_init_named for remaining parameters description.
+ */
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
+/**
+ * Reserve VA space for a memory segment list
+ * previously initialized with eal_memseg_list_init().
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param reserve_flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
+
+/**
+ * Populate MSL, each segment is one page long.
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param addr
+ *  Starting address of list segments.
+ * @param n_segs
+ *  Number of segments to populate.
+ */
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5bc2da160..29c3ed5a9 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -66,53 +66,34 @@ rte_eal_hugepage_init(void)
 		struct rte_memseg_list *msl;
 		struct rte_fbarray *arr;
 		struct rte_memseg *ms;
-		uint64_t page_sz;
+		uint64_t mem_sz, page_sz;
 		int n_segs, cur_seg;
 
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-				sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
-		addr = mmap(NULL, internal_config.memory,
-				PROT_READ | PROT_WRITE,
+		addr = mmap(NULL, mem_sz, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (addr == MAP_FAILED) {
 			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
 					strerror(errno));
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->len = internal_config.memory;
-		msl->socket_id = 0;
-		msl->heap = 1;
-
-		/* populate memsegs. each memseg is 1 page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->len = page_sz;
-			ms->socket_id = 0;
+		msl->base_va = addr;
+		msl->len = mem_sz;
 
-			rte_fbarray_set_used(arr, cur_seg);
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			addr = RTE_PTR_ADD(addr, page_sz);
-		}
 		return 0;
 	}
 
@@ -336,64 +317,25 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
 	int flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
@@ -479,7 +421,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_init(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
@@ -487,7 +429,7 @@ memseg_primary_init(void)
 			total_type_mem = total_segs * hugepage_sz;
 			type_msl_idx++;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				return -1;
 			}
@@ -518,7 +460,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..8b5fe613e 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, 0);
 }
 
 /*
@@ -1009,12 +969,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_init(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_alloc(msl) < 0)
 				return -1;
 		}
 	}
@@ -1323,8 +1283,6 @@ eal_legacy_hugepage_init(void)
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
 	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	struct rte_fbarray *arr;
-	struct rte_memseg *ms;
 
 	uint64_t memory[RTE_MAX_NUMA_NODES];
 
@@ -1343,7 +1301,7 @@ eal_legacy_hugepage_init(void)
 		void *prealloc_addr;
 		size_t mem_sz;
 		struct rte_memseg_list *msl;
-		int n_segs, cur_seg, fd, flags;
+		int n_segs, fd, flags;
 #ifdef MEMFD_SUPPORTED
 		int memfd;
 #endif
@@ -1358,12 +1316,12 @@ eal_legacy_hugepage_init(void)
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-					sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
@@ -1400,16 +1358,10 @@ eal_legacy_hugepage_init(void)
 		/* preallocate address space for the memory, so that it can be
 		 * fit into the DMA mask.
 		 */
-		mem_sz = internal_config.memory;
-		prealloc_addr = eal_get_virtual_area(
-				NULL, &mem_sz, page_sz, 0, 0);
-		if (prealloc_addr == NULL) {
-			RTE_LOG(ERR, EAL,
-					"%s: reserving memory area failed: "
-					"%s\n",
-					__func__, strerror(errno));
+		if (eal_memseg_list_alloc(msl, 0))
 			return -1;
-		}
+
+		prealloc_addr = msl->base_va;
 		addr = mmap(prealloc_addr, mem_sz, PROT_READ | PROT_WRITE,
 				flags | MAP_FIXED, fd, 0);
 		if (addr == MAP_FAILED || addr != prealloc_addr) {
@@ -1418,11 +1370,6 @@ eal_legacy_hugepage_init(void)
 			munmap(prealloc_addr, mem_sz);
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->socket_id = 0;
-		msl->len = mem_sz;
-		msl->heap = 1;
 
 		/* we're in single-file segments mode, so only the segment list
 		 * fd needs to be set up.
@@ -1434,24 +1381,8 @@ eal_legacy_hugepage_init(void)
 			}
 		}
 
-		/* populate memsegs. each memseg is one page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->socket_id = 0;
-			ms->len = page_sz;
-
-			rte_fbarray_set_used(arr, cur_seg);
-
-			addr = RTE_PTR_ADD(addr, (size_t)page_sz);
-		}
 		if (mcfg->dma_maskbits &&
 		    rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
 			RTE_LOG(ERR, EAL,
@@ -2191,7 +2122,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_init(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2131,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_alloc(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2326,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_init(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2364,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4
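
Taken together, the hunks above reduce a no-huge memseg list setup to
the sequence below. This is a condensed sketch for illustration, not
code from the patch: error logging is trimmed and the helper name
nohuge_msl_setup() is invented.

#include <sys/mman.h>

#include <rte_memory.h>

#include "eal_private.h"

/* Mirror of the FreeBSD no-huge path: init, map, populate. */
static int
nohuge_msl_setup(struct rte_memseg_list *msl, size_t mem_sz)
{
	uint64_t page_sz = RTE_PGSIZE_4K;
	int n_segs = mem_sz / page_sz;
	void *addr;

	/* Create the backing fbarray under an explicit name. */
	if (eal_memseg_list_init_named(
			msl, "nohugemem", page_sz, n_segs, 0, true))
		return -1;

	/* Back the list with anonymous memory. */
	addr = mmap(NULL, mem_sz, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	msl->base_va = addr;
	msl->len = mem_sz;

	/* Mark one page-long segment per page as used. */
	eal_memseg_list_populate(msl, addr, n_segs);
	return 0;
}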


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (3 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 06/11] trace: add size_t field emitter Dmitry Kozlyuk
                             ` (6 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Code in Linux EAL that supports dynamic memory allocation (as opposed to
static allocation used by FreeBSD) is not OS-dependent and can be reused
by Windows EAL. Move such code to a file compiled only for the OSes that
require it. Keep Anatoly Burakov as the maintainer of the extracted code.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
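
After the move, both Linux entry points shrink to thin wrappers over
the extracted helpers. A condensed sketch of the resulting call paths
(the full hunks appear below):

/* Linux EAL after extraction: delegate to common eal_dynmem_* code. */
static int __rte_unused
memseg_primary_init(void)
{
	return eal_dynmem_memseg_lists_init();
}

int
rte_eal_hugepage_init(void)
{
	/* Legacy mode keeps its own path; dynamic mode is now common. */
	return internal_config.legacy_mem ?
			eal_legacy_hugepage_init() :
			eal_dynmem_hugepage_init();
}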
 MAINTAINERS                               |   1 +
 lib/librte_eal/common/eal_common_dynmem.c | 521 +++++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  43 +-
 lib/librte_eal/common/meson.build         |   4 +
 lib/librte_eal/freebsd/eal_memory.c       |  12 +-
 lib/librte_eal/linux/Makefile             |   1 +
 lib/librte_eal/linux/eal_memory.c         | 523 +---------------------
 7 files changed, 582 insertions(+), 523 deletions(-)
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d9aff26d..a1722ca73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -208,6 +208,7 @@ F: lib/librte_eal/include/rte_fbarray.h
 F: lib/librte_eal/include/rte_mem*
 F: lib/librte_eal/include/rte_malloc.h
 F: lib/librte_eal/common/*malloc*
+F: lib/librte_eal/common/eal_common_dynmem.c
 F: lib/librte_eal/common/eal_common_fbarray.c
 F: lib/librte_eal/common/eal_common_mem*
 F: lib/librte_eal/common/eal_hugepages.h
diff --git a/lib/librte_eal/common/eal_common_dynmem.c b/lib/librte_eal/common/eal_common_dynmem.c
new file mode 100644
index 000000000..6b07672d0
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_dynmem.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation.
+ * Copyright(c) 2013 6WIND S.A.
+ */
+
+#include <inttypes.h>
+#include <string.h>
+
+#include <rte_log.h>
+#include <rte_string_fns.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+
+/** @file Functions common to EALs that support dynamic memory allocation. */
+
+int
+eal_dynmem_memseg_lists_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
+			/* we can still sort pages by socket in legacy mode */
+			if (!internal_config.legacy_mem && socket_id > 0)
+				break;
+#endif
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how many segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (eal_memseg_list_init(msl, pagesz, n_segs,
+					socket_id, cur_seglist, true))
+				goto out;
+
+			if (eal_memseg_list_alloc(msl, 0)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int __rte_unused
+hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct hugepage_info *hpi = arg;
+
+	if (msl->page_sz != hpi->hugepage_sz)
+		return 0;
+
+	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
+	return 0;
+}
+
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+int
+eal_dynmem_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+#ifndef RTE_ARCH_64
+		struct hugepage_info dummy;
+		unsigned int i;
+#endif
+		/* also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+
+#ifndef RTE_ARCH_64
+		/* for 32-bit, limit number of pages on socket to whatever we've
+		 * preallocated, as we cannot allocate more.
+		 */
+		memset(&dummy, 0, sizeof(dummy));
+		dummy.hugepage_sz = hpi->hugepage_sz;
+		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
+			return -1;
+
+		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
+			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
+					dummy.num_pages[i]);
+		}
+#endif
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
+		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
+
+	/* calculate final number of pages */
+	if (eal_dynmem_calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M "
+				"on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+__rte_unused /* function is unused on 32-bit builds */
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+int
+eal_dynmem_calc_num_pages_per_socket(
+	uint64_t *memory, struct hugepage_info *hp_info,
+	struct hugepage_info *hp_used, unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+#ifdef RTE_ARCH_64
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from CPU mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+#else
+		/* in 32-bit mode, allocate all of the memory only on master
+		 * lcore socket
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			struct rte_config *cfg = rte_eal_get_configuration();
+			unsigned int master_lcore_socket;
+
+			master_lcore_socket =
+				rte_lcore_to_socket_id(cfg->master_lcore);
+
+			if (master_lcore_socket != socket)
+				continue;
+
+			/* Update sizes */
+			memory[socket] = total_size;
+			break;
+		}
+#endif
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skip if the memory on a specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			rte_strscpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so lets see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int)(internal_config.memory / 0x100000);
+		available = requested - (unsigned int)(total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 1ec51b2eb..2a780f513 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -13,6 +13,8 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+
 /**
  * Structure storing internal configuration (per-lcore)
  */
@@ -316,6 +318,45 @@ eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
 void
 eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
 
+/**
+ * Distribute available memory between MSLs.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_memseg_lists_init(void);
+
+/**
+ * Preallocate hugepages for dynamic allocation.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_hugepage_init(void);
+
+/**
+ * Given the list of hugepage sizes and the number of pages thereof,
+ * calculate the best number of pages of each size to fulfill the request
+ * for RAM on each NUMA node.
+ *
+ * @param memory
+ *  Amounts of memory requested for each NUMA node of RTE_MAX_NUMA_NODES.
+ * @param hp_info
+ *  Information about hugepages of different size.
+ * @param hp_used
+ *  Receives information about used hugepages of each size.
+ * @param num_hp_info
+ *  Number of elements in hp_info and hp_used.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_calc_num_pages_per_socket(
+		uint64_t *memory, struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used, unsigned int num_hp_info);
+
 /**
  * Get cpu core_id.
  *
@@ -595,7 +636,7 @@ void *
 eal_mem_reserve(void *requested_addr, size_t size, int flags);
 
 /**
- * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ * Free memory obtained by eal_mem_reserve(), which may have been
+ * allocated since.
  *
  * If *virt* and *size* describe a part of the reserved region,
  * only this part of the region is freed (accurately up to the system
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 55aaeb18e..d91c22220 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -56,3 +56,7 @@ sources += files(
 	'rte_reciprocal.c',
 	'rte_service.c',
 )
+
+if is_linux
+	sources += files('eal_common_dynmem.c')
+endif
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 29c3ed5a9..7106b8b84 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -317,14 +317,6 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
-}
-
 static int
 memseg_list_alloc(struct rte_memseg_list *msl)
 {
@@ -421,8 +413,8 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (memseg_list_init(msl, hugepage_sz, n_segs,
-					0, type_msl_idx))
+			if (eal_memseg_list_init(msl, hugepage_sz, n_segs,
+					0, type_msl_idx, false))
 				return -1;
 
 			total_segs += msl->memseg_arr.len;
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 8febf2212..07ce643ba 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -50,6 +50,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_timer.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memzone.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_launch.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_dynmem.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_mcfg.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memalloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memory.c
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 8b5fe613e..12d72f726 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -812,20 +812,6 @@ memseg_list_free(struct rte_memseg_list *msl)
 	return 0;
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
-}
-
-static int
-memseg_list_alloc(struct rte_memseg_list *msl)
-{
-	return eal_memseg_list_alloc(msl, 0);
-}
-
 /*
  * Our VA space is not preallocated yet, so preallocate it here. We need to know
  * how many segments there are in order to map all pages into one address space,
@@ -969,12 +955,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (memseg_list_init(msl, page_sz, n_segs, socket,
-						msl_idx) < 0)
+			if (eal_memseg_list_init(msl, page_sz, n_segs,
+					socket, msl_idx, true) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (memseg_list_alloc(msl) < 0)
+			if (eal_memseg_list_alloc(msl, 0) < 0)
 				return -1;
 		}
 	}
@@ -1045,182 +1031,6 @@ remap_needed_hugepages(struct hugepage_file *hugepages, int n_pages)
 	return 0;
 }
 
-__rte_unused /* function is unused on 32-bit builds */
-static inline uint64_t
-get_socket_mem_size(int socket)
-{
-	uint64_t size = 0;
-	unsigned i;
-
-	for (i = 0; i < internal_config.num_hugepage_sizes; i++){
-		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		size += hpi->hugepage_sz * hpi->num_pages[socket];
-	}
-
-	return size;
-}
-
-/*
- * This function is a NUMA-aware equivalent of calc_num_pages.
- * It takes in the list of hugepage sizes and the
- * number of pages thereof, and calculates the best number of
- * pages of each size to fulfill the request for <memory> ram
- */
-static int
-calc_num_pages_per_socket(uint64_t * memory,
-		struct hugepage_info *hp_info,
-		struct hugepage_info *hp_used,
-		unsigned num_hp_info)
-{
-	unsigned socket, j, i = 0;
-	unsigned requested, available;
-	int total_num_pages = 0;
-	uint64_t remaining_mem, cur_mem;
-	uint64_t total_mem = internal_config.memory;
-
-	if (num_hp_info == 0)
-		return -1;
-
-	/* if specific memory amounts per socket weren't requested */
-	if (internal_config.force_sockets == 0) {
-		size_t total_size;
-#ifdef RTE_ARCH_64
-		int cpu_per_socket[RTE_MAX_NUMA_NODES];
-		size_t default_size;
-		unsigned lcore_id;
-
-		/* Compute number of cores per socket */
-		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
-		RTE_LCORE_FOREACH(lcore_id) {
-			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
-		}
-
-		/*
-		 * Automatically spread requested memory amongst detected sockets according
-		 * to number of cores from cpu mask present on each socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-
-			/* Set memory amount per socket */
-			default_size = (internal_config.memory * cpu_per_socket[socket])
-					/ rte_lcore_count();
-
-			/* Limit to maximum available memory on socket */
-			default_size = RTE_MIN(default_size, get_socket_mem_size(socket));
-
-			/* Update sizes */
-			memory[socket] = default_size;
-			total_size -= default_size;
-		}
-
-		/*
-		 * If some memory is remaining, try to allocate it by getting all
-		 * available memory from sockets, one after the other
-		 */
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-			/* take whatever is available */
-			default_size = RTE_MIN(get_socket_mem_size(socket) - memory[socket],
-					       total_size);
-
-			/* Update sizes */
-			memory[socket] += default_size;
-			total_size -= default_size;
-		}
-#else
-		/* in 32-bit mode, allocate all of the memory only on master
-		 * lcore socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
-				socket++) {
-			struct rte_config *cfg = rte_eal_get_configuration();
-			unsigned int master_lcore_socket;
-
-			master_lcore_socket =
-				rte_lcore_to_socket_id(cfg->master_lcore);
-
-			if (master_lcore_socket != socket)
-				continue;
-
-			/* Update sizes */
-			memory[socket] = total_size;
-			break;
-		}
-#endif
-	}
-
-	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0; socket++) {
-		/* skips if the memory on specific socket wasn't requested */
-		for (i = 0; i < num_hp_info && memory[socket] != 0; i++){
-			strlcpy(hp_used[i].hugedir, hp_info[i].hugedir,
-				sizeof(hp_used[i].hugedir));
-			hp_used[i].num_pages[socket] = RTE_MIN(
-					memory[socket] / hp_info[i].hugepage_sz,
-					hp_info[i].num_pages[socket]);
-
-			cur_mem = hp_used[i].num_pages[socket] *
-					hp_used[i].hugepage_sz;
-
-			memory[socket] -= cur_mem;
-			total_mem -= cur_mem;
-
-			total_num_pages += hp_used[i].num_pages[socket];
-
-			/* check if we have met all memory requests */
-			if (memory[socket] == 0)
-				break;
-
-			/* check if we have any more pages left at this size, if so
-			 * move on to next size */
-			if (hp_used[i].num_pages[socket] == hp_info[i].num_pages[socket])
-				continue;
-			/* At this point we know that there are more pages available that are
-			 * bigger than the memory we want, so lets see if we can get enough
-			 * from other page sizes.
-			 */
-			remaining_mem = 0;
-			for (j = i+1; j < num_hp_info; j++)
-				remaining_mem += hp_info[j].hugepage_sz *
-				hp_info[j].num_pages[socket];
-
-			/* is there enough other memory, if not allocate another page and quit */
-			if (remaining_mem < memory[socket]){
-				cur_mem = RTE_MIN(memory[socket],
-						hp_info[i].hugepage_sz);
-				memory[socket] -= cur_mem;
-				total_mem -= cur_mem;
-				hp_used[i].num_pages[socket]++;
-				total_num_pages++;
-				break; /* we are done with this socket*/
-			}
-		}
-		/* if we didn't satisfy all memory requirements per socket */
-		if (memory[socket] > 0 &&
-				internal_config.socket_mem[socket] != 0) {
-			/* to prevent icc errors */
-			requested = (unsigned) (internal_config.socket_mem[socket] /
-					0x100000);
-			available = requested -
-					((unsigned) (memory[socket] / 0x100000));
-			RTE_LOG(ERR, EAL, "Not enough memory available on socket %u! "
-					"Requested: %uMB, available: %uMB\n", socket,
-					requested, available);
-			return -1;
-		}
-	}
-
-	/* if we didn't satisfy total memory requirements */
-	if (total_mem > 0) {
-		requested = (unsigned) (internal_config.memory / 0x100000);
-		available = requested - (unsigned) (total_mem / 0x100000);
-		RTE_LOG(ERR, EAL, "Not enough memory available! Requested: %uMB,"
-				" available: %uMB\n", requested, available);
-		return -1;
-	}
-	return total_num_pages;
-}
-
 static inline size_t
 eal_get_hugepage_mem_size(void)
 {
@@ -1524,7 +1334,7 @@ eal_legacy_hugepage_init(void)
 		memory[i] = internal_config.socket_mem[i];
 
 	/* calculate final number of pages */
-	nr_hugepages = calc_num_pages_per_socket(memory,
+	nr_hugepages = eal_dynmem_calc_num_pages_per_socket(memory,
 			internal_config.hugepage_info, used_hp,
 			internal_config.num_hugepage_sizes);
 
@@ -1651,140 +1461,6 @@ eal_legacy_hugepage_init(void)
 	return -1;
 }
 
-static int __rte_unused
-hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
-{
-	struct hugepage_info *hpi = arg;
-
-	if (msl->page_sz != hpi->hugepage_sz)
-		return 0;
-
-	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
-	return 0;
-}
-
-static int
-limits_callback(int socket_id, size_t cur_limit, size_t new_len)
-{
-	RTE_SET_USED(socket_id);
-	RTE_SET_USED(cur_limit);
-	RTE_SET_USED(new_len);
-	return -1;
-}
-
-static int
-eal_hugepage_init(void)
-{
-	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	uint64_t memory[RTE_MAX_NUMA_NODES];
-	int hp_sz_idx, socket_id;
-
-	memset(used_hp, 0, sizeof(used_hp));
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-#ifndef RTE_ARCH_64
-		struct hugepage_info dummy;
-		unsigned int i;
-#endif
-		/* also initialize used_hp hugepage sizes in used_hp */
-		struct hugepage_info *hpi;
-		hpi = &internal_config.hugepage_info[hp_sz_idx];
-		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
-
-#ifndef RTE_ARCH_64
-		/* for 32-bit, limit number of pages on socket to whatever we've
-		 * preallocated, as we cannot allocate more.
-		 */
-		memset(&dummy, 0, sizeof(dummy));
-		dummy.hugepage_sz = hpi->hugepage_sz;
-		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
-			return -1;
-
-		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
-			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
-					dummy.num_pages[i]);
-		}
-#endif
-	}
-
-	/* make a copy of socket_mem, needed for balanced allocation. */
-	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
-		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
-
-	/* calculate final number of pages */
-	if (calc_num_pages_per_socket(memory,
-			internal_config.hugepage_info, used_hp,
-			internal_config.num_hugepage_sizes) < 0)
-		return -1;
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
-				socket_id++) {
-			struct rte_memseg **pages;
-			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
-			unsigned int num_pages = hpi->num_pages[socket_id];
-			unsigned int num_pages_alloc;
-
-			if (num_pages == 0)
-				continue;
-
-			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %" PRIu64 "M on socket %i\n",
-				num_pages, hpi->hugepage_sz >> 20, socket_id);
-
-			/* we may not be able to allocate all pages in one go,
-			 * because we break up our memory map into multiple
-			 * memseg lists. therefore, try allocating multiple
-			 * times and see if we can get the desired number of
-			 * pages from multiple allocations.
-			 */
-
-			num_pages_alloc = 0;
-			do {
-				int i, cur_pages, needed;
-
-				needed = num_pages - num_pages_alloc;
-
-				pages = malloc(sizeof(*pages) * needed);
-
-				/* do not request exact number of pages */
-				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
-						needed, hpi->hugepage_sz,
-						socket_id, false);
-				if (cur_pages <= 0) {
-					free(pages);
-					return -1;
-				}
-
-				/* mark preallocated pages as unfreeable */
-				for (i = 0; i < cur_pages; i++) {
-					struct rte_memseg *ms = pages[i];
-					ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
-				}
-				free(pages);
-
-				num_pages_alloc += cur_pages;
-			} while (num_pages_alloc != num_pages);
-		}
-	}
-	/* if socket limits were specified, set them */
-	if (internal_config.force_socket_limits) {
-		unsigned int i;
-		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-			uint64_t limit = internal_config.socket_limit[i];
-			if (limit == 0)
-				continue;
-			if (rte_mem_alloc_validator_register("socket-limit",
-					limits_callback, i, limit))
-				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
-		}
-	}
-	return 0;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -1943,7 +1619,7 @@ rte_eal_hugepage_init(void)
 {
 	return internal_config.legacy_mem ?
 			eal_legacy_hugepage_init() :
-			eal_hugepage_init();
+			eal_dynmem_hugepage_init();
 }
 
 int
@@ -2122,8 +1798,9 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (memseg_list_init(msl, hugepage_sz, n_segs,
-						socket_id, type_msl_idx)) {
+				if (eal_memseg_list_init(msl, hugepage_sz,
+						n_segs, socket_id, type_msl_idx,
+						true)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
 					 */
@@ -2131,7 +1808,7 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (memseg_list_alloc(msl)) {
+				if (eal_memseg_list_alloc(msl, 0)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
@@ -2162,185 +1839,7 @@ memseg_primary_init_32(void)
 static int __rte_unused
 memseg_primary_init(void)
 {
-	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	struct memtype {
-		uint64_t page_sz;
-		int socket_id;
-	} *memtypes = NULL;
-	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
-	struct rte_memseg_list *msl;
-	uint64_t max_mem, max_mem_per_type;
-	unsigned int max_seglists_per_type;
-	unsigned int n_memtypes, cur_type;
-
-	/* no-huge does not need this at all */
-	if (internal_config.no_hugetlbfs)
-		return 0;
-
-	/*
-	 * figuring out amount of memory we're going to have is a long and very
-	 * involved process. the basic element we're operating with is a memory
-	 * type, defined as a combination of NUMA node ID and page size (so that
-	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
-	 *
-	 * deciding amount of memory going towards each memory type is a
-	 * balancing act between maximum segments per type, maximum memory per
-	 * type, and number of detected NUMA nodes. the goal is to make sure
-	 * each memory type gets at least one memseg list.
-	 *
-	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
-	 *
-	 * the total amount of memory per type is limited by either
-	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
-	 * of detected NUMA nodes. additionally, maximum number of segments per
-	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
-	 * smaller page sizes, it can take hundreds of thousands of segments to
-	 * reach the above specified per-type memory limits.
-	 *
-	 * additionally, each type may have multiple memseg lists associated
-	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
-	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
-	 *
-	 * the number of memseg lists per type is decided based on the above
-	 * limits, and also taking number of detected NUMA nodes, to make sure
-	 * that we don't run out of memseg lists before we populate all NUMA
-	 * nodes with memory.
-	 *
-	 * we do this in three stages. first, we collect the number of types.
-	 * then, we figure out memory constraints and populate the list of
-	 * would-be memseg lists. then, we go ahead and allocate the memseg
-	 * lists.
-	 */
-
-	/* create space for mem types */
-	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
-	memtypes = calloc(n_memtypes, sizeof(*memtypes));
-	if (memtypes == NULL) {
-		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
-		return -1;
-	}
-
-	/* populate mem types */
-	cur_type = 0;
-	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
-			hpi_idx++) {
-		struct hugepage_info *hpi;
-		uint64_t hugepage_sz;
-
-		hpi = &internal_config.hugepage_info[hpi_idx];
-		hugepage_sz = hpi->hugepage_sz;
-
-		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
-			int socket_id = rte_socket_id_by_idx(i);
-
-#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
-			/* we can still sort pages by socket in legacy mode */
-			if (!internal_config.legacy_mem && socket_id > 0)
-				break;
-#endif
-			memtypes[cur_type].page_sz = hugepage_sz;
-			memtypes[cur_type].socket_id = socket_id;
-
-			RTE_LOG(DEBUG, EAL, "Detected memory type: "
-				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
-				socket_id, hugepage_sz);
-		}
-	}
-	/* number of memtypes could have been lower due to no NUMA support */
-	n_memtypes = cur_type;
-
-	/* set up limits for types */
-	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
-	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
-			max_mem / n_memtypes);
-	/*
-	 * limit maximum number of segment lists per type to ensure there's
-	 * space for memseg lists for all NUMA nodes with all page sizes
-	 */
-	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
-
-	if (max_seglists_per_type == 0) {
-		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
-			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-		goto out;
-	}
-
-	/* go through all mem types and create segment lists */
-	msl_idx = 0;
-	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
-		unsigned int cur_seglist, n_seglists, n_segs;
-		unsigned int max_segs_per_type, max_segs_per_list;
-		struct memtype *type = &memtypes[cur_type];
-		uint64_t max_mem_per_list, pagesz;
-		int socket_id;
-
-		pagesz = type->page_sz;
-		socket_id = type->socket_id;
-
-		/*
-		 * we need to create segment lists for this type. we must take
-		 * into account the following things:
-		 *
-		 * 1. total amount of memory we can use for this memory type
-		 * 2. total amount of memory per memseg list allowed
-		 * 3. number of segments needed to fit the amount of memory
-		 * 4. number of segments allowed per type
-		 * 5. number of segments allowed per memseg list
-		 * 6. number of memseg lists we are allowed to take up
-		 */
-
-		/* calculate how much segments we will need in total */
-		max_segs_per_type = max_mem_per_type / pagesz;
-		/* limit number of segments to maximum allowed per type */
-		max_segs_per_type = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
-		/* limit number of segments to maximum allowed per list */
-		max_segs_per_list = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
-
-		/* calculate how much memory we can have per segment list */
-		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
-				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
-
-		/* calculate how many segments each segment list will have */
-		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
-
-		/* calculate how many segment lists we can have */
-		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
-				max_mem_per_type / max_mem_per_list);
-
-		/* limit number of segment lists according to our maximum */
-		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
-
-		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
-				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
-			n_seglists, n_segs, socket_id, pagesz);
-
-		/* create all segment lists */
-		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
-			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
-				RTE_LOG(ERR, EAL,
-					"No more space in memseg lists, please increase %s\n",
-					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-				goto out;
-			}
-			msl = &mcfg->memsegs[msl_idx++];
-
-			if (memseg_list_init(msl, pagesz, n_segs,
-					socket_id, cur_seglist))
-				goto out;
-
-			if (memseg_list_alloc(msl)) {
-				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
-				goto out;
-			}
-		}
-	}
-	/* we're successful */
-	ret = 0;
-out:
-	free(memtypes);
-	return ret;
+	return eal_dynmem_memseg_lists_init();
 }
 
 static int
@@ -2364,7 +1863,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (memseg_list_alloc(msl)) {
+		if (eal_memseg_list_alloc(msl, 0)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4
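
A worked example of eal_dynmem_calc_num_pages_per_socket(), with
hypothetical numbers (assuming sizes are examined from largest to
smallest, as EAL orders hugepage_info): suppose socket 0 requests
1030 MB, with 2 x 1 GB pages and 2 x 2 MB pages available. The first
pass takes min(1030 MB / 1024 MB, 2) = one 1 GB page, leaving 6 MB.
Pages of this size remain, but the smaller sizes can cover only
2 x 2 MB = 4 MB < 6 MB, so the function takes one more 1 GB page and
stops, over-allocating rather than failing. Had 4 x 2 MB pages been
free (8 MB >= 6 MB), it would instead have moved on and taken
3 x 2 MB pages.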


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 06/11] trace: add size_t field emitter
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (4 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
                             ` (5 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, Sunil Kumar Kori,
	Olivier Matz, Andrew Rybchenko

It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
sizeof(long) == 4 and sizeof(size_t) == 8 for 64-bit programs.
Tracepoints using the "long" field emitter are therefore invalid there.
Add a dedicated field emitter for size_t and use it to store size_t
values in all existing tracepoints.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
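
To illustrate the new emitter, a hypothetical tracepoint (the
tracepoint name and arguments are invented for this sketch, and the
registration boilerplate is omitted):

/* A size_t argument must use the dedicated emitter: on LLP64 Windows,
 * rte_trace_point_emit_long() would record only 4 of its 8 bytes.
 */
RTE_TRACE_POINT(
	app_trace_buf_resize,
	RTE_TRACE_POINT_ARGS(void *buf, size_t len),
	rte_trace_point_emit_ptr(buf);
	rte_trace_point_emit_size_t(len);
)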
 lib/librte_eal/include/rte_eal_trace.h   |  8 ++++----
 lib/librte_eal/include/rte_trace_point.h |  3 +++
 lib/librte_mempool/rte_mempool_trace.h   | 10 +++++-----
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/include/rte_eal_trace.h b/lib/librte_eal/include/rte_eal_trace.h
index 1ebb2905a..bcfef0cfa 100644
--- a/lib/librte_eal/include/rte_eal_trace.h
+++ b/lib/librte_eal/include/rte_eal_trace.h
@@ -143,7 +143,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -154,7 +154,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -164,7 +164,7 @@ RTE_TRACE_POINT(
 	rte_eal_trace_mem_realloc,
 	RTE_TRACE_POINT_ARGS(size_t size, unsigned int align, int socket,
 		void *ptr),
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -183,7 +183,7 @@ RTE_TRACE_POINT(
 		unsigned int flags, unsigned int align, unsigned int bound,
 		const void *mz),
 	rte_trace_point_emit_string(name);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_int(socket_id);
 	rte_trace_point_emit_u32(flags);
 	rte_trace_point_emit_u32(align);
diff --git a/lib/librte_eal/include/rte_trace_point.h b/lib/librte_eal/include/rte_trace_point.h
index b45171275..377c2414a 100644
--- a/lib/librte_eal/include/rte_trace_point.h
+++ b/lib/librte_eal/include/rte_trace_point.h
@@ -138,6 +138,8 @@ _tp _args \
 #define rte_trace_point_emit_int(val)
 /** Tracepoint function payload for long datatype */
 #define rte_trace_point_emit_long(val)
+/** Tracepoint function payload for size_t datatype */
+#define rte_trace_point_emit_size_t(val)
 /** Tracepoint function payload for float datatype */
 #define rte_trace_point_emit_float(val)
 /** Tracepoint function payload for double datatype */
@@ -395,6 +397,7 @@ do { \
 #define rte_trace_point_emit_i8(in) __rte_trace_point_emit(in, int8_t)
 #define rte_trace_point_emit_int(in) __rte_trace_point_emit(in, int32_t)
 #define rte_trace_point_emit_long(in) __rte_trace_point_emit(in, long)
+#define rte_trace_point_emit_size_t(in) __rte_trace_point_emit(in, size_t)
 #define rte_trace_point_emit_float(in) __rte_trace_point_emit(in, float)
 #define rte_trace_point_emit_double(in) __rte_trace_point_emit(in, double)
 #define rte_trace_point_emit_ptr(in) __rte_trace_point_emit(in, uintptr_t)
diff --git a/lib/librte_mempool/rte_mempool_trace.h b/lib/librte_mempool/rte_mempool_trace.h
index e776df0a6..087c913c8 100644
--- a/lib/librte_mempool/rte_mempool_trace.h
+++ b/lib/librte_mempool/rte_mempool_trace.h
@@ -72,7 +72,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -84,8 +84,8 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(addr);
-	rte_trace_point_emit_long(len);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(len);
+	rte_trace_point_emit_size_t(pg_sz);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -126,7 +126,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(struct rte_mempool *mempool, size_t pg_sz),
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(pg_sz);
 )
 
 RTE_TRACE_POINT(
@@ -139,7 +139,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_u32(max_objs);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(obj_cb);
 	rte_trace_point_emit_ptr(obj_cb_arg);
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 07/11] eal/windows: add tracing support stubs
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (5 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 06/11] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                             ` (4 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

EAL common code depends on tracepoint calls, but the generic
implementation cannot be enabled on Windows due to missing standard
library facilities. Add stub functions to support tracepoint
compilation, so that common code does not have to include tracepoints
conditionally until proper support is added.

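For illustration, the call-site pattern this enables in common code
(taken from the eal_common_thread.c hunk below):

    /* Before: each call site guarded per OS. */
    #ifndef RTE_EXEC_ENV_WINDOWS
        __rte_trace_mem_per_thread_alloc();
    #endif

    /* After: on Windows the call resolves to a no-op stub. */
    __rte_trace_mem_per_thread_alloc();
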
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_thread.c |  5 +---
 lib/librte_eal/common/meson.build         |  1 +
 lib/librte_eal/windows/eal.c              | 34 ++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_thread.c b/lib/librte_eal/common/eal_common_thread.c
index f9f588c17..370bb1b63 100644
--- a/lib/librte_eal/common/eal_common_thread.c
+++ b/lib/librte_eal/common/eal_common_thread.c
@@ -15,9 +15,7 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 #include <rte_log.h>
-#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_trace_point.h>
-#endif
 
 #include "eal_internal_cfg.h"
 #include "eal_private.h"
@@ -169,9 +167,8 @@ static void *rte_thread_init(void *arg)
 		free(params);
 	}
 
-#ifndef RTE_EXEC_ENV_WINDOWS
 	__rte_trace_mem_per_thread_alloc();
-#endif
+
 	return start_routine(routine_arg);
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index d91c22220..4e9208129 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -14,6 +14,7 @@ if is_windows
 		'eal_common_log.c',
 		'eal_common_options.c',
 		'eal_common_thread.c',
+		'eal_common_trace_points.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index d084606a6..e7461f731 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -17,6 +17,7 @@
 #include <eal_filesystem.h>
 #include <eal_options.h>
 #include <eal_private.h>
+#include <rte_trace_point.h>
 
 #include "eal_windows.h"
 
@@ -221,7 +222,38 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
- /* Launch threads, called at application init(). */
+/* Stubs to enable EAL trace point compilation
+ * until eal_common_trace.c can be compiled.
+ */
+
+RTE_DEFINE_PER_LCORE(volatile int, trace_point_sz);
+RTE_DEFINE_PER_LCORE(void *, trace_mem);
+
+void
+__rte_trace_mem_per_thread_alloc(void)
+{
+}
+
+void
+__rte_trace_point_emit_field(size_t sz, const char *field,
+	const char *type)
+{
+	RTE_SET_USED(sz);
+	RTE_SET_USED(field);
+	RTE_SET_USED(type);
+}
+
+int
+__rte_trace_point_register(rte_trace_point_t *trace, const char *name,
+	void (*register_fn)(void))
+{
+	RTE_SET_USED(trace);
+	RTE_SET_USED(name);
+	RTE_SET_USED(register_fn);
+	return -ENOTSUP;
+}
+
+/* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (6 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                             ` (3 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

The limited version imported previously lacks at least the SLIST
macros. Import the complete file from FreeBSD, since its license
exception has already been approved by the Technical Board.

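As a minimal sketch of what the complete header enables, assuming a
hypothetical element type (the SLIST macros were missing before this
change):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct entry {
        int value;
        SLIST_ENTRY(entry) link; /* embedded singly-linked list linkage */
    };

    SLIST_HEAD(entry_head, entry);

    static void
    slist_example(void)
    {
        struct entry_head head = SLIST_HEAD_INITIALIZER(head);
        struct entry *e, *tmp;

        e = malloc(sizeof(*e));
        if (e == NULL)
            return;
        e->value = 42;
        SLIST_INSERT_HEAD(&head, e, link);

        /* The safe variant allows freeing while iterating. */
        SLIST_FOREACH_SAFE(e, &head, link, tmp) {
            printf("%d\n", e->value);
            free(e);
        }
    }
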
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 09/11] eal/windows: improve CPU and NUMA node detection
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (7 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
                             ` (2 subsequent siblings)
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups);
   see the sketch below.
3. Fix magic constants, styling issues, and compiler warnings.
4. Add an EAL private function to map DPDK socket ID to NUMA node number.

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

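The core of the processor-group handling, extracted from the patch for
illustration: a global core ID combines the group index with a bit
position in the group's KAFFINITY mask (32 or 64 bits wide):

    #define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)

    /* 'cores' is the GROUP_AFFINITY of one RelationNumaNode entry;
     * each set bit in its mask becomes one lcore.
     */
    for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
        if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
            continue;
        core_id = cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
        /* ...record an lcore with this core_id... */
    }
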
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				GetLastError());
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e. g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited by 32/64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 10/11] eal/windows: initialize hugepage info
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (8 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic

Add hugepage discovery ("large pages" in Windows terminology)
and update documentation for the required privilege setup. Only 2MB
hugepages are supported, and their number is estimated roughly,
because suitable OS APIs are either missing or unstable.
Assign myself as maintainer for the implementation file.

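A minimal sketch of what the granted privilege permits, using
documented Win32 calls (a standalone example, not part of the patch;
large pages must be reserved and committed in a single call):

    #include <stdio.h>
    #include <windows.h>

    SIZE_T sz = GetLargePageMinimum(); /* typically 2MB on x86-64 */
    void *p = VirtualAlloc(NULL, sz,
        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
    if (p == NULL) {
        /* ERROR_PRIVILEGE_NOT_HELD without SeLockMemoryPrivilege */
        fprintf(stderr, "VirtualAlloc: error %lu\n", GetLastError());
    }
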
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                            |   4 +
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 8 files changed, 177 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a1722ca73..19b818f69 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -336,6 +336,10 @@ F: lib/librte_eal/windows/
 F: lib/librte_eal/rte_eal_exports.def
 F: doc/guides/windows_gsg/
 
+Windows memory allocation
+M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
+F: lib/librte_eal/eal_hugepages.c
+
 
 Core Libraries
 --------------
diff --git a/config/meson.build b/config/meson.build
index 43ab11310..c1e80de4b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -268,6 +268,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminolocy) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. Privilege is applied upon next logon. In particular, if privilege has been
+   granted to current user, a logoff is required before it is available.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e7461f731..7c2fcc860 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -19,8 +19,11 @@
 #include <eal_private.h>
 #include <rte_trace_point.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -276,6 +279,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..61d0dcd3c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available on Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem on Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index adfc8b9b7..52978e9d7 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
 	'eal_thread.c',
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v7 11/11] eal/windows: implement basic memory management
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (9 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-06-08  7:41           ` Dmitry Kozlyuk
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
  11 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-08  7:41 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

Basic memory management supports core libraries and PMDs operating
in IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated from user mode. Multi-process mode is not
implemented and is forcefully disabled at startup. Assign myself as
maintainer for the Windows file and memory management implementation.

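For illustration, a usage sketch of the portable mapping wrappers
exported below; flag names and signatures are assumed from this patch
series and should be checked against the headers:

    size_t page_size = rte_mem_page_size();
    void *va = rte_mem_map(NULL, page_size,
        RTE_PROT_READ | RTE_PROT_WRITE,
        RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);
    if (va != NULL)
        rte_mem_unmap(va, page_size);
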
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                                   |   1 +
 config/meson.build                            |  12 +-
 doc/guides/windows_gsg/run_apps.rst           |  54 +-
 lib/librte_eal/common/meson.build             |  11 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/windows/eal.c                  |  63 +-
 lib/librte_eal/windows/eal_file.c             | 125 +++
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  75 ++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   6 +
 18 files changed, 1771 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 19b818f69..5140756b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -339,6 +339,7 @@ F: doc/guides/windows_gsg/
 Windows memory allocation
 M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
 F: lib/librte_eal/eal_hugepages.c
+F: lib/librte_eal/eal_mem*
 
 
 Core Libraries
diff --git a/config/meson.build b/config/meson.build
index c1e80de4b..d3f05f878 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -261,15 +261,21 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it by advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..78e5a614f 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to documentation in ``dpdk-kmods`` repository for details on system
+setup, driver build and installation. This driver is not signed, so signature
+checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+Compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from Elevated Command Prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+On Windows Server additional steps are required:
+
+1. From Device Manager, Action menu, select "Add legacy hardware".
+2. It will launch the "Add Hardware Wizard". Click "Next".
+3. Select second option "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, "Kernel bypass" will be shown as a device class.
+5. Select it, and click "Next".
+6. The previously installed drivers will now be installed for the
+   "Virtual to physical address translator" device.
+
+When loaded successfully, the driver is shown in *Device Manager* as *Virtual
+to physical address translator* device under *Kernel bypass* category.
+Installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 4e9208129..310844269 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -8,13 +8,24 @@ if is_windows
 		'eal_common_bus.c',
 		'eal_common_class.c',
 		'eal_common_devargs.c',
+		'eal_common_dynmem.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_string_fns.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
 		'eal_common_trace_points.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index f1b73168b..9d39e58c0 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -20,6 +20,7 @@
 #include <rte_lcore.h>
 #include <rte_common.h>
 #include <rte_spinlock.h>
+
 #include <rte_eal_trace.h>
 
 #include <rte_malloc.h>
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..e2eb24f01 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,9 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
 	rte_log
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
 	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_page_size
+	rte_mem_unmap
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 7c2fcc860..a43649abc 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -94,6 +94,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -256,7 +274,7 @@ __rte_trace_point_register(rte_trace_point_t *trace, const char *name,
 	return -ENOTSUP;
 }
 
 /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
@@ -279,6 +297,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.in_memory == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.in_memory = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -290,6 +315,42 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_file.c b/lib/librte_eal/windows/eal_file.c
new file mode 100644
index 000000000..dfbe8d311
--- /dev/null
+++ b/lib/librte_eal/windows/eal_file.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <fcntl.h>
+#include <io.h>
+#include <share.h>
+#include <sys/stat.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_file_open(const char *path, int flags)
+{
+	static const int MODE_MASK = EAL_OPEN_READONLY | EAL_OPEN_READWRITE;
+
+	int fd, ret, sys_flags;
+
+	switch (flags & MODE_MASK) {
+	case EAL_OPEN_READONLY:
+		sys_flags = _O_RDONLY;
+		break;
+	case EAL_OPEN_READWRITE:
+		sys_flags = _O_RDWR;
+		break;
+	default:
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (flags & EAL_OPEN_CREATE)
+		sys_flags |= _O_CREAT;
+
+	ret = _sopen_s(&fd, path, sys_flags, _SH_DENYNO, _S_IWRITE);
+	if (ret < 0) {
+		rte_errno = errno;
+		return -1;
+	}
+
+	return fd;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	/* SetFilePointer() only moves the file pointer;
+	 * SetEndOfFile() actually truncates or extends the file.
+	 */
+	if (!SetEndOfFile(handle)) {
+		RTE_LOG_WIN32_ERR("SetEndOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
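
The wrappers above mirror POSIX open(2)/flock(2) semantics for use by common
EAL code. A minimal usage sketch, assuming only the flags and error codes
defined in this file (the path variable and error handling are illustrative,
not part of the patch):

	int fd = eal_file_open(path, EAL_OPEN_READWRITE | EAL_OPEN_CREATE);
	if (fd < 0)
		return -1; /* rte_errno has been set by the wrapper */

	/* Try to take an exclusive lock without blocking. */
	if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0) {
		if (rte_errno == EWOULDBLOCK) {
			/* Another process holds the lock. */
		}
	}

	eal_file_lock(fd, EAL_FLOCK_UNLOCK, EAL_FLOCK_RETURN);
	close(fd);
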
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..a7452b6e1
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,441 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d", ms->addr,
+			"(size %zu) on socket %d\n", ms->addr,
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+
+		/* During commitment, memory is temporarily freed and might
+		 * be allocated by a different non-EAL thread. This is a fatal
+		 * error, because it breaks MSL assumptions.
+		 */
+		if ((addr != NULL) && (addr != requested_addr)) {
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				"allocation - MSL is not VA-contiguous!\n",
+				requested_addr);
+			return -1;
+		}
+
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only the "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx(%p)", addr);
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	if (eal_mem_decommit(addr, alloc_sz) && (rte_errno == EADDRNOTAVAIL)) {
+		/* During decommitment, memory is temporarily returned
+		 * to the system and the address may become unavailable.
+		 */
+		RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+			"allocation - MSL is not VA-contiguous!\n", addr);
+	}
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len)) {
+		if (rte_errno == EADDRNOTAVAIL) {
+			/* See alloc_seg() for explanation. */
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				"allocation - MSL is not VA-contiguous!\n",
+				ms->addr);
+		}
+		return -1;
+	}
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %u segments, "
+				"but only %u were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
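
A usage sketch of the bulk allocator above; the page size, socket, and count
are assumptions for illustration, and the call is normally made by the malloc
heap code with the memory hotplug lock held:

	struct rte_memseg *segs[8];
	int n;

	/* exact=false permits partial allocation if fewer pages are free. */
	n = eal_memalloc_alloc_seg_bulk(segs, 8, RTE_PGSIZE_2M, 0, false);
	if (n > 0)
		eal_memalloc_free_seg_bulk(segs, n);
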
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..2739da346
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,710 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <inttypes.h>
+#include <io.h>
+
+#include <rte_eal_memory.h>
+#include <rte_errno.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+#include <rte_virt2phys.h>
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ *
+ * MEM_RESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
+ */
+#ifndef MEM_RESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags,
+	MemExtendedParameterMax
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly as the function, so that user code does not depend
+ * on it being found at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
+	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
+	 */
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+
+		/* Contrary to the docs, Server 2016 is not supported. */
+		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
+			"is required for memory management\n");
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* defined(MEM_RESERVE_PLACEHOLDER) */
+
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Physical addresses are always used under Windows if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	HANDLE process;
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	process = GetCurrentProcess();
+
+	virt = VirtualAlloc2(process, requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFreeEx(process, virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	HANDLE process;
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	process = GetCurrentProcess();
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+
+		if (VirtualQueryEx(process, requested_addr, &info,
+				sizeof(info)) != sizeof(info)) {
+			RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", requested_addr);
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) && !VirtualFreeEx(
+				process, requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR(
+				"VirtualFreeEx(%p, %zu, preserve placeholder)",
+				requested_addr, size);
+			return NULL;
+		}
+
+		/* Temporarily release the region to be committed.
+		 *
+		 * There is an inherent race for this memory range
+		 * if another thread allocates memory via OS API.
+		 * However, VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)
+		 * doesn't work with MEM_LARGE_PAGES on Windows Server.
+		 */
+		if (!VirtualFreeEx(process, requested_addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				requested_addr);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAlloc2(process, requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		/* Logging may overwrite GetLastError() result. */
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, commit large pages)",
+			requested_addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((requested_addr != NULL) && (addr != requested_addr)) {
+		/* We lost the race for the requested_addr. */
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, release)", addr);
+
+		rte_errno = EADDRNOTAVAIL;
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	HANDLE process;
+	void *stub;
+	DWORD flags;
+
+	process = GetCurrentProcess();
+
+	/* Hugepages cannot be decommitted on Windows,
+	 * so free them and replace the block with a placeholder.
+	 * There is a race for VA in this block until VirtualAlloc2 call.
+	 */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	flags = MEM_RESERVE | MEM_RESERVE_PLACEHOLDER;
+	stub = VirtualAlloc2(
+		process, addr, size, flags, PAGE_NOACCESS, NULL, 0);
+	if (stub == NULL) {
+		/* We lost the race for the VA: another thread has taken
+		 * the region after the release above, so there is nothing
+		 * left to free here.
+		 */
+		rte_errno = EADDRNOTAVAIL;
+		return -1;
+	}
+
+	/* No need to join reserved regions adjacent to the freed one:
+	 * eal_mem_commit() will just pick up the page-size placeholder
+	 * created here.
+	 */
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if the region must be in reserved state but is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	HANDLE process;
+
+	process = GetCurrentProcess();
+
+	if (VirtualQueryEx(
+			process, addr, &info, sizeof(info)) != sizeof(info)) {
+		RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", addr);
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFreeEx(process, addr, size,
+			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR(
+			"VirtualFreeEx(%p, %zu, preserve placeholder)",
+			addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	RTE_SET_USED(virt);
+	RTE_SET_USED(size);
+	RTE_SET_USED(dump);
+
+	/* Windows does not dump reserved memory by default.
+	 *
+	 * There is <werapi.h> to include or exclude regions from the dump,
+	 * but this is not currently required by EAL.
+	 */
+
+	rte_errno = ENOTSUP;
+	return -1;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* There is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (!virt) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!UnmapViewOfFile(virt))
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		virt = NULL;
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* The Windows memory allocation strategy is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless the user provides an address hint.
+	 */
+	return 0;
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static SYSTEM_INFO info;
+
+	if (info.dwPageSize == 0)
+		GetSystemInfo(&info);
+
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes `void*`, work around compiler warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock(%p %#zx)", virt, size);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		EAL_LOG_NOT_IMPLEMENTED();
+		return -1;
+	}
+
+	return eal_dynmem_memseg_lists_init();
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs;
+	uint64_t mem_sz, page_sz;
+	void *addr;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	msl = &mcfg->memsegs[0];
+
+	mem_sz = internal_config.memory;
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = mem_sz / page_sz;
+
+	if (eal_memseg_list_init_named(
+			msl, "nohugemem", page_sz, n_segs, 0, true)) {
+		return -1;
+	}
+
+	addr = VirtualAlloc(
+		NULL, mem_sz, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (addr == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc(size=%#zx)", mem_sz);
+		RTE_LOG(ERR, EAL, "Cannot allocate memory\n");
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	eal_memseg_list_populate(msl, addr, n_segs);
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_dynmem_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
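
A minimal sketch of the new public mapping API implemented above, as it could
be called from OS-independent code (anonymous private mapping; error handling
elided for brevity):

	size_t sz = rte_mem_page_size();
	void *p = rte_mem_map(NULL, sz, RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);

	if (p != NULL) {
		rte_mem_lock(p, sz); /* pin the page in physical memory */
		rte_mem_unmap(p, sz);
	}
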
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, the stub must log a warning, and a
+ * comment must document what requires the success emulation.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Non-stub function succeeds if multi-process is not supported. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* Common memory allocator depends on this function success. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..caabffedf 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,4 +52,63 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with eal_mem_reserve()
+ * or decommitted from hugepages by eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and rte_errno is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit, must be the size of a page
+ *  (hugepage or regular one).
+ *
+ * The *addr* and *size* must match location and size
+ * of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..cb10d6494 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -14,6 +14,7 @@
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -36,6 +37,9 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
@@ -46,6 +50,7 @@ extern "C" {
 typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
+
 static inline int
 asprintf(char **buffer, const char *format, ...)
 {
@@ -72,6 +77,18 @@ asprintf(char **buffer, const char *format, ...)
 	}
 	return ret;
 }
+
+static inline const char *
+eal_strerror(int code)
+{
+	static char buffer[128];
+
+	strerror_s(buffer, sizeof(buffer), code);
+	return buffer;
+}
+
+#define strerror eal_strerror
+
 #endif /* RTE_TOOLCHAIN_GCC */
 
 #ifdef __cplusplus
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: a physical address of zero (0) is reported if the input address
+ * is paged out or not mapped. However, if the input is a valid mapping
+ * of I/O port 0x0000, the output is also zero. There is no way
+ * to distinguish between these cases by the return value alone.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
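
For illustration, a user-mode translation through this interface looks
roughly as follows; this is a sketch, with the device handle located by
interface GUID as in eal_mem_virt2iova_init() above:

	static phys_addr_t
	translate(HANDLE device, void *virt)
	{
		LARGE_INTEGER phys;
		DWORD bytes_returned;

		if (!DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
				&virt, sizeof(virt), &phys, sizeof(phys),
				&bytes_returned, NULL))
			return RTE_BAD_PHYS_ADDR;

		return phys.QuadPart;
	}
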
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 52978e9d7..ded5a2b80 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,10 +6,16 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_file.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
+	'eal_memalloc.c',
+	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'fnmatch.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-09 11:14             ` Tal Shnaiderman
  2020-06-09 13:49               ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Tal Shnaiderman @ 2020-06-09 11:14 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Anatoly Burakov, Bruce Richardson

> Subject: [PATCH v7 04/11] eal/mem: extract common code for memseg list
> initialization
> 
> All supported OS create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>  lib/librte_eal/common/eal_common_memory.c |  97
> ++++++++++++++++++
>  lib/librte_eal/common/eal_private.h       |  62 ++++++++++++
>  lib/librte_eal/freebsd/eal_memory.c       |  94 ++++--------------
>  lib/librte_eal/linux/eal_memory.c         | 115 +++++-----------------
>  4 files changed, 200 insertions(+), 168 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_memory.c
> b/lib/librte_eal/common/eal_common_memory.c
> index f9fbd3e4e..3325d8c35 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c

[snip]

> +void
> +eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int
> +n_segs) {
> +	size_t page_sz = msl->page_sz;
> +	int i;
> +
> +	for (i = 0; i < n_segs; i++) {
> +		struct rte_fbarray *arr = &msl->memseg_arr;
> +		struct rte_memseg *ms = rte_fbarray_get(arr, i);

Since rte_fbarray_get() can return NULL, you should verify that ms is not NULL before dereferencing it.

> +
> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
> +			ms->iova = (uintptr_t)addr;
> +		else
> +			ms->iova = RTE_BAD_IOVA;
> +		ms->addr = addr;
> +		ms->hugepage_sz = page_sz;
> +		ms->socket_id = 0;
> +		ms->len = page_sz;
> +
> +		rte_fbarray_set_used(arr, i);
> +
> +		addr = RTE_PTR_ADD(addr, page_sz);
> +	}
> +}

[snip]

> --
> 2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-09 13:36           ` Burakov, Anatoly
  2020-06-09 14:17             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-06-09 13:36 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 03-Jun-20 12:03 AM, Dmitry Kozlyuk wrote:
> All supported OS create memory segment lists (MSL) and reserve VA space
> for them in a nearly identical way. Move common code into EAL private
> functions to reduce duplication.
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---

<snip>

> +int
> +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
> +{
> +	size_t page_sz, mem_sz;
> +	void *addr;
> +
> +	page_sz = msl->page_sz;
> +	mem_sz = page_sz * msl->memseg_arr.len;
> +
> +	addr = eal_get_virtual_area(
> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> +	if (addr == NULL) {
> +#ifndef RTE_EXEC_ENV_WINDOWS
> +		/* The hint would be misleading on Windows, but this function
> +		 * is called from many places, including common code,
> +		 * so don't duplicate the message.
> +		 */
> +		if (rte_errno == EADDRNOTAVAIL)
> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> +				(unsigned long long)mem_sz, msl->base_va);
> +		else
> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> +#endif

You're left without any error messages on Windows. How about:

const char *err_str = "Cannot reserve memory\n";
#ifndef RTE_EXEC_ENV_WINDOWS
if (rte_errno == EADDRNOTAVAIL)
    err_str = ...
#endif
RTE_LOG(ERR, EAL, err_str);

or something like that?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-09 11:14             ` Tal Shnaiderman
@ 2020-06-09 13:49               ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-06-09 13:49 UTC (permalink / raw)
  To: Tal Shnaiderman, Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader, Bruce Richardson

On 09-Jun-20 12:14 PM, Tal Shnaiderman wrote:
>> Subject: [PATCH v7 04/11] eal/mem: extract common code for memseg list
>> initialization
>>
>> All supported OS create memory segment lists (MSL) and reserve VA space
>> for them in a nearly identical way. Move common code into EAL private
>> functions to reduce duplication.
>>
>> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
>> ---
>>   lib/librte_eal/common/eal_common_memory.c |  97
>> ++++++++++++++++++
>>   lib/librte_eal/common/eal_private.h       |  62 ++++++++++++
>>   lib/librte_eal/freebsd/eal_memory.c       |  94 ++++--------------
>>   lib/librte_eal/linux/eal_memory.c         | 115 +++++-----------------
>>   4 files changed, 200 insertions(+), 168 deletions(-)
>>
>> diff --git a/lib/librte_eal/common/eal_common_memory.c
>> b/lib/librte_eal/common/eal_common_memory.c
>> index f9fbd3e4e..3325d8c35 100644
>> --- a/lib/librte_eal/common/eal_common_memory.c
>> +++ b/lib/librte_eal/common/eal_common_memory.c
> 
> [snip]
> 
>> +void
>> +eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int
>> +n_segs) {
>> +	size_t page_sz = msl->page_sz;
>> +	int i;
>> +
>> +	for (i = 0; i < n_segs; i++) {
>> +		struct rte_fbarray *arr = &msl->memseg_arr;
>> +		struct rte_memseg *ms = rte_fbarray_get(arr, i);
> 
> Since rte_fbarray_get() can return NULL, you should verify that ms is not NULL before dereferencing it.

I don't think it's necessary in this case, since this fbarray was just 
initialized with the n_segs value in the calling code.

> 
>> +
>> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
>> +			ms->iova = (uintptr_t)addr;
>> +		else
>> +			ms->iova = RTE_BAD_IOVA;
>> +		ms->addr = addr;
>> +		ms->hugepage_sz = page_sz;
>> +		ms->socket_id = 0;
>> +		ms->len = page_sz;
>> +
>> +		rte_fbarray_set_used(arr, i);
>> +
>> +		addr = RTE_PTR_ADD(addr, page_sz);
>> +	}
>> +}
> 
> [snip]
> 
>> --
>> 2.25.4
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-09 13:36           ` Burakov, Anatoly
@ 2020-06-09 14:17             ` Dmitry Kozlyuk
  2020-06-10 10:26               ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-09 14:17 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On Tue, 9 Jun 2020 14:36:10 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> On 03-Jun-20 12:03 AM, Dmitry Kozlyuk wrote:
> > All supported OS create memory segment lists (MSL) and reserve VA space
> > for them in a nearly identical way. Move common code into EAL private
> > functions to reduce duplication.
> > 
> > Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> > ---  
> 
> <snip>
> 
> > +int
> > +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
> > +{
> > +	size_t page_sz, mem_sz;
> > +	void *addr;
> > +
> > +	page_sz = msl->page_sz;
> > +	mem_sz = page_sz * msl->memseg_arr.len;
> > +
> > +	addr = eal_get_virtual_area(
> > +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> > +	if (addr == NULL) {
> > +#ifndef RTE_EXEC_ENV_WINDOWS
> > +		/* The hint would be misleading on Windows, but this function
> > +		 * is called from many places, including common code,
> > +		 * so don't duplicate the message.
> > +		 */
> > +		if (rte_errno == EADDRNOTAVAIL)
> > +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> > +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> > +				(unsigned long long)mem_sz, msl->base_va);
> > +		else
> > +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> > +#endif  
> 
> You're left without any error messages on Windows. How about:
> 
> const char *err_str = "Cannot reserve memory\n";
> #ifndef RTE_EXEC_ENV_WINDOWS
> if (rte_errno == EADDRNOTAVAIL)
>     err_str = ...
> #endif
> RTE_LOG(ERR, EAL, err_str);
> 
> or something like that?
> 

How about removing the generic error message here completely and printing more
specific messages at call sites? In fact, almost all of them already do this.
It would be more helpful in tracking down errors.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-09 14:17             ` Dmitry Kozlyuk
@ 2020-06-10 10:26               ` Burakov, Anatoly
  2020-06-10 14:31                 ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-06-10 10:26 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 09-Jun-20 3:17 PM, Dmitry Kozlyuk wrote:
> On Tue, 9 Jun 2020 14:36:10 +0100
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
> 
>> On 03-Jun-20 12:03 AM, Dmitry Kozlyuk wrote:
>>> All supported OS create memory segment lists (MSL) and reserve VA space
>>> for them in a nearly identical way. Move common code into EAL private
>>> functions to reduce duplication.
>>>
>>> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
>>> ---
>>
>> <snip>
>>
>>> +int
>>> +eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
>>> +{
>>> +	size_t page_sz, mem_sz;
>>> +	void *addr;
>>> +
>>> +	page_sz = msl->page_sz;
>>> +	mem_sz = page_sz * msl->memseg_arr.len;
>>> +
>>> +	addr = eal_get_virtual_area(
>>> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
>>> +	if (addr == NULL) {
>>> +#ifndef RTE_EXEC_ENV_WINDOWS
>>> +		/* The hint would be misleading on Windows, but this function
>>> +		 * is called from many places, including common code,
>>> +		 * so don't duplicate the message.
>>> +		 */
>>> +		if (rte_errno == EADDRNOTAVAIL)
>>> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
>>> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
>>> +				(unsigned long long)mem_sz, msl->base_va);
>>> +		else
>>> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
>>> +#endif
>>
>> You're left without any error messages on Windows. How about:
>>
>> const char *err_str = "Cannot reserve memory\n";
>> #ifndef RTE_EXEC_ENV_WINDOWS
>> if (rte_errno == EADDRNOTAVAIL)
>>      err_str = ...
>> #endif
>> RTE_LOG(ERR, EAL, err_str);
>>
>> or something like that?
>>
> 
> How about removing generic error message here completely and printing more
> specific messages at call sites? In fact, almost all of them already do this.
> It would be more helpful in tracking down errors.
> 

Agreed, let's do that :) We do pass up the rte_errno, correct? So, we 
should be able to do that.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 00/11] Windows basic memory management
  2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
                             ` (10 preceding siblings ...)
  2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-06-10 14:27           ` Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
                               ` (12 more replies)
  11 siblings, 13 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing without IOVA is available.

Testing revealed that Windows Server 2019 does not allow allocating
hugepage memory at a reserved address, despite the advertised API, so the
allocator has to temporarily free the region to be allocated. This
creates an inherent race condition. The issue is being discussed with
Microsoft privately.
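
Roughly, the workaround looks like the following simplified sketch
(plain Win32 calls; the actual patch logic differs in details):

/* Large pages cannot be committed inside an existing reservation,
 * so release the reservation first.
 */
if (!VirtualFree(addr, 0, MEM_RELEASE))
	return NULL;
/* Race window: another thread may map at addr right here. */
return VirtualAlloc2(GetCurrentProcess(), addr, size,
	MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
	PAGE_READWRITE, NULL, 0);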

New EAL public functions for memory mapping are introduced to mitigate
OS differences in DPDK libraries and applications: rte_mem_map,
rte_mem_unmap, rte_mem_lock, rte_mem_page_size.
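
A minimal usage sketch of the new wrappers (fd and len are assumed to
come from the caller):

size_t sz = RTE_ALIGN_CEIL(len, rte_mem_page_size());
void *va = rte_mem_map(NULL, sz, RTE_PROT_READ | RTE_PROT_WRITE,
	RTE_MAP_SHARED, fd, 0);
if (va == NULL)
	return -1; /* rte_errno holds the OS error */
if (rte_mem_lock(va, sz) < 0) {
	rte_mem_unmap(va, sz);
	return -1;
}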

To support common MM routines, internal wrappers for low-level memory
reservation and file management are introduced. These changes affect
Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
(suggested by Thomas).

To avoid code duplication between Linux and Windows EAL, common code
for EALs supporting dynamic memory allocation is extracted
(discussed with Anatoly Burakov in v4 thread). This is a separate
patch to ease the review, but it can be merged with the previous one.

EAL tracepoints save size_t values as long, which truncates them on
64-bit Windows, where long is only 32 bits wide. A new size_t emitter
for tracepoints is introduced (suggested by Jerin Jacob to Fady Bader,
see [1]). Also, to avoid a workaround in every file using the
tracepoints, stubs are added to Windows EAL.
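
The new emitter follows the existing tracepoint pattern, e.g. (sketch;
the tracepoint name here is hypothetical):

RTE_TRACE_POINT(
	rte_eal_trace_mem_sz, /* hypothetical tracepoint */
	RTE_TRACE_POINT_ARGS(size_t len),
	rte_trace_point_emit_size_t(len);
)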

The entire <sys/queue.h> is imported from FreeBSD, replacing the existing
partial import. There is already a license exception for this file.
The file is imported as-is, so it causes a bunch of checkpatch warnings.

[1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html

---

v8:
    * Log eal_memseg_list_alloc() failure at caller sites (Anatoly Burakov).

v7:
    * Change EAL internal file management API (Neil Horman).

v6:
    * Fix 32-bit build on x86 (CI).
    * Fix Makefile build (Anatoly Burakov, Thomas Monjalon).
    * Restore 32-bit common code (Anatoly Burakov).
    * Fix error reporting in memory management (Anatoly Burakov).
    * Add Doxygen comment for size_t tracepoint emitter (Jerin Jacob).
    * Update MAINTAINERS for new files and new code (Thomas Monjalon).
    * Rename rte_get_page_size to rte_mem_page_size.
    * Mark DPDK-only wrappers internal, move them to separate file.
    * Get rid of warnings in enabled common code with Clang on Windows.

v5:
    * Fix allocation and deallocation on Windows Server (Fady Bader).
    * Replace remaining VirtualFree with VirtualFreeEx (Ranjit Menon).
    * Fix errors in eal_get_virtual_area (Anatoly Burakov).
    * Fix error handling and documentation for rte_mem_lock (Anatoly Burakov).
    * Extract common code for EALs w/dynamic allocation (Anatoly Burakov).
    * Use POSIX value for rte_errno after rte_mem_unmap() on Windows.
    * Add stubs to use tracing functions without workarounds.

v4:
    * Rebase on ToT, drop patches merged into master.
    * Rearrange patches to split Windows code (Jerin).
    * Fix Linux and FreeBSD build with make (Ophir).
    * Use int instead of enum to hold a set of flags (Anatoly).
    * Rename eal_mem_reserve items and fix their description (Anatoly).
    * Add eal_mem_set_dump() wrapper around madvise (Anatoly).
    * Don't claim Windows Server 2016 support due to lack of API (Tal).
    * Replace enum rte_page_sizes with a set of #defines (Jerin).
    * Fix documentation, SPDX tags, logging (Thomas).

v3:
    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:
    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.



Dmitry Kozlyuk (11):
  eal: replace rte_page_sizes with a set of constants
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal/mem: extract common code for memseg list initialization
  eal/mem: extract common code for dynamic memory allocation
  trace: add size_t field emitter
  eal/windows: add tracing support stubs
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: improve CPU and NUMA node detection
  eal/windows: initialize hugepage info
  eal/windows: implement basic memory management

 MAINTAINERS                                   |   9 +
 config/meson.build                            |  12 +-
 doc/guides/rel_notes/release_20_08.rst        |   2 +
 doc/guides/windows_gsg/build_dpdk.rst         |  20 -
 doc/guides/windows_gsg/index.rst              |   1 +
 doc/guides/windows_gsg/run_apps.rst           |  95 +++
 lib/librte_eal/common/eal_common_dynmem.c     | 521 +++++++++++++
 lib/librte_eal/common/eal_common_fbarray.c    |  71 +-
 lib/librte_eal/common/eal_common_memory.c     | 156 +++-
 lib/librte_eal/common/eal_common_thread.c     |   5 +-
 lib/librte_eal/common/eal_private.h           | 254 ++++++-
 lib/librte_eal/common/meson.build             |  16 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/freebsd/Makefile               |   5 +
 lib/librte_eal/freebsd/eal_memory.c           |  98 +--
 lib/librte_eal/include/rte_eal_memory.h       |  93 +++
 lib/librte_eal/include/rte_eal_trace.h        |   8 +-
 lib/librte_eal/include/rte_memory.h           |  23 +-
 lib/librte_eal/include/rte_trace_point.h      |   3 +
 lib/librte_eal/linux/Makefile                 |   6 +
 lib/librte_eal/linux/eal_memalloc.c           |   5 +-
 lib/librte_eal/linux/eal_memory.c             | 618 +--------------
 lib/librte_eal/meson.build                    |   4 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/rte_eal_version.map            |   9 +
 lib/librte_eal/unix/eal_file.c                |  80 ++
 lib/librte_eal/unix/eal_unix_memory.c         | 152 ++++
 lib/librte_eal/unix/meson.build               |   7 +
 lib/librte_eal/windows/eal.c                  | 107 +++
 lib/librte_eal/windows/eal_file.c             | 125 +++
 lib/librte_eal/windows/eal_hugepages.c        | 108 +++
 lib/librte_eal/windows/eal_lcore.c            | 185 +++--
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  85 +++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/sys/queue.h    | 663 ++++++++++++++--
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   7 +
 lib/librte_mempool/rte_mempool_trace.h        |  10 +-
 44 files changed, 4056 insertions(+), 938 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c
 create mode 100644 lib/librte_eal/include/rte_eal_memory.h
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 01/11] eal: replace rte_page_sizes with a set of constants
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                               ` (11 subsequent siblings)
  12 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, John McNamara,
	Marko Kovacevic, Anatoly Burakov

Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
Enum rte_page_sizes has members valued above this limit, which get
wrapped to zero, resulting in a compilation error (duplicate values in
enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.

Remove rte_page_sizes and replace its values with #define's.
This enumeration is not used in public API, so there's no ABI breakage.
Announce API changes for 20.08 in documentation.
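
Illustration of the described behavior (not part of the patch):

/* With MS ABI the underlying type is int, so both initializers
 * below truncate to 0, making the enum values duplicate.
 */
enum rte_page_sizes {
	RTE_PGSIZE_4G  = 1ULL << 32, /* wraps to 0 */
	RTE_PGSIZE_16G = 1ULL << 34  /* wraps to 0 again: error */
};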

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 doc/guides/rel_notes/release_20_08.rst |  2 ++
 lib/librte_eal/include/rte_memory.h    | 23 ++++++++++-------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index 39064afbe..2041a29b9 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -85,6 +85,8 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* ``rte_page_sizes`` enumeration is replaced with ``RTE_PGSIZE_xxx`` defines.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..65374d53a 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -24,19 +24,16 @@ extern "C" {
 #include <rte_config.h>
 #include <rte_fbarray.h>
 
-__extension__
-enum rte_page_sizes {
-	RTE_PGSIZE_4K    = 1ULL << 12,
-	RTE_PGSIZE_64K   = 1ULL << 16,
-	RTE_PGSIZE_256K  = 1ULL << 18,
-	RTE_PGSIZE_2M    = 1ULL << 21,
-	RTE_PGSIZE_16M   = 1ULL << 24,
-	RTE_PGSIZE_256M  = 1ULL << 28,
-	RTE_PGSIZE_512M  = 1ULL << 29,
-	RTE_PGSIZE_1G    = 1ULL << 30,
-	RTE_PGSIZE_4G    = 1ULL << 32,
-	RTE_PGSIZE_16G   = 1ULL << 34,
-};
+#define RTE_PGSIZE_4K   (1ULL << 12)
+#define RTE_PGSIZE_64K  (1ULL << 16)
+#define RTE_PGSIZE_256K (1ULL << 18)
+#define RTE_PGSIZE_2M   (1ULL << 21)
+#define RTE_PGSIZE_16M  (1ULL << 24)
+#define RTE_PGSIZE_256M (1ULL << 28)
+#define RTE_PGSIZE_512M (1ULL << 29)
+#define RTE_PGSIZE_1G   (1ULL << 30)
+#define RTE_PGSIZE_4G   (1ULL << 32)
+#define RTE_PGSIZE_16G  (1ULL << 34)
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 02/11] eal: introduce internal wrappers for file operations
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-11 17:13               ` Thomas Monjalon
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
                               ` (10 subsequent siblings)
  12 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Introduce OS-independent wrappers in order to support common EAL code
on Unix and Windows:

* eal_file_open: open or create a file.
* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
which is intended for common code between the two. These thin wrappers
require no special maintenance.

Common code supporting multi-process doesn't use the new wrappers,
because it is inherently Unix-specific and would impose excessive
requirements on the wrappers.
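
A typical usage sketch based on the signatures introduced below
(path and size are assumed to come from the caller):

int fd = eal_file_open(path, EAL_OPEN_CREATE | EAL_OPEN_READWRITE);
if (fd < 0)
	return -1; /* rte_errno is set */
if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0 ||
		eal_file_truncate(fd, size) < 0) {
	close(fd);
	return -1;
}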

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                                |  3 +
 lib/librte_eal/common/eal_common_fbarray.c | 31 ++++-----
 lib/librte_eal/common/eal_private.h        | 73 ++++++++++++++++++++
 lib/librte_eal/freebsd/Makefile            |  4 ++
 lib/librte_eal/linux/Makefile              |  4 ++
 lib/librte_eal/meson.build                 |  4 ++
 lib/librte_eal/unix/eal_file.c             | 80 ++++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |  6 ++
 8 files changed, 186 insertions(+), 19 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/MAINTAINERS b/MAINTAINERS
index d2b286701..1d9aff26d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -323,6 +323,9 @@ FreeBSD UIO
 M: Bruce Richardson <bruce.richardson@intel.com>
 F: kernel/freebsd/nic_uio/
 
+Unix shared files
+F: lib/librte_eal/unix/
+
 Windows support
 M: Harini Ramakrishnan <harini.ramakrishnan@microsoft.com>
 M: Omar Cardona <ocardona@microsoft.com>
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 4f8f1af73..c52ddb967 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -8,8 +8,8 @@
 #include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -85,10 +85,8 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 
@@ -772,15 +770,15 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * and see if we succeed. If we don't, someone else is using it
 		 * already.
 		 */
-		fd = open(path, O_CREAT | O_RDWR, 0600);
+		fd = eal_file_open(path, EAL_OPEN_CREATE | EAL_OPEN_READWRITE);
 		if (fd < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't open %s: %s\n",
-					__func__, path, strerror(errno));
-			rte_errno = errno;
+				__func__, path, rte_strerror(rte_errno));
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
-					__func__, path, strerror(errno));
+				__func__, path, rte_strerror(rte_errno));
 			rte_errno = EBUSY;
 			goto fail;
 		}
@@ -789,10 +787,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -888,17 +884,14 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 
 	eal_get_fbarray_path(path, sizeof(path), arr->name);
 
-	fd = open(path, O_RDWR);
+	fd = eal_file_open(path, EAL_OPEN_READWRITE);
 	if (fd < 0) {
-		rte_errno = errno;
 		goto fail;
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -1025,7 +1018,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,7 +1035,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 869ce183a..6733a2321 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -420,4 +420,77 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
 
 void eal_free_no_trace(void *addr);
 
+/** Options for eal_file_open(). */
+enum eal_open_flags {
+	/** Open file for reading. */
+	EAL_OPEN_READONLY = 0x00,
+	/** Open file for reading and writing. */
+	EAL_OPEN_READWRITE = 0x02,
+	/**
+	 * Create the file if it doesn't exist.
+	 * New files are only accessible to the owner (0600 equivalent).
+	 */
+	EAL_OPEN_CREATE = 0x04
+};
+
+/**
+ * Open or create a file.
+ *
+ * @param path
+ *  Path to the file.
+ * @param flags
+ *  A combination of eal_open_flags controlling operation and FD behavior.
+ * @return
+ *  Open file descriptor on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_file_open(const char *path, int flags);
+
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index af95386d4..0f8741d96 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 48cc34844..331489f99 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index e301f4558..8d492897d 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal_file.c b/lib/librte_eal/unix/eal_file.c
new file mode 100644
index 000000000..1b26475ba
--- /dev/null
+++ b/lib/librte_eal/unix/eal_file.c
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_open(const char *path, int flags)
+{
+	static const int MODE_MASK = EAL_OPEN_READONLY | EAL_OPEN_READWRITE;
+
+	int ret, sys_flags;
+
+	switch (flags & MODE_MASK) {
+	case EAL_OPEN_READONLY:
+		sys_flags = O_RDONLY;
+		break;
+	case EAL_OPEN_READWRITE:
+		sys_flags = O_RDWR;
+		break;
+	default:
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (flags & EAL_OPEN_CREATE)
+		sys_flags |= O_CREAT;
+
+	ret = open(path, sys_flags, 0600);
+	if (ret < 0)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..21029ba1a
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal_file.c',
+)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-12 10:47               ` Thomas Monjalon
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
                               ` (9 subsequent siblings)
  12 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson, Ray Kinsella, Neil Horman

Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:

* rte_mem_map()
* rte_mem_unmap()
* rte_mem_page_size()
* rte_mem_lock()

Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive. New symbols are internal. Being thin wrappers, they require
no special maintenance.
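
For instance, common code can manage a reserved region through the
private wrappers like this (sketch; flags per the definitions below):

void *va = eal_mem_reserve(NULL, size, 0);
if (va == NULL)
	return -1; /* rte_errno is set */
eal_mem_set_dump(va, size, false); /* exclude from core dumps */
/* ... map hugepages into the region, use it ... */
eal_mem_free(va, size);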

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_fbarray.c |  40 +++---
 lib/librte_eal/common/eal_common_memory.c  |  61 ++++-----
 lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
 lib/librte_eal/freebsd/Makefile            |   1 +
 lib/librte_eal/include/rte_eal_memory.h    |  93 +++++++++++++
 lib/librte_eal/linux/Makefile              |   1 +
 lib/librte_eal/linux/eal_memalloc.c        |   5 +-
 lib/librte_eal/rte_eal_version.map         |   9 ++
 lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |   1 +
 10 files changed, 376 insertions(+), 65 deletions(-)
 create mode 100644 lib/librte_eal/include/rte_eal_memory.h
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index c52ddb967..98fd6e1f6 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,16 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
 #include <string.h>
 #include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
+#include <rte_eal_memory.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -90,12 +91,9 @@ resize_and_map(int fd, void *addr, size_t len)
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -733,7 +731,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -754,11 +752,13 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		static const int flags = RTE_MAP_FORCE_ADDRESS |
+			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
+		void *new_data = rte_mem_map(data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
-					__func__, strerror(errno));
+					__func__, rte_strerror(rte_errno));
 			goto fail;
 		}
 	} else {
@@ -820,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -858,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -909,7 +909,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -937,8 +937,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -957,7 +956,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -992,8 +991,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1042,7 +1040,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4c897a13f..f9fbd3e4e 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,13 +11,13 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_eal_memconfig.h>
+#include <rte_eal_memory.h>
 #include <rte_errno.h>
 #include <rte_log.h>
 
@@ -40,18 +40,10 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
-#ifdef RTE_EXEC_ENV_LINUX
-#define RTE_DONTDUMP MADV_DONTDUMP
-#elif defined RTE_EXEC_ENV_FREEBSD
-#define RTE_DONTDUMP MADV_NOCORE
-#else
-#error "madvise doesn't support this OS"
-#endif
-
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, int reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -59,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_mem_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -105,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
 			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -132,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -161,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -179,19 +166,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	if (!unmap) {
 		/* Exclude these pages from a core dump. */
-		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
-			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
-				strerror(errno));
+		eal_mem_set_dump(aligned_addr, *size, false);
 	}
 
 	return aligned_addr;
@@ -547,10 +532,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	size_t page_size = rte_mem_page_size();
+	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 6733a2321..3173f1d67 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/**
+	 * Reserve hugepages. May be unsupported by some platforms.
+	 */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/**
+	 * Force reserving memory at the requested address.
+	 * This can be a destructive action depending on the implementation.
+	 *
+	 * @see RTE_MAP_FORCE_ADDRESS for description of possible consequences
+	 *      (although implementations are not required to use it).
+	 */
+	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to eal_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+		size_t page_sz, int flags, int reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -493,4 +512,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
 int
 eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address, which must be page-aligned.
+ *  The system might not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options, a combination of eal_mem_reserve_flags.
+ * @returns
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until remapping it.
+ */
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If *virt* and *size* describe a part of the reserved region,
+ * only this part of the region is freed (accurately up to the system
+ * page size). If *virt* points to allocated memory, *size* must match
+ * the one specified on allocation. The behavior is undefined
+ * if the memory pointed by *virt* is obtained from another source
+ * than listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void
+eal_mem_free(void *virt, size_t size);
+
+/**
+ * Configure memory region inclusion into core dumps.
+ *
+ * @param virt
+ *  Starting address of the region.
+ * @param size
+ *  Size of the region.
+ * @param dump
+ *  True to include memory into core dumps, false to exclude.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index 0f8741d96..2374ba0b7 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
diff --git a/lib/librte_eal/include/rte_eal_memory.h b/lib/librte_eal/include/rte_eal_memory.h
new file mode 100644
index 000000000..0c5ef309d
--- /dev/null
+++ b/lib/librte_eal/include/rte_eal_memory.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+/** @file Memory management wrappers used across DPDK. */
+
+/** Memory protection flags. */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/** Additional flags for memory mapping. */
+enum rte_map_flags {
+	/** Changes to the mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/**
+	 * Force mapping to the requested address. This flag should be used
+	 * with caution, because to fulfill the request implementation
+	 * may remove all other mappings in the requested region. However,
+	 * it is not required to do so, thus mapping with this flag may fail.
+	 */
+	RTE_MAP_FORCE_ADDRESS = 1 << 3
+};
+
+/**
+ * Map a portion of an opened file or the page file into memory.
+ *
+ * This function is similar to POSIX mmap(3) with common MAP_ANONYMOUS
+ * extension, except for the return value.
+ *
+ * @param requested_addr
+ *  Desired virtual address for mapping. Can be NULL to let OS choose.
+ * @param size
+ *  Size of the mapping in bytes.
+ * @param prot
+ *  Protection flags, a combination of rte_mem_prot values.
+ * @param flags
+ *  Additional mapping flags, a combination of rte_map_flags.
+ * @param fd
+ *  Mapped file descriptor. Can be negative for anonymous mapping.
+ * @param offset
+ *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
+ * @return
+ *  Mapped address or NULL on failure and rte_errno is set to OS error.
+ */
+__rte_internal
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_internal
+int
+rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Page size in bytes.
+ */
+__rte_internal
+size_t
+rte_mem_page_size(void);
+
+/**
+ * Lock in physical memory all pages crossed by the address region.
+ *
+ * @param virt
+ *   Base virtual address of the region.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @see rte_mem_page_size() to retrieve the page size.
+ * @see rte_mem_lock_page() to lock an entire single page.
+ */
+__rte_internal
+int
+rte_mem_lock(const void *virt, size_t size);
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 331489f99..8febf2212 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index 2c717f8bd..bf29b83c6 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 mapped:
 	munmap(addr, alloc_sz);
 unmapped:
-	flags = MAP_FIXED;
+	flags = EAL_RESERVE_FORCE_ADDRESS;
 	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
 	if (new_addr != addr) {
 		if (new_addr != NULL)
@@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
-		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
+	eal_mem_set_dump(ms->addr, ms->len, false);
 
 	exit_early = false;
 
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d8038749a..196eef5af 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -387,3 +387,12 @@ EXPERIMENTAL {
 	rte_trace_regexp;
 	rte_trace_save;
 };
+
+INTERNAL {
+	global:
+
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_page_size;
+	rte_mem_unmap;
+};
diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
new file mode 100644
index 000000000..4dd891667
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix_memory.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_eal_memory.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+
+#include "eal_private.h"
+
+#ifdef RTE_EXEC_ENV_LINUX
+#define EAL_DONTDUMP MADV_DONTDUMP
+#define EAL_DODUMP   MADV_DODUMP
+#elif defined RTE_EXEC_ENV_FREEBSD
+#define EAL_DONTDUMP MADV_NOCORE
+#define EAL_DODUMP   MADV_CORE
+#else
+#error "madvise doesn't support this OS"
+#endif
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+#ifdef MAP_HUGETLB
+		sys_flags |= MAP_HUGETLB;
+#else
+		rte_errno = ENOTSUP;
+		return NULL;
+#endif
+	}
+
+	if (flags & EAL_RESERVE_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
+	int ret = madvise(virt, size, flags);
+	if (ret) {
+		rte_errno = errno;
+		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
+				virt, size, flags, strerror(errno));
+	}
+	return ret;
+}
+
+static int
+mem_rte_to_sys_prot(int prot)
+{
+	int sys_prot = PROT_NONE;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	int sys_flags = 0;
+	int sys_prot;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static size_t page_size;
+
+	if (!page_size)
+		page_size = sysconf(_SC_PAGESIZE);
+
+	return page_size;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	int ret = mlock(virt, size);
+	if (ret)
+		rte_errno = errno;
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index 21029ba1a..e733910a1 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal_file.c',
+	'eal_unix_memory.c',
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (2 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-12 15:39               ` Thomas Monjalon
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
                               ` (8 subsequent siblings)
  12 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson

All supported OSes create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
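
The extracted helpers are intended to be called in sequence, roughly
(sketch; the populate step applies to the --no-huge path only):

if (eal_memseg_list_init(msl, page_sz, n_segs, socket_id,
		type_msl_idx, true) < 0)
	return -1;
if (eal_memseg_list_alloc(msl, 0) < 0)
	return -1;
/* --no-huge: mark every page-long segment as used right away. */
eal_memseg_list_populate(msl, msl->base_va, n_segs);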

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c |  95 +++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  62 +++++++++++
 lib/librte_eal/freebsd/eal_memory.c       |  94 ++++-------------
 lib/librte_eal/linux/eal_memory.c         | 119 +++++-----------------
 4 files changed, 203 insertions(+), 167 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index f9fbd3e4e..76cf87c1f 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -25,6 +25,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -182,6 +183,100 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+		uint64_t page_sz, int n_segs, int socket_id, bool heap)
+{
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL,
+		"Memseg list allocated at socket %i, page size 0x%zxkB\n",
+		socket_id, (size_t)page_sz >> 10);
+
+	return 0;
+}
+
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+
+	return eal_memseg_list_init_named(
+		msl, name, page_sz, n_segs, socket_id, heap);
+}
+
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
+{
+	size_t page_sz, mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(
+		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
+	if (addr == NULL) {
+#ifndef RTE_EXEC_ENV_WINDOWS
+		/* The hint would be misleading on Windows, but this function
+		 * is called from many places, including common code,
+		 * so don't duplicate the message.
+		 */
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+#endif
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	RTE_LOG(DEBUG, EAL, "VA reserved for memseg list at %p, size %zx\n",
+			addr, mem_sz);
+
+	return 0;
+}
+
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
+{
+	size_t page_sz = msl->page_sz;
+	int i;
+
+	for (i = 0; i < n_segs; i++) {
+		struct rte_fbarray *arr = &msl->memseg_arr;
+		struct rte_memseg *ms = rte_fbarray_get(arr, i);
+
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			ms->iova = (uintptr_t)addr;
+		else
+			ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, i);
+
+		addr = RTE_PTR_ADD(addr, page_sz);
+	}
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 3173f1d67..1ec51b2eb 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -254,6 +254,68 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int reserve_flags);
 
+/**
+ * Initialize a memory segment list and create its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param name
+ *  Name for the backing storage.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+	uint64_t page_sz, int n_segs, int socket_id, bool heap);
+
+/**
+ * Initialize memory segment list and create its backing storage
+ * with a name corresponding to MSL parameters.
+ *
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ *
+ * @see eal_memseg_list_init_named for remaining parameters description.
+ */
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
+/**
+ * Reserve VA space for a memory segment list
+ * previously initialized with eal_memseg_list_init().
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param reserve_flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
+
+/**
+ * Populate MSL, each segment is one page long.
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param addr
+ *  Starting address of list segments.
+ * @param n_segs
+ *  Number of segments to populate.
+ */
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5bc2da160..29c3ed5a9 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -66,53 +66,34 @@ rte_eal_hugepage_init(void)
 		struct rte_memseg_list *msl;
 		struct rte_fbarray *arr;
 		struct rte_memseg *ms;
-		uint64_t page_sz;
+		uint64_t mem_sz, page_sz;
 		int n_segs, cur_seg;
 
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-				sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
-		addr = mmap(NULL, internal_config.memory,
-				PROT_READ | PROT_WRITE,
+		addr = mmap(NULL, mem_sz, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (addr == MAP_FAILED) {
 			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
 					strerror(errno));
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->len = internal_config.memory;
-		msl->socket_id = 0;
-		msl->heap = 1;
-
-		/* populate memsegs. each memseg is 1 page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->len = page_sz;
-			ms->socket_id = 0;
+		msl->base_va = addr;
+		msl->len = mem_sz;
 
-			rte_fbarray_set_used(arr, cur_seg);
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			addr = RTE_PTR_ADD(addr, page_sz);
-		}
 		return 0;
 	}
 
@@ -336,64 +317,25 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
 	int flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
@@ -479,7 +421,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_init(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
@@ -487,7 +429,7 @@ memseg_primary_init(void)
 			total_type_mem = total_segs * hugepage_sz;
 			type_msl_idx++;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				return -1;
 			}
@@ -518,7 +460,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..d9de30e8b 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, 0);
 }
 
 /*
@@ -1009,13 +969,17 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_init(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_alloc(msl) < 0) {
+				RTE_LOG(ERR, EAL,
+					"Cannot preallocate %zukB hugepages\n",
+					page_sz >> 10);
 				return -1;
+			}
 		}
 	}
 	return 0;
@@ -1323,8 +1287,6 @@ eal_legacy_hugepage_init(void)
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
 	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	struct rte_fbarray *arr;
-	struct rte_memseg *ms;
 
 	uint64_t memory[RTE_MAX_NUMA_NODES];
 
@@ -1343,7 +1305,7 @@ eal_legacy_hugepage_init(void)
 		void *prealloc_addr;
 		size_t mem_sz;
 		struct rte_memseg_list *msl;
-		int n_segs, cur_seg, fd, flags;
+		int n_segs, fd, flags;
 #ifdef MEMFD_SUPPORTED
 		int memfd;
 #endif
@@ -1358,12 +1320,12 @@ eal_legacy_hugepage_init(void)
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-					sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
@@ -1400,16 +1362,12 @@ eal_legacy_hugepage_init(void)
 		/* preallocate address space for the memory, so that it can be
 		 * fit into the DMA mask.
 		 */
-		mem_sz = internal_config.memory;
-		prealloc_addr = eal_get_virtual_area(
-				NULL, &mem_sz, page_sz, 0, 0);
-		if (prealloc_addr == NULL) {
-			RTE_LOG(ERR, EAL,
-					"%s: reserving memory area failed: "
-					"%s\n",
-					__func__, strerror(errno));
+		if (eal_memseg_list_alloc(msl, 0)) {
+			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
+
+		prealloc_addr = msl->base_va;
 		addr = mmap(prealloc_addr, mem_sz, PROT_READ | PROT_WRITE,
 				flags | MAP_FIXED, fd, 0);
 		if (addr == MAP_FAILED || addr != prealloc_addr) {
@@ -1418,11 +1376,6 @@ eal_legacy_hugepage_init(void)
 			munmap(prealloc_addr, mem_sz);
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->socket_id = 0;
-		msl->len = mem_sz;
-		msl->heap = 1;
 
 		/* we're in single-file segments mode, so only the segment list
 		 * fd needs to be set up.
@@ -1434,24 +1387,8 @@ eal_legacy_hugepage_init(void)
 			}
 		}
 
-		/* populate memsegs. each memseg is one page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->socket_id = 0;
-			ms->len = page_sz;
-
-			rte_fbarray_set_used(arr, cur_seg);
-
-			addr = RTE_PTR_ADD(addr, (size_t)page_sz);
-		}
 		if (mcfg->dma_maskbits &&
 		    rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
 			RTE_LOG(ERR, EAL,
@@ -2191,7 +2128,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_init(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2137,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_alloc(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2332,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_init(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2370,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4

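For reference, the three helpers declared in this patch compose as
follows. This is a minimal sketch, assuming the no-huge flow shown in
the diff above; the wrapper name is invented, and real callers map
actual memory at msl->base_va between reservation and population:

    #include <rte_memory.h>

    #include "eal_private.h"

    /* Sketch only, not part of the patch. */
    static int
    example_nohuge_init(struct rte_memseg_list *msl, size_t mem_sz)
    {
        uint64_t page_sz = RTE_PGSIZE_4K;
        int n_segs = (int)(mem_sz / page_sz);

        /* Create the fbarray backing the list. */
        if (eal_memseg_list_init_named(msl, "nohugemem", page_sz,
                n_segs, 0, true))
            return -1;

        /* Reserve VA space; 0 means no extra reservation flags. */
        if (eal_memseg_list_alloc(msl, 0))
            return -1;

        /* Mark each page-sized segment of the mapping as used. */
        eal_memseg_list_populate(msl, msl->base_va, n_segs);
        return 0;
    }
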

^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 05/11] eal/mem: extract common code for dynamic memory allocation
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (3 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 06/11] trace: add size_t field emitter Dmitry Kozlyuk
                               ` (7 subsequent siblings)
  12 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Code in the Linux EAL that supports dynamic memory allocation (as
opposed to the static allocation used by FreeBSD) is not OS-dependent
and can be reused by the Windows EAL. Move such code to a file compiled
only for the OSes that require it. Keep Anatoly Burakov as maintainer
of the extracted code.
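
For context, a minimal sketch of how an OS-specific EAL consumes the
extracted helpers; the wrapper name is illustrative, but the call
order matches the Linux call sites after this patch:

    #include "eal_private.h"

    /* Sketch only, not part of the patch: a primary process lays out
     * memseg lists with the common code, then preallocates the
     * requested hugepages in dynamic (non-legacy) mode.
     */
    static int
    example_primary_memory_init(void)
    {
        if (eal_dynmem_memseg_lists_init() < 0)
            return -1;
        return eal_dynmem_hugepage_init();
    }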

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                               |   1 +
 lib/librte_eal/common/eal_common_dynmem.c | 521 +++++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  43 +-
 lib/librte_eal/common/meson.build         |   4 +
 lib/librte_eal/freebsd/eal_memory.c       |  12 +-
 lib/librte_eal/linux/Makefile             |   1 +
 lib/librte_eal/linux/eal_memory.c         | 523 +---------------------
 7 files changed, 582 insertions(+), 523 deletions(-)
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d9aff26d..a1722ca73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -208,6 +208,7 @@ F: lib/librte_eal/include/rte_fbarray.h
 F: lib/librte_eal/include/rte_mem*
 F: lib/librte_eal/include/rte_malloc.h
 F: lib/librte_eal/common/*malloc*
+F: lib/librte_eal/common/eal_common_dynmem.c
 F: lib/librte_eal/common/eal_common_fbarray.c
 F: lib/librte_eal/common/eal_common_mem*
 F: lib/librte_eal/common/eal_hugepages.h
diff --git a/lib/librte_eal/common/eal_common_dynmem.c b/lib/librte_eal/common/eal_common_dynmem.c
new file mode 100644
index 000000000..6b07672d0
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_dynmem.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation.
+ * Copyright(c) 2013 6WIND S.A.
+ */
+
+#include <inttypes.h>
+#include <string.h>
+
+#include <rte_log.h>
+#include <rte_string_fns.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+
+/** @file Functions common to EALs that support dynamic memory allocation. */
+
+int
+eal_dynmem_memseg_lists_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
+			/* we can still sort pages by socket in legacy mode */
+			if (!internal_config.legacy_mem && socket_id > 0)
+				break;
+#endif
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how much segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (eal_memseg_list_init(msl, pagesz, n_segs,
+					socket_id, cur_seglist, true))
+				goto out;
+
+			if (eal_memseg_list_alloc(msl, 0)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int __rte_unused
+hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct hugepage_info *hpi = arg;
+
+	if (msl->page_sz != hpi->hugepage_sz)
+		return 0;
+
+	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
+	return 0;
+}
+
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+int
+eal_dynmem_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+#ifndef RTE_ARCH_64
+		struct hugepage_info dummy;
+		unsigned int i;
+#endif
+		/* also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+
+#ifndef RTE_ARCH_64
+		/* for 32-bit, limit number of pages on socket to whatever we've
+		 * preallocated, as we cannot allocate more.
+		 */
+		memset(&dummy, 0, sizeof(dummy));
+		dummy.hugepage_sz = hpi->hugepage_sz;
+		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
+			return -1;
+
+		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
+			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
+					dummy.num_pages[i]);
+		}
+#endif
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
+		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
+
+	/* calculate final number of pages */
+	if (eal_dynmem_calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M "
+				"on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+__rte_unused /* function is unused on 32-bit builds */
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+int
+eal_dynmem_calc_num_pages_per_socket(
+	uint64_t *memory, struct hugepage_info *hp_info,
+	struct hugepage_info *hp_used, unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+#ifdef RTE_ARCH_64
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from CPU mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+#else
+		/* in 32-bit mode, allocate all of the memory only on master
+		 * lcore socket
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			struct rte_config *cfg = rte_eal_get_configuration();
+			unsigned int master_lcore_socket;
+
+			master_lcore_socket =
+				rte_lcore_to_socket_id(cfg->master_lcore);
+
+			if (master_lcore_socket != socket)
+				continue;
+
+			/* Update sizes */
+			memory[socket] = total_size;
+			break;
+		}
+#endif
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skips if the memory on specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			rte_strscpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so lets see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket*/
+			}
+		}
+
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int)(internal_config.memory / 0x100000);
+		available = requested - (unsigned int)(total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 1ec51b2eb..2a780f513 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -13,6 +13,8 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+
 /**
  * Structure storing internal configuration (per-lcore)
  */
@@ -316,6 +318,45 @@ eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
 void
 eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
 
+/**
+ * Distribute available memory between MSLs.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_memseg_lists_init(void);
+
+/**
+ * Preallocate hugepages for dynamic allocation.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_hugepage_init(void);
+
+/**
+ * Given the list of hugepage sizes and the number of pages thereof,
+ * calculate the best number of pages of each size to fulfill the request
+ * for RAM on each NUMA node.
+ *
+ * @param memory
+ *  Amounts of memory requested per NUMA node (array of RTE_MAX_NUMA_NODES).
+ * @param hp_info
+ *  Information about hugepages of different size.
+ * @param hp_used
+ *  Receives information about used hugepages of each size.
+ * @param num_hp_info
+ *  Number of elements in hp_info and hp_used.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_calc_num_pages_per_socket(
+		uint64_t *memory, struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used, unsigned int num_hp_info);
+
 /**
  * Get cpu core_id.
  *
@@ -595,7 +636,7 @@ void *
 eal_mem_reserve(void *requested_addr, size_t size, int flags);
 
 /**
- * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ * Free memory obtained by eal_mem_reserve() and possibly allocated.
  *
  * If *virt* and *size* describe a part of the reserved region,
  * only this part of the region is freed (accurately up to the system
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 55aaeb18e..d91c22220 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -56,3 +56,7 @@ sources += files(
 	'rte_reciprocal.c',
 	'rte_service.c',
 )
+
+if is_linux
+	sources += files('eal_common_dynmem.c')
+endif
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 29c3ed5a9..7106b8b84 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -317,14 +317,6 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
-}
-
 static int
 memseg_list_alloc(struct rte_memseg_list *msl)
 {
@@ -421,8 +413,8 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (memseg_list_init(msl, hugepage_sz, n_segs,
-					0, type_msl_idx))
+			if (eal_memseg_list_init(msl, hugepage_sz, n_segs,
+					0, type_msl_idx, false))
 				return -1;
 
 			total_segs += msl->memseg_arr.len;
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 8febf2212..07ce643ba 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -50,6 +50,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_timer.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memzone.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_launch.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_dynmem.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_mcfg.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memalloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memory.c
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index d9de30e8b..5e6c844c2 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -812,20 +812,6 @@ memseg_list_free(struct rte_memseg_list *msl)
 	return 0;
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
-}
-
-static int
-memseg_list_alloc(struct rte_memseg_list *msl)
-{
-	return eal_memseg_list_alloc(msl, 0);
-}
-
 /*
  * Our VA space is not preallocated yet, so preallocate it here. We need to know
  * how many segments there are in order to map all pages into one address space,
@@ -969,12 +955,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (memseg_list_init(msl, page_sz, n_segs, socket,
-						msl_idx) < 0)
+			if (eal_memseg_list_init(msl, page_sz, n_segs,
+					socket, msl_idx, true) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (memseg_list_alloc(msl) < 0) {
+			if (eal_memseg_list_alloc(msl, 0) < 0) {
 				RTE_LOG(ERR, EAL,
 					"Cannot preallocate %zukB hugepages\n",
 					page_sz >> 10);
@@ -1049,182 +1035,6 @@ remap_needed_hugepages(struct hugepage_file *hugepages, int n_pages)
 	return 0;
 }
 
-__rte_unused /* function is unused on 32-bit builds */
-static inline uint64_t
-get_socket_mem_size(int socket)
-{
-	uint64_t size = 0;
-	unsigned i;
-
-	for (i = 0; i < internal_config.num_hugepage_sizes; i++){
-		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		size += hpi->hugepage_sz * hpi->num_pages[socket];
-	}
-
-	return size;
-}
-
-/*
- * This function is a NUMA-aware equivalent of calc_num_pages.
- * It takes in the list of hugepage sizes and the
- * number of pages thereof, and calculates the best number of
- * pages of each size to fulfill the request for <memory> ram
- */
-static int
-calc_num_pages_per_socket(uint64_t * memory,
-		struct hugepage_info *hp_info,
-		struct hugepage_info *hp_used,
-		unsigned num_hp_info)
-{
-	unsigned socket, j, i = 0;
-	unsigned requested, available;
-	int total_num_pages = 0;
-	uint64_t remaining_mem, cur_mem;
-	uint64_t total_mem = internal_config.memory;
-
-	if (num_hp_info == 0)
-		return -1;
-
-	/* if specific memory amounts per socket weren't requested */
-	if (internal_config.force_sockets == 0) {
-		size_t total_size;
-#ifdef RTE_ARCH_64
-		int cpu_per_socket[RTE_MAX_NUMA_NODES];
-		size_t default_size;
-		unsigned lcore_id;
-
-		/* Compute number of cores per socket */
-		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
-		RTE_LCORE_FOREACH(lcore_id) {
-			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
-		}
-
-		/*
-		 * Automatically spread requested memory amongst detected sockets according
-		 * to number of cores from cpu mask present on each socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-
-			/* Set memory amount per socket */
-			default_size = (internal_config.memory * cpu_per_socket[socket])
-					/ rte_lcore_count();
-
-			/* Limit to maximum available memory on socket */
-			default_size = RTE_MIN(default_size, get_socket_mem_size(socket));
-
-			/* Update sizes */
-			memory[socket] = default_size;
-			total_size -= default_size;
-		}
-
-		/*
-		 * If some memory is remaining, try to allocate it by getting all
-		 * available memory from sockets, one after the other
-		 */
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-			/* take whatever is available */
-			default_size = RTE_MIN(get_socket_mem_size(socket) - memory[socket],
-					       total_size);
-
-			/* Update sizes */
-			memory[socket] += default_size;
-			total_size -= default_size;
-		}
-#else
-		/* in 32-bit mode, allocate all of the memory only on master
-		 * lcore socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
-				socket++) {
-			struct rte_config *cfg = rte_eal_get_configuration();
-			unsigned int master_lcore_socket;
-
-			master_lcore_socket =
-				rte_lcore_to_socket_id(cfg->master_lcore);
-
-			if (master_lcore_socket != socket)
-				continue;
-
-			/* Update sizes */
-			memory[socket] = total_size;
-			break;
-		}
-#endif
-	}
-
-	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0; socket++) {
-		/* skips if the memory on specific socket wasn't requested */
-		for (i = 0; i < num_hp_info && memory[socket] != 0; i++){
-			strlcpy(hp_used[i].hugedir, hp_info[i].hugedir,
-				sizeof(hp_used[i].hugedir));
-			hp_used[i].num_pages[socket] = RTE_MIN(
-					memory[socket] / hp_info[i].hugepage_sz,
-					hp_info[i].num_pages[socket]);
-
-			cur_mem = hp_used[i].num_pages[socket] *
-					hp_used[i].hugepage_sz;
-
-			memory[socket] -= cur_mem;
-			total_mem -= cur_mem;
-
-			total_num_pages += hp_used[i].num_pages[socket];
-
-			/* check if we have met all memory requests */
-			if (memory[socket] == 0)
-				break;
-
-			/* check if we have any more pages left at this size, if so
-			 * move on to next size */
-			if (hp_used[i].num_pages[socket] == hp_info[i].num_pages[socket])
-				continue;
-			/* At this point we know that there are more pages available that are
-			 * bigger than the memory we want, so lets see if we can get enough
-			 * from other page sizes.
-			 */
-			remaining_mem = 0;
-			for (j = i+1; j < num_hp_info; j++)
-				remaining_mem += hp_info[j].hugepage_sz *
-				hp_info[j].num_pages[socket];
-
-			/* is there enough other memory, if not allocate another page and quit */
-			if (remaining_mem < memory[socket]){
-				cur_mem = RTE_MIN(memory[socket],
-						hp_info[i].hugepage_sz);
-				memory[socket] -= cur_mem;
-				total_mem -= cur_mem;
-				hp_used[i].num_pages[socket]++;
-				total_num_pages++;
-				break; /* we are done with this socket*/
-			}
-		}
-		/* if we didn't satisfy all memory requirements per socket */
-		if (memory[socket] > 0 &&
-				internal_config.socket_mem[socket] != 0) {
-			/* to prevent icc errors */
-			requested = (unsigned) (internal_config.socket_mem[socket] /
-					0x100000);
-			available = requested -
-					((unsigned) (memory[socket] / 0x100000));
-			RTE_LOG(ERR, EAL, "Not enough memory available on socket %u! "
-					"Requested: %uMB, available: %uMB\n", socket,
-					requested, available);
-			return -1;
-		}
-	}
-
-	/* if we didn't satisfy total memory requirements */
-	if (total_mem > 0) {
-		requested = (unsigned) (internal_config.memory / 0x100000);
-		available = requested - (unsigned) (total_mem / 0x100000);
-		RTE_LOG(ERR, EAL, "Not enough memory available! Requested: %uMB,"
-				" available: %uMB\n", requested, available);
-		return -1;
-	}
-	return total_num_pages;
-}
-
 static inline size_t
 eal_get_hugepage_mem_size(void)
 {
@@ -1530,7 +1340,7 @@ eal_legacy_hugepage_init(void)
 		memory[i] = internal_config.socket_mem[i];
 
 	/* calculate final number of pages */
-	nr_hugepages = calc_num_pages_per_socket(memory,
+	nr_hugepages = eal_dynmem_calc_num_pages_per_socket(memory,
 			internal_config.hugepage_info, used_hp,
 			internal_config.num_hugepage_sizes);
 
@@ -1657,140 +1467,6 @@ eal_legacy_hugepage_init(void)
 	return -1;
 }
 
-static int __rte_unused
-hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
-{
-	struct hugepage_info *hpi = arg;
-
-	if (msl->page_sz != hpi->hugepage_sz)
-		return 0;
-
-	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
-	return 0;
-}
-
-static int
-limits_callback(int socket_id, size_t cur_limit, size_t new_len)
-{
-	RTE_SET_USED(socket_id);
-	RTE_SET_USED(cur_limit);
-	RTE_SET_USED(new_len);
-	return -1;
-}
-
-static int
-eal_hugepage_init(void)
-{
-	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	uint64_t memory[RTE_MAX_NUMA_NODES];
-	int hp_sz_idx, socket_id;
-
-	memset(used_hp, 0, sizeof(used_hp));
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-#ifndef RTE_ARCH_64
-		struct hugepage_info dummy;
-		unsigned int i;
-#endif
-		/* also initialize used_hp hugepage sizes in used_hp */
-		struct hugepage_info *hpi;
-		hpi = &internal_config.hugepage_info[hp_sz_idx];
-		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
-
-#ifndef RTE_ARCH_64
-		/* for 32-bit, limit number of pages on socket to whatever we've
-		 * preallocated, as we cannot allocate more.
-		 */
-		memset(&dummy, 0, sizeof(dummy));
-		dummy.hugepage_sz = hpi->hugepage_sz;
-		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
-			return -1;
-
-		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
-			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
-					dummy.num_pages[i]);
-		}
-#endif
-	}
-
-	/* make a copy of socket_mem, needed for balanced allocation. */
-	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
-		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
-
-	/* calculate final number of pages */
-	if (calc_num_pages_per_socket(memory,
-			internal_config.hugepage_info, used_hp,
-			internal_config.num_hugepage_sizes) < 0)
-		return -1;
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
-				socket_id++) {
-			struct rte_memseg **pages;
-			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
-			unsigned int num_pages = hpi->num_pages[socket_id];
-			unsigned int num_pages_alloc;
-
-			if (num_pages == 0)
-				continue;
-
-			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %" PRIu64 "M on socket %i\n",
-				num_pages, hpi->hugepage_sz >> 20, socket_id);
-
-			/* we may not be able to allocate all pages in one go,
-			 * because we break up our memory map into multiple
-			 * memseg lists. therefore, try allocating multiple
-			 * times and see if we can get the desired number of
-			 * pages from multiple allocations.
-			 */
-
-			num_pages_alloc = 0;
-			do {
-				int i, cur_pages, needed;
-
-				needed = num_pages - num_pages_alloc;
-
-				pages = malloc(sizeof(*pages) * needed);
-
-				/* do not request exact number of pages */
-				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
-						needed, hpi->hugepage_sz,
-						socket_id, false);
-				if (cur_pages <= 0) {
-					free(pages);
-					return -1;
-				}
-
-				/* mark preallocated pages as unfreeable */
-				for (i = 0; i < cur_pages; i++) {
-					struct rte_memseg *ms = pages[i];
-					ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
-				}
-				free(pages);
-
-				num_pages_alloc += cur_pages;
-			} while (num_pages_alloc != num_pages);
-		}
-	}
-	/* if socket limits were specified, set them */
-	if (internal_config.force_socket_limits) {
-		unsigned int i;
-		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-			uint64_t limit = internal_config.socket_limit[i];
-			if (limit == 0)
-				continue;
-			if (rte_mem_alloc_validator_register("socket-limit",
-					limits_callback, i, limit))
-				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
-		}
-	}
-	return 0;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -1949,7 +1625,7 @@ rte_eal_hugepage_init(void)
 {
 	return internal_config.legacy_mem ?
 			eal_legacy_hugepage_init() :
-			eal_hugepage_init();
+			eal_dynmem_hugepage_init();
 }
 
 int
@@ -2128,8 +1804,9 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (memseg_list_init(msl, hugepage_sz, n_segs,
-						socket_id, type_msl_idx)) {
+				if (eal_memseg_list_init(msl, hugepage_sz,
+						n_segs, socket_id, type_msl_idx,
+						true)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
 					 */
@@ -2137,7 +1814,7 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (memseg_list_alloc(msl)) {
+				if (eal_memseg_list_alloc(msl, 0)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
@@ -2168,185 +1845,7 @@ memseg_primary_init_32(void)
 static int __rte_unused
 memseg_primary_init(void)
 {
-	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	struct memtype {
-		uint64_t page_sz;
-		int socket_id;
-	} *memtypes = NULL;
-	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
-	struct rte_memseg_list *msl;
-	uint64_t max_mem, max_mem_per_type;
-	unsigned int max_seglists_per_type;
-	unsigned int n_memtypes, cur_type;
-
-	/* no-huge does not need this at all */
-	if (internal_config.no_hugetlbfs)
-		return 0;
-
-	/*
-	 * figuring out amount of memory we're going to have is a long and very
-	 * involved process. the basic element we're operating with is a memory
-	 * type, defined as a combination of NUMA node ID and page size (so that
-	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
-	 *
-	 * deciding amount of memory going towards each memory type is a
-	 * balancing act between maximum segments per type, maximum memory per
-	 * type, and number of detected NUMA nodes. the goal is to make sure
-	 * each memory type gets at least one memseg list.
-	 *
-	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
-	 *
-	 * the total amount of memory per type is limited by either
-	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
-	 * of detected NUMA nodes. additionally, maximum number of segments per
-	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
-	 * smaller page sizes, it can take hundreds of thousands of segments to
-	 * reach the above specified per-type memory limits.
-	 *
-	 * additionally, each type may have multiple memseg lists associated
-	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
-	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
-	 *
-	 * the number of memseg lists per type is decided based on the above
-	 * limits, and also taking number of detected NUMA nodes, to make sure
-	 * that we don't run out of memseg lists before we populate all NUMA
-	 * nodes with memory.
-	 *
-	 * we do this in three stages. first, we collect the number of types.
-	 * then, we figure out memory constraints and populate the list of
-	 * would-be memseg lists. then, we go ahead and allocate the memseg
-	 * lists.
-	 */
-
-	/* create space for mem types */
-	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
-	memtypes = calloc(n_memtypes, sizeof(*memtypes));
-	if (memtypes == NULL) {
-		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
-		return -1;
-	}
-
-	/* populate mem types */
-	cur_type = 0;
-	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
-			hpi_idx++) {
-		struct hugepage_info *hpi;
-		uint64_t hugepage_sz;
-
-		hpi = &internal_config.hugepage_info[hpi_idx];
-		hugepage_sz = hpi->hugepage_sz;
-
-		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
-			int socket_id = rte_socket_id_by_idx(i);
-
-#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
-			/* we can still sort pages by socket in legacy mode */
-			if (!internal_config.legacy_mem && socket_id > 0)
-				break;
-#endif
-			memtypes[cur_type].page_sz = hugepage_sz;
-			memtypes[cur_type].socket_id = socket_id;
-
-			RTE_LOG(DEBUG, EAL, "Detected memory type: "
-				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
-				socket_id, hugepage_sz);
-		}
-	}
-	/* number of memtypes could have been lower due to no NUMA support */
-	n_memtypes = cur_type;
-
-	/* set up limits for types */
-	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
-	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
-			max_mem / n_memtypes);
-	/*
-	 * limit maximum number of segment lists per type to ensure there's
-	 * space for memseg lists for all NUMA nodes with all page sizes
-	 */
-	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
-
-	if (max_seglists_per_type == 0) {
-		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
-			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-		goto out;
-	}
-
-	/* go through all mem types and create segment lists */
-	msl_idx = 0;
-	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
-		unsigned int cur_seglist, n_seglists, n_segs;
-		unsigned int max_segs_per_type, max_segs_per_list;
-		struct memtype *type = &memtypes[cur_type];
-		uint64_t max_mem_per_list, pagesz;
-		int socket_id;
-
-		pagesz = type->page_sz;
-		socket_id = type->socket_id;
-
-		/*
-		 * we need to create segment lists for this type. we must take
-		 * into account the following things:
-		 *
-		 * 1. total amount of memory we can use for this memory type
-		 * 2. total amount of memory per memseg list allowed
-		 * 3. number of segments needed to fit the amount of memory
-		 * 4. number of segments allowed per type
-		 * 5. number of segments allowed per memseg list
-		 * 6. number of memseg lists we are allowed to take up
-		 */
-
-		/* calculate how much segments we will need in total */
-		max_segs_per_type = max_mem_per_type / pagesz;
-		/* limit number of segments to maximum allowed per type */
-		max_segs_per_type = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
-		/* limit number of segments to maximum allowed per list */
-		max_segs_per_list = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
-
-		/* calculate how much memory we can have per segment list */
-		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
-				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
-
-		/* calculate how many segments each segment list will have */
-		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
-
-		/* calculate how many segment lists we can have */
-		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
-				max_mem_per_type / max_mem_per_list);
-
-		/* limit number of segment lists according to our maximum */
-		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
-
-		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
-				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
-			n_seglists, n_segs, socket_id, pagesz);
-
-		/* create all segment lists */
-		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
-			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
-				RTE_LOG(ERR, EAL,
-					"No more space in memseg lists, please increase %s\n",
-					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-				goto out;
-			}
-			msl = &mcfg->memsegs[msl_idx++];
-
-			if (memseg_list_init(msl, pagesz, n_segs,
-					socket_id, cur_seglist))
-				goto out;
-
-			if (memseg_list_alloc(msl)) {
-				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
-				goto out;
-			}
-		}
-	}
-	/* we're successful */
-	ret = 0;
-out:
-	free(memtypes);
-	return ret;
+	return eal_dynmem_memseg_lists_init();
 }
 
 static int
@@ -2370,7 +1869,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (memseg_list_alloc(msl)) {
+		if (eal_memseg_list_alloc(msl, 0)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 06/11] trace: add size_t field emitter
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (4 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
                               ` (6 subsequent siblings)
  12 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, Sunil Kumar Kori,
	Olivier Matz, Andrew Rybchenko

It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
sizeof(long) == 4 while sizeof(size_t) == 8 for 64-bit programs.
Tracepoints using the "long" field emitter are therefore invalid there.
Add a dedicated field emitter for size_t and use it to store size_t
values in all existing tracepoints.
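
As an illustration, a hedged example of a tracepoint built on the new
emitter; the tracepoint name and arguments are invented, only the
macros are the ones touched by this patch:

    #include <rte_trace_point.h>

    /* Hypothetical tracepoint recording a pointer and a byte count.
     * With rte_trace_point_emit_long(), a 64-bit Windows build would
     * truncate the 8-byte size_t argument to 4 bytes.
     */
    RTE_TRACE_POINT(
        app_trace_buffer_resize,
        RTE_TRACE_POINT_ARGS(void *ptr, size_t len),
        rte_trace_point_emit_ptr(ptr);
        rte_trace_point_emit_size_t(len); /* was: emit_long(len) */
    )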

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/include/rte_eal_trace.h   |  8 ++++----
 lib/librte_eal/include/rte_trace_point.h |  3 +++
 lib/librte_mempool/rte_mempool_trace.h   | 10 +++++-----
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/include/rte_eal_trace.h b/lib/librte_eal/include/rte_eal_trace.h
index 1ebb2905a..bcfef0cfa 100644
--- a/lib/librte_eal/include/rte_eal_trace.h
+++ b/lib/librte_eal/include/rte_eal_trace.h
@@ -143,7 +143,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -154,7 +154,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -164,7 +164,7 @@ RTE_TRACE_POINT(
 	rte_eal_trace_mem_realloc,
 	RTE_TRACE_POINT_ARGS(size_t size, unsigned int align, int socket,
 		void *ptr),
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -183,7 +183,7 @@ RTE_TRACE_POINT(
 		unsigned int flags, unsigned int align, unsigned int bound,
 		const void *mz),
 	rte_trace_point_emit_string(name);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_int(socket_id);
 	rte_trace_point_emit_u32(flags);
 	rte_trace_point_emit_u32(align);
diff --git a/lib/librte_eal/include/rte_trace_point.h b/lib/librte_eal/include/rte_trace_point.h
index b45171275..377c2414a 100644
--- a/lib/librte_eal/include/rte_trace_point.h
+++ b/lib/librte_eal/include/rte_trace_point.h
@@ -138,6 +138,8 @@ _tp _args \
 #define rte_trace_point_emit_int(val)
 /** Tracepoint function payload for long datatype */
 #define rte_trace_point_emit_long(val)
+/** Tracepoint function payload for size_t datatype */
+#define rte_trace_point_emit_size_t(val)
 /** Tracepoint function payload for float datatype */
 #define rte_trace_point_emit_float(val)
 /** Tracepoint function payload for double datatype */
@@ -395,6 +397,7 @@ do { \
 #define rte_trace_point_emit_i8(in) __rte_trace_point_emit(in, int8_t)
 #define rte_trace_point_emit_int(in) __rte_trace_point_emit(in, int32_t)
 #define rte_trace_point_emit_long(in) __rte_trace_point_emit(in, long)
+#define rte_trace_point_emit_size_t(in) __rte_trace_point_emit(in, size_t)
 #define rte_trace_point_emit_float(in) __rte_trace_point_emit(in, float)
 #define rte_trace_point_emit_double(in) __rte_trace_point_emit(in, double)
 #define rte_trace_point_emit_ptr(in) __rte_trace_point_emit(in, uintptr_t)
diff --git a/lib/librte_mempool/rte_mempool_trace.h b/lib/librte_mempool/rte_mempool_trace.h
index e776df0a6..087c913c8 100644
--- a/lib/librte_mempool/rte_mempool_trace.h
+++ b/lib/librte_mempool/rte_mempool_trace.h
@@ -72,7 +72,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -84,8 +84,8 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(addr);
-	rte_trace_point_emit_long(len);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(len);
+	rte_trace_point_emit_size_t(pg_sz);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -126,7 +126,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(struct rte_mempool *mempool, size_t pg_sz),
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(pg_sz);
 )
 
 RTE_TRACE_POINT(
@@ -139,7 +139,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_u32(max_objs);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(obj_cb);
 	rte_trace_point_emit_ptr(obj_cb_arg);
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 07/11] eal/windows: add tracing support stubs
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (5 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 06/11] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                               ` (5 subsequent siblings)
  12 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

EAL common code depends on tracepoint calls, but the generic
implementation cannot be enabled on Windows due to missing standard
library facilities. Add stub functions to support tracepoint
compilation, so that common code does not have to include tracepoints
conditionally until proper support is added.
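
For illustration, a sketch of the kind of tracepoint definition that
common code must be able to compile (the tracepoint below is
hypothetical, not from DPDK); with these stubs it links on Windows, and
registration simply returns -ENOTSUP, leaving tracing disabled:

    #include <rte_trace_point.h>

    RTE_TRACE_POINT(
        example_trace_buf_alloc,    /* hypothetical tracepoint */
        RTE_TRACE_POINT_ARGS(const void *buf, size_t len),
        rte_trace_point_emit_ptr(buf);
        rte_trace_point_emit_size_t(len);
    )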

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_thread.c |  5 +---
 lib/librte_eal/common/meson.build         |  1 +
 lib/librte_eal/windows/eal.c              | 34 ++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_thread.c b/lib/librte_eal/common/eal_common_thread.c
index f9f588c17..370bb1b63 100644
--- a/lib/librte_eal/common/eal_common_thread.c
+++ b/lib/librte_eal/common/eal_common_thread.c
@@ -15,9 +15,7 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 #include <rte_log.h>
-#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_trace_point.h>
-#endif
 
 #include "eal_internal_cfg.h"
 #include "eal_private.h"
@@ -169,9 +167,8 @@ static void *rte_thread_init(void *arg)
 		free(params);
 	}
 
-#ifndef RTE_EXEC_ENV_WINDOWS
 	__rte_trace_mem_per_thread_alloc();
-#endif
+
 	return start_routine(routine_arg);
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index d91c22220..4e9208129 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -14,6 +14,7 @@ if is_windows
 		'eal_common_log.c',
 		'eal_common_options.c',
 		'eal_common_thread.c',
+		'eal_common_trace_points.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index d084606a6..e7461f731 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -17,6 +17,7 @@
 #include <eal_filesystem.h>
 #include <eal_options.h>
 #include <eal_private.h>
+#include <rte_trace_point.h>
 
 #include "eal_windows.h"
 
@@ -221,7 +222,38 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
- /* Launch threads, called at application init(). */
+/* Stubs to enable EAL trace point compilation
+ * until eal_common_trace.c can be compiled.
+ */
+
+RTE_DEFINE_PER_LCORE(volatile int, trace_point_sz);
+RTE_DEFINE_PER_LCORE(void *, trace_mem);
+
+void
+__rte_trace_mem_per_thread_alloc(void)
+{
+}
+
+void
+__rte_trace_point_emit_field(size_t sz, const char *field,
+	const char *type)
+{
+	RTE_SET_USED(sz);
+	RTE_SET_USED(field);
+	RTE_SET_USED(type);
+}
+
+int
+__rte_trace_point_register(rte_trace_point_t *trace, const char *name,
+	void (*register_fn)(void))
+{
+	RTE_SET_USED(trace);
+	RTE_SET_USED(name);
+	RTE_SET_USED(register_fn);
+	return -ENOTSUP;
+}
+
+/* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (6 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                               ` (4 subsequent siblings)
  12 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

The limited version imported previously lacks at least the SLIST
macros. Import the complete file from FreeBSD, since its license
exception is already approved by the Technical Board.
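
As a brief usage sketch of the newly available SLIST macros (example
types, not from DPDK):

    #include <sys/queue.h>
    #include <stdlib.h>

    struct entry {
        int value;
        SLIST_ENTRY(entry) link;    /* embedded forward pointer */
    };

    SLIST_HEAD(entry_list, entry);

    static int
    sum_and_clear(struct entry_list *head)
    {
        struct entry *e;
        int sum = 0;

        SLIST_FOREACH(e, head, link)
            sum += e->value;

        /* O(1) removal from the head, per the summary table above. */
        while (!SLIST_EMPTY(head)) {
            e = SLIST_FIRST(head);
            SLIST_REMOVE_HEAD(head, link);
            free(e);
        }
        return sum;
    }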

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (7 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-12 21:45               ` Thomas Monjalon
  2020-06-12 22:09               ` Thomas Monjalon
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
                               ` (3 subsequent siblings)
  12 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add an EAL-private function to map a DPDK socket ID to a NUMA node
   number (see the sketch after this list).
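
A minimal sketch of how Windows EAL code can consume the new mapping;
this mirrors the use made by the hugepage patch later in this series
(error handling omitted):

    #include <stdio.h>

    #include <rte_lcore.h>

    #include "eal_windows.h"

    static void
    log_free_memory_per_socket(void)
    {
        unsigned int socket_id;

        for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
            /* OS node number may differ from the DPDK socket ID. */
            unsigned int node = eal_socket_numa_node(socket_id);
            ULONGLONG bytes;

            if (GetNumaAvailableMemoryNodeEx(node, &bytes))
                printf("socket %u (node %u): %llu bytes available\n",
                    socket_id, node, bytes);
        }
    }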

Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal_lcore.c   | 185 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  10 ++
 2 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..9d931d50a 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,146 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
 void
-eal_create_cpu_map()
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			rte_panic("cannot get NUMA node info size, error %lu",
+				GetLastError());
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		rte_panic("cannot allocate memory for NUMA node information");
+		return;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		rte_panic("cannot get NUMA node information, error %lu",
+			GetLastError());
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups, e. g. 80 cores
+		 * of a physical processor comprise one NUMA node, but two
+		 * processor groups, because group size is limited by 32/64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
 			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
+			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
+	}
+
+exit:
+	if (full) {
+		/* RTE_LOG() may not be available, but this is important. */
+		fprintf(stderr, "Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
 	}
+
+	free(infos);
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..390d2fd66 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -26,4 +26,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 10/11] eal/windows: initialize hugepage info
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (8 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-12 21:55               ` Thomas Monjalon
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
                               ` (2 subsequent siblings)
  12 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic

Add hugepage discovery ("large pages" in Windows terminology)
and update documentation for the required privilege setup. Only 2MB
hugepages are supported, and their number is estimated only roughly,
because suitable OS APIs are either missing or unstable.
Assign myself as maintainer for the implementation file.
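
For context, a plain Win32 sketch (not DPDK code) of the allocation
that the granted and claimed privilege enables; the size must be a
multiple of GetLargePageMinimum(), and the pages are locked in memory:

    #include <windows.h>

    static void *
    alloc_one_large_page(void)
    {
        SIZE_T size = GetLargePageMinimum();    /* typically 2 MB on x86-64 */

        if (size == 0)
            return NULL;    /* large pages unsupported */
        return VirtualAlloc(NULL, size,
            MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
    }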

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                            |   4 +
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/build_dpdk.rst  |  20 -----
 doc/guides/windows_gsg/index.rst       |   1 +
 doc/guides/windows_gsg/run_apps.rst    |  47 +++++++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 8 files changed, 177 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a1722ca73..19b818f69 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -336,6 +336,10 @@ F: lib/librte_eal/windows/
 F: lib/librte_eal/rte_eal_exports.def
 F: doc/guides/windows_gsg/
 
+Windows memory allocation
+M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
+F: lib/librte_eal/eal_hugepages.c
+
 
 Core Libraries
 --------------
diff --git a/config/meson.build b/config/meson.build
index 43ab11310..c1e80de4b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -268,6 +268,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..21ac7f6c1
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,47 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminolocy) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. Privilege is applied upon next logon. In particular, if privilege has been
+   granted to current user, a logoff is required before it is available.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e7461f731..7c2fcc860 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -19,8 +19,11 @@
 #include <eal_private.h>
 #include <rte_trace_point.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -276,6 +279,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..61d0dcd3c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available on Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem on Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index adfc8b9b7..52978e9d7 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
 	'eal_thread.c',
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v8 11/11] eal/windows: implement basic memory management
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (9 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-06-10 14:27             ` Dmitry Kozlyuk
  2020-06-12 22:12               ` Thomas Monjalon
  2020-06-11 17:29             ` [dpdk-dev] [PATCH v8 00/11] Windows " Thomas Monjalon
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
  12 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:27 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

Basic memory management supports core libraries and PMDs operating in
IOVA-as-PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated from user mode. Multi-process mode is not
implemented and is forcibly disabled at startup. Assign myself as a
maintainer for the Windows file and memory management implementation.
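
For orientation, a minimal application-level sketch of the result,
using only functions exported in rte_eal_exports.def by this patch
(error handling omitted; assumes the virt2phys driver is loaded and
IOVA-as-PA mode is in effect):

    #include <inttypes.h>
    #include <stdio.h>

    #include <rte_eal.h>
    #include <rte_malloc.h>

    int
    main(int argc, char **argv)
    {
        void *buf;

        rte_eal_init(argc, argv);
        buf = rte_malloc(NULL, 4096, 0);    /* backed by a hugepage */
        /* With IOVA as PA, the IOVA is the physical address that
         * the virt2phys driver resolved for the backing page.
         */
        printf("virt %p -> iova 0x%" PRIx64 "\n",
            buf, (uint64_t)rte_malloc_virt2iova(buf));
        rte_free(buf);
        return 0;
    }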

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                                   |   1 +
 config/meson.build                            |  12 +-
 doc/guides/windows_gsg/run_apps.rst           |  54 +-
 lib/librte_eal/common/meson.build             |  11 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/windows/eal.c                  |  63 +-
 lib/librte_eal/windows/eal_file.c             | 125 +++
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  75 ++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   6 +
 18 files changed, 1771 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 19b818f69..5140756b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -339,6 +339,7 @@ F: doc/guides/windows_gsg/
 Windows memory allocation
 M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
 F: lib/librte_eal/eal_hugepages.c
+F: lib/librte_eal/eal_mem*
 
 
 Core Libraries
diff --git a/config/meson.build b/config/meson.build
index c1e80de4b..d3f05f878 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -261,15 +261,21 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it by advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..78e5a614f 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to documentation in ``dpdk-kmods`` repository for details on system
+setup, driver build and installation. This driver is not signed, so signature
+checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+Compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from Elevated Command Prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+On Windows Server additional steps are required:
+
+1. From Device Manager, Action menu, select "Add legacy hardware".
+2. It will launch the "Add Hardware Wizard". Click "Next".
+3. Select second option "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, "Kernel bypass" will be shown as a device class.
+5. Select it, and click "Next".
+6. The previously installed drivers will now be installed for the
+   "Virtual to physical address translator" device.
+
+When loaded successfully, the driver is shown in *Device Manager* as *Virtual
+to physical address translator* device under *Kernel bypass* category.
+Installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 4e9208129..310844269 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -8,13 +8,24 @@ if is_windows
 		'eal_common_bus.c',
 		'eal_common_class.c',
 		'eal_common_devargs.c',
+		'eal_common_dynmem.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_string_fns.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
 		'eal_common_trace_points.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index f1b73168b..9d39e58c0 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -20,6 +20,7 @@
 #include <rte_lcore.h>
 #include <rte_common.h>
 #include <rte_spinlock.h>
+
 #include <rte_eal_trace.h>
 
 #include <rte_malloc.h>
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..e2eb24f01 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,9 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
 	rte_log
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
 	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_page_size
+	rte_mem_unmap
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 7c2fcc860..a43649abc 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -94,6 +94,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -256,7 +274,7 @@ __rte_trace_point_register(rte_trace_point_t *trace, const char *name,
 	return -ENOTSUP;
 }
 
 /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
@@ -279,6 +297,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.in_memory == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.in_memory = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -290,6 +315,42 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_file.c b/lib/librte_eal/windows/eal_file.c
new file mode 100644
index 000000000..dfbe8d311
--- /dev/null
+++ b/lib/librte_eal/windows/eal_file.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <fcntl.h>
+#include <io.h>
+#include <share.h>
+#include <sys/stat.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_file_open(const char *path, int flags)
+{
+	static const int MODE_MASK = EAL_OPEN_READONLY | EAL_OPEN_READWRITE;
+
+	int fd, ret, sys_flags;
+
+	switch (flags & MODE_MASK) {
+	case EAL_OPEN_READONLY:
+		sys_flags = _O_RDONLY;
+		break;
+	case EAL_OPEN_READWRITE:
+		sys_flags = _O_RDWR;
+		break;
+	default:
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (flags & EAL_OPEN_CREATE)
+		sys_flags |= _O_CREAT;
+
+	ret = _sopen_s(&fd, path, sys_flags, _SH_DENYNO, _S_IWRITE);
+	if (ret != 0) {
+		rte_errno = errno;
+		return -1;
+	}
+
+	return fd;
+}
+
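+/* Illustrative usage from OS-independent EAL code; the "path" variable
+ * and the enclosing function are assumptions of this example:
+ *
+ *	fd = eal_file_open(path, EAL_OPEN_READWRITE | EAL_OPEN_CREATE);
+ *	if (fd < 0)
+ *		return -1;
+ *	if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0)
+ *		return -1;
+ */
+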
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
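+	/* Note: this call only moves the file pointer; it does not
+	 * change the file size by itself. The file is extended when
+	 * data or a file mapping is later created at the new size.
+	 */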
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..a7452b6e1
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,441 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+
+		/* During commit, memory is temporarily freed and might
+		 * be grabbed by a different non-EAL thread. This is a fatal
+		 * error, because it breaks MSL assumptions.
+		 */
+		if ((addr != NULL) && (addr != requested_addr)) {
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				"allocation - MSL is not VA-contiguous!\n",
+				requested_addr);
+			return -1;
+		}
+
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx(%p)", addr);
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	if (eal_mem_decommit(addr, alloc_sz) && (rte_errno == EADDRNOTAVAIL)) {
+		/* During decommitment, memory is temporarily returned
+		 * to the system and the address may become unavailable.
+		 */
+		RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+			"allocation - MSL is not VA-contiguous!\n", addr);
+	}
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len)) {
+		if (rte_errno == EADDRNOTAVAIL) {
+			/* See alloc_seg() for explanation. */
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				"allocation - MSL is not VA-contiguous!\n",
+				ms->addr);
+		}
+		return -1;
+	}
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use the
+		 * thread-unsafe version.
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..2739da346
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,710 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <inttypes.h>
+#include <io.h>
+
+#include <rte_eal_memory.h>
+#include <rte_errno.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+#include <rte_virt2phys.h>
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ *
+ * MEM_PRESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags,
+	MemExtendedParameterMax
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly as the function, so that user code does not depend
+ * on it being found at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
+	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
+	 */
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+
+		/* Contrary to the docs, Server 2016 is not supported. */
+		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
+			" is required for memory management\n");
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* defined(MEM_PRESERVE_PLACEHOLDER) */
+
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Physical addresses are always used under Windows if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	HANDLE process;
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	process = GetCurrentProcess();
+
+	virt = VirtualAlloc2(process, requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFreeEx(process, virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	void *addr;
+
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	HANDLE process;
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	process = GetCurrentProcess();
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+
+		if (VirtualQueryEx(process, requested_addr, &info,
+				sizeof(info)) != sizeof(info)) {
+			RTE_LOG_WIN32_ERR("VirtualQuery(%p)", requested_addr);
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) && !VirtualFreeEx(
+				process, requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR(
+				"VirtualFreeEx(%p, %zu, preserve placeholder)",
+				requested_addr, size);
+			return NULL;
+		}
+
+		/* Temporarily release the region to be committed.
+		 *
+		 * There is an inherent race for this memory range
+		 * if another thread allocates memory via OS API.
+		 * However, VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)
+		 * doesn't work with MEM_LARGE_PAGES on Windows Server.
+		 */
+		if (!VirtualFreeEx(process, requested_addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				requested_addr);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAlloc2(process, requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		/* Logging may overwrite GetLastError() result. */
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, commit large pages)",
+			requested_addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((requested_addr != NULL) && (addr != requested_addr)) {
+		/* We lost the race for the requested_addr. */
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, release)", addr);
+
+		rte_errno = EADDRNOTAVAIL;
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	HANDLE process;
+	void *stub;
+	DWORD flags;
+
+	process = GetCurrentProcess();
+
+	/* Hugepages cannot be decommitted on Windows,
+	 * so free them and replace the block with a placeholder.
+	 * There is a race for VA in this block until the VirtualAlloc2() call.
+	 */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	flags = MEM_RESERVE | MEM_RESERVE_PLACEHOLDER;
+	stub = VirtualAlloc2(
+		process, addr, size, flags, PAGE_NOACCESS, NULL, 0);
+	if (stub == NULL) {
+		/* We lost the race for the VA: the region was already
+		 * released above, so there is nothing left to free here.
+		 */
+		rte_errno = EADDRNOTAVAIL;
+		return -1;
+	}
+
+	/* No need to join reserved regions adjacent to the freed one:
+	 * eal_mem_commit() will just pick up the page-size placeholder
+	 * created here.
+	 */
+	return 0;
+}
+
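+/* Illustrative lifecycle of a hugepage-backed region using the
+ * reserve/commit/decommit helpers above (error handling omitted;
+ * "sz" is assumed to be the size of one hugepage):
+ *
+ *	void *va = eal_mem_reserve(NULL, sz, 0);     placeholder created
+ *	va = eal_mem_commit(va, sz, SOCKET_ID_ANY);  hugepage committed
+ *	eal_mem_decommit(va, sz);                    placeholder restored
+ */
+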
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if region must be in reserved state but it is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	HANDLE process;
+
+	process = GetCurrentProcess();
+
+	if (VirtualQueryEx(
+			process, addr, &info, sizeof(info)) != sizeof(info)) {
+		RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", addr);
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFreeEx(process, addr, size,
+			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR(
+			"VirtualFreeEx(%p, %zu, preserve placeholder)",
+			addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	RTE_SET_USED(virt);
+	RTE_SET_USED(size);
+	RTE_SET_USED(dump);
+
+	/* Windows does not dump reserved memory by default.
+	 *
+	 * There is <werapi.h> to include or exclude regions from the dump,
+	 * but this is not currently required by EAL.
+	 */
+
+	rte_errno = ENOTSUP;
+	return -1;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* There is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (!virt) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!UnmapViewOfFile(virt))
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		virt = NULL;
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
+	if (!UnmapViewOfFile(virt)) {
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* Windows strategy for memory allocation is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless user provides an address hint.
+	 */
+	return 0;
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static SYSTEM_INFO info;
+
+	if (info.dwPageSize == 0)
+		GetSystemInfo(&info);
+
+	return info.dwPageSize;
+}
+
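+/* Example (illustrative): round a length up to a page boundary with
+ * the existing RTE_ALIGN_CEIL() macro; "len" is an assumed variable:
+ *
+ *	size_t aligned = RTE_ALIGN_CEIL(len, rte_mem_page_size());
+ */
+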
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes `void*`, work around compiler warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock(%p %#zx)", virt, size);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		EAL_LOG_NOT_IMPLEMENTED();
+		return -1;
+	}
+
+	return eal_dynmem_memseg_lists_init();
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs;
+	uint64_t mem_sz, page_sz;
+	void *addr;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	msl = &mcfg->memsegs[0];
+
+	mem_sz = internal_config.memory;
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = mem_sz / page_sz;
+
+	if (eal_memseg_list_init_named(
+			msl, "nohugemem", page_sz, n_segs, 0, true)) {
+		return -1;
+	}
+
+	addr = VirtualAlloc(
+		NULL, mem_sz, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (addr == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc(size=%#zx)", mem_sz);
+		RTE_LOG(ERR, EAL, "Cannot allocate memory\n");
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	eal_memseg_list_populate(msl, addr, n_segs);
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_dynmem_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, the stub must log a warning, and
+ * a comment must document what requires the success emulation.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Success is emulated: common malloc heap initialization
+	 * depends on this function succeeding.
+	 */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* Common memory allocator depends on this function succeeding. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index 390d2fd66..caabffedf 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  */
@@ -36,4 +52,63 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with eal_mem_reserve()
+ * or decommitted from hugepages by eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and rte_errno is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit, must be the size of a page
+ *  (hugepage or regular one).
+ *
+ * The *addr* and *size* must match location and size
+ * of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..cb10d6494 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -14,6 +14,7 @@
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -36,6 +37,9 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
@@ -46,6 +50,7 @@ extern "C" {
 typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
+
 static inline int
 asprintf(char **buffer, const char *format, ...)
 {
@@ -72,6 +77,18 @@ asprintf(char **buffer, const char *format, ...)
 	}
 	return ret;
 }
+
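+/* Note: the result points to a static buffer, so it is only valid
+ * until the next call and is not thread-safe; this matches the
+ * weakest guarantees of POSIX strerror().
+ */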
+static inline const char *
+eal_strerror(int code)
+{
+	static char buffer[128];
+
+	strerror_s(buffer, sizeof(buffer), code);
+	return buffer;
+}
+
+#define strerror eal_strerror
+
 #endif /* RTE_TOOLCHAIN_GCC */
 
 #ifdef __cplusplus
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
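+
+/* Illustrative user-mode invocation of the IOCTL for an input pointer
+ * "virt" ("device" is an open handle to this interface; error handling
+ * omitted for brevity):
+ *
+ *	LARGE_INTEGER phys;
+ *	DWORD bytes;
+ *	DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
+ *		&virt, sizeof(virt), &phys, sizeof(phys), &bytes, NULL);
+ */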
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 52978e9d7..ded5a2b80 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,10 +6,16 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_file.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
+	'eal_memalloc.c',
+	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'fnmatch.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-10 10:26               ` Burakov, Anatoly
@ 2020-06-10 14:31                 ` Dmitry Kozlyuk
  2020-06-10 15:48                   ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 14:31 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On Wed, 10 Jun 2020 11:26:22 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

[snip]
> >>> +	addr = eal_get_virtual_area(
> >>> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> >>> +	if (addr == NULL) {
> >>> +#ifndef RTE_EXEC_ENV_WINDOWS
> >>> +		/* The hint would be misleading on Windows, but this function
> >>> +		 * is called from many places, including common code,
> >>> +		 * so don't duplicate the message.
> >>> +		 */
> >>> +		if (rte_errno == EADDRNOTAVAIL)
> >>> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> >>> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> >>> +				(unsigned long long)mem_sz, msl->base_va);
> >>> +		else
> >>> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> >>> +#endif  
> >>
> >> You're left without any error messages on Windows. How about:
> >>
> >> const char *err_str = "Cannot reserve memory\n";
> >> #ifndef RTE_EXEC_ENV_WINDOWS
> >> if (rte_errno == EADDRNOTAVAIL)
> >>      err_str = ...
> >> #endif
> >> RTE_LOG(ERR, EAL, err_str);
> >>
> >> or something like that?
> >>  
> > 
> > How about removing generic error message here completely and printing more
> > specific messages at call sites? In fact, almost all of them already do this.
> > It would be more helpful in tracking down errors.
> >   
> 
> Agreed, let's do that :) We do pass up the rte_errno, correct? So, we 
> should be able to do that.

Actually, callers don't need rte_errno, because we only have to distinguish
EADDRNOTAVAIL here, and eal_get_virtual_area() already prints precise
diagnostics at WARNING and ERR level. rte_errno is preserved, however.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-10 14:31                 ` Dmitry Kozlyuk
@ 2020-06-10 15:48                   ` Burakov, Anatoly
  2020-06-10 16:39                     ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Burakov, Anatoly @ 2020-06-10 15:48 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 10-Jun-20 3:31 PM, Dmitry Kozlyuk wrote:
> On Wed, 10 Jun 2020 11:26:22 +0100
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
> 
> [snip]
>>>>> +	addr = eal_get_virtual_area(
>>>>> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
>>>>> +	if (addr == NULL) {
>>>>> +#ifndef RTE_EXEC_ENV_WINDOWS
>>>>> +		/* The hint would be misleading on Windows, but this function
>>>>> +		 * is called from many places, including common code,
>>>>> +		 * so don't duplicate the message.
>>>>> +		 */
>>>>> +		if (rte_errno == EADDRNOTAVAIL)
>>>>> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
>>>>> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
>>>>> +				(unsigned long long)mem_sz, msl->base_va);
>>>>> +		else
>>>>> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
>>>>> +#endif
>>>>
>>>> You're left without any error messages on Windows. How about:
>>>>
>>>> const char *err_str = "Cannot reserve memory\n";
>>>> #ifndef RTE_EXEC_ENV_WINDOWS
>>>> if (rte_errno == EADDRNOTAVAIL)
>>>>       err_str = ...
>>>> #endif
>>>> RTE_LOG(ERR, EAL, err_str);
>>>>
>>>> or something like that?
>>>>   
>>>
>>> How about removing generic error message here completely and printing more
>>> specific messages at call sites? In fact, almost all of them already do this.
>>> It would be more helpful in tracking down errors.
>>>    
>>
>> Agreed, let's do that :) We do pass up the rte_errno, correct? So, we
>> should be able to do that.
> 
> Actually, callers don't need rte_errno, because we only have to distinguish
> EADDRNOTAVAIL here, and eal_get_virtual_area() already prints precise
> diagnostics at WARNING and ERR level. rte_errno is preserved, however.
> 

Not sure i agree, we still need the "--base-virtaddr" hint, and we can 
only do that from the caller (without #ifdef-ery here), so we do need 
rte_errno for that.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-10 15:48                   ` Burakov, Anatoly
@ 2020-06-10 16:39                     ` Dmitry Kozlyuk
  2020-06-11  8:59                       ` Burakov, Anatoly
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-10 16:39 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On Wed, 10 Jun 2020 16:48:58 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> On 10-Jun-20 3:31 PM, Dmitry Kozlyuk wrote:
> > On Wed, 10 Jun 2020 11:26:22 +0100
> > "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
> > 
> > [snip]  
> >>>>> +	addr = eal_get_virtual_area(
> >>>>> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
> >>>>> +	if (addr == NULL) {
> >>>>> +#ifndef RTE_EXEC_ENV_WINDOWS
> >>>>> +		/* The hint would be misleading on Windows, but this function
> >>>>> +		 * is called from many places, including common code,
> >>>>> +		 * so don't duplicate the message.
> >>>>> +		 */
> >>>>> +		if (rte_errno == EADDRNOTAVAIL)
> >>>>> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> >>>>> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> >>>>> +				(unsigned long long)mem_sz, msl->base_va);
> >>>>> +		else
> >>>>> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> >>>>> +#endif  
> >>>>
> >>>> You're left without any error messages on Windows. How about:
> >>>>
> >>>> const char *err_str = "Cannot reserve memory\n";
> >>>> #ifndef RTE_EXEC_ENV_WINDOWS
> >>>> if (rte_errno == EADDRNOTAVAIL)
> >>>>       err_str = ...
> >>>> #endif
> >>>> RTE_LOG(ERR, EAL, err_str);
> >>>>
> >>>> or something like that?
> >>>>     
> >>>
> >>> How about removing generic error message here completely and printing more
> >>> specific messages at call sites? In fact, almost all of them already do this.
> >>> It would be more helpful in tracking down errors.
> >>>      
> >>
> >> Agreed, let's do that :) We do pass up the rte_errno, correct? So, we
> >> should be able to do that.  
> > 
> > Actually, callers don't need rte_errno, because we only have to distinguish
> > EADDRNOTAVAIL here, and eal_get_virtual_area() already prints precise
> > diagnostics at WARNING and ERR level. rte_errno is preserved, however.
> >   
> 
> Not sure i agree, we still need the "--base-virtaddr" hint, and we can 
> only do that from the caller (without #ifdef-ery here), so we do need 
> rte_errno for that.

Maybe we're talking about different things. The "--base-virtaddr" hint is
printed from eal_memseg_list_alloc() on Unices for EADDRNOTAVAIL.
This is handy to avoid duplicating the hint and to provide context, so let's
keep it despite #ifndef.

Otherwise, a generic error is printed from the same function (mistakenly not
on Windows in v6). This generic error adds nothing to eal_get_virtual_area()
logs and also doesn't help to know which exact eal_memseg_list_alloc()
call failed. If instead callers printed their own messages, it would be clear
which call failed and in which context. The generic error can then be removed
and eal_memseg_list_alloc() code simplified. Callers can inspect rte_errno if
they ever need it, but really they don't, because the hint is printed by
eal_memseg_list_alloc(), and eal_get_virtual_area() prints even more precise
logs. This is what I did in v8.
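
For example (illustrative, not the exact v8 code), a caller would report
its own context and keep the rte_errno set by the allocator:

    if (eal_memseg_list_alloc(msl, 0) < 0) {
        RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
        return -1;
    }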

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-10 16:39                     ` Dmitry Kozlyuk
@ 2020-06-11  8:59                       ` Burakov, Anatoly
  0 siblings, 0 replies; 218+ messages in thread
From: Burakov, Anatoly @ 2020-06-11  8:59 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Bruce Richardson

On 10-Jun-20 5:39 PM, Dmitry Kozlyuk wrote:
> On Wed, 10 Jun 2020 16:48:58 +0100
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
> 
>> On 10-Jun-20 3:31 PM, Dmitry Kozlyuk wrote:
>>> On Wed, 10 Jun 2020 11:26:22 +0100
>>> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:
>>>
>>> [snip]
>>>>>>> +	addr = eal_get_virtual_area(
>>>>>>> +		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
>>>>>>> +	if (addr == NULL) {
>>>>>>> +#ifndef RTE_EXEC_ENV_WINDOWS
>>>>>>> +		/* The hint would be misleading on Windows, but this function
>>>>>>> +		 * is called from many places, including common code,
>>>>>>> +		 * so don't duplicate the message.
>>>>>>> +		 */
>>>>>>> +		if (rte_errno == EADDRNOTAVAIL)
>>>>>>> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
>>>>>>> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
>>>>>>> +				(unsigned long long)mem_sz, msl->base_va);
>>>>>>> +		else
>>>>>>> +			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
>>>>>>> +#endif
>>>>>>
>>>>>> You're left without any error messages on Windows. How about:
>>>>>>
>>>>>> const char *err_str = "Cannot reserve memory\n";
>>>>>> #ifndef RTE_EXEC_ENV_WINDOWS
>>>>>> if (rte_errno == EADDRNOTAVAIL)
>>>>>>        err_str = ...
>>>>>> #endif
>>>>>> RTE_LOG(ERR, EAL, err_str);
>>>>>>
>>>>>> or something like that?
>>>>>>      
>>>>>
>>>>> How about removing generic error message here completely and printing more
>>>>> specific messages at call sites? In fact, almost all of them already do this.
>>>>> It would be more helpful in tracking down errors.
>>>>>       
>>>>
>>>> Agreed, let's do that :) We do pass up the rte_errno, correct? So, we
>>>> should be able to do that.
>>>
>>> Actually, callers don't need rte_errno, because we only have to distinguish
>>> EADDRNOTAVAIL here, and eal_get_virtual_area() already prints precise
>>> diagnostics at WARNING and ERR level. rte_errno is preserved, however.
>>>    
>>
>> Not sure i agree, we still need the "--base-virtaddr" hint, and we can
>> only do that from the caller (without #ifdef-ery here), so we do need
>> rte_errno for that.
> 
> Maybe we're talking about different things. The "--base-virtaddr" hint is
> printed from eal_memseg_list_alloc() on Unices for EADDRNOTAVAIL.
> This is handy to avoid duplicating the hint and to provide context, so let's
> keep it despite #ifndef.
> 
> Otherwise, a generic error is printed from the same function (mistakenly not
> on Windows in v6). This generic error adds nothing to eal_get_virtual_area()
> logs and also doesn't help to know which exact eal_memseg_list_alloc()
> call failed. If instead callers printed their own messages, it would be clear
> which call failed and in which context. The generic error can then be removed
> and eal_memseg_list_alloc() code simplified. Callers can inspect rte_errno if
> they ever need it, but really they don't, because the hint is printed by
> eal_memseg_list_alloc(), and eal_get_virtual_area() prints even more precise
> logs. This is what I did in v8.
> 

Right, OK :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 02/11] eal: introduce internal wrappers for file operations
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-06-11 17:13               ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-11 17:13 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

10/06/2020 16:27, Dmitry Kozlyuk:
> Introduce OS-independent wrappers in order to support common EAL code
> on Unix and Windows:
> 
> * eal_file_open: open or create a file.
> * eal_file_lock: lock or unlock an open file.
> * eal_file_truncate: enforce a given size for an open file.
> 
> Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
> which is intended for common code between the two. These thin wrappers
> require no special maintenance.
[...]
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> +Unix shared files
> +F: lib/librte_eal/unix/

Can be moved in "EAL API and common code".




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 00/11] Windows basic memory management
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (10 preceding siblings ...)
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-06-11 17:29             ` Thomas Monjalon
  2020-06-12 22:00               ` Thomas Monjalon
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
  12 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-11 17:29 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, bruce.richardson

10/06/2020 16:27, Dmitry Kozlyuk:
> This patchset implements basic MM with the following features:

There are some compilation issues on FreeBSD and 32-bit Linux:
http://mails.dpdk.org/archives/test-report/2020-June/135764.html




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-06-12 10:47               ` Thomas Monjalon
  2020-06-12 13:44                 ` Dmitry Kozliuk
  0 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 10:47 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Ray Kinsella,
	Neil Horman

10/06/2020 16:27, Dmitry Kozlyuk:
> Introduce OS-independent wrappers for memory management operations used
> across DPDK and specifically in common code of EAL:
> 
> * rte_mem_map()
> * rte_mem_unmap()
> * rte_mem_page_size()
> * rte_mem_lock()
> 
> Windows uses different APIs for memory mapping and reservation, while
> Unices reserve memory by mapping it. Introduce EAL private functions to
> support memory reservation in common code:
> 
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
> 
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be safer and more
> expressive. New symbols are internal. Being thin wrappers, they require
> no special maintenance.

[...]
> +/**
> + * Reserve a region of virtual memory.
> + *
> + * Use eal_mem_free() to free reserved memory.
> + *
> + * @param requested_addr
> + *  A desired reservation addressm which must be page-aligned.

Typo: addressm

> + *  The system might not respect it.
> + *  NULL means the address will be chosen by the system.
> + * @param size
> + *  Reservation size. Must be a multiple of system page size.
> + * @param flags
> + *  Reservation options, a combination of eal_mem_reserve_flags.
> + * @returns
> + *  Starting address of the reserved area on success, NULL on failure.
> + *  Callers must not access this memory until remapping it.
> + */
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags);

[...]
> +/**
> + * Configure memory region inclusion into core dumps.

Not sure about the word "core" here.

> + *
> + * @param virt
> + *  Starting address of the region.
> + * @param size
> + *  Size of the region.
> + * @param dump
> + *  True to include memory into core dumps, false to exclude.
> + * @return
> + *  0 on success, (-1) on failure and rte_errno is set.
> + */
> +int
> +eal_mem_set_dump(void *virt, size_t size, bool dump);

[...]
> --- /dev/null
> +++ b/lib/librte_eal/include/rte_eal_memory.h
> @@ -0,0 +1,93 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +
> +/** @file Mamory management wrappers used across DPDK. */

typo on "Mamory"
+
@file must be on a separate line:

/** @file
 *
 * Memory management wrappers used across DPDK.
 */

> +
> +/** Memory protection flags. */
> +enum rte_mem_prot {
> +	RTE_PROT_READ = 1 << 0,   /**< Read access. */
> +	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
> +	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
> +};
> +
> +/** Addtional flags for memory mapping. */

Typo on "Addtional"

> +enum rte_map_flags {
> +	/** Changes to the mapped memory are visible to other processes. */
> +	RTE_MAP_SHARED = 1 << 0,
> +	/** Mapping is not backed by a regular file. */
> +	RTE_MAP_ANONYMOUS = 1 << 1,
> +	/** Copy-on-write mapping, changes are invisible to other processes. */
> +	RTE_MAP_PRIVATE = 1 << 2,
> +	/**
> +	 * Force mapping to the requested address. This flag should be used
> +	 * with caution, because to fulfill the request implementation
> +	 * may remove all other mappings in the requested region. However,
> +	 * it is not required to do so, thus mapping with this flag may fail.
> +	 */
> +	RTE_MAP_FORCE_ADDRESS = 1 << 3
> +};

[...]
> +INTERNAL {
> +	global:
> +
> +	rte_mem_lock;
> +	rte_mem_map;
> +	rte_mem_page_size;
> +	rte_mem_unmap;
> +};

Not sure why these functions are internal.
They may be useful for DPDK applications.
We would need to add the file in doxygen index.

If we want to keep them internal, we should add a doxygen marker
@internal.

> +#include <rte_eal_memory.h>

I think we should find a better file name for these wrappers.
"EAL memory" means DPDK memory allocator in my mind.
We need a file name which is about OS-independent wrappers,
or libc wrappers.
What about rte_libc_mem.h? rte_mem_os.h? something else?



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers
  2020-06-12 10:47               ` Thomas Monjalon
@ 2020-06-12 13:44                 ` Dmitry Kozliuk
  2020-06-12 13:54                   ` Thomas Monjalon
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozliuk @ 2020-06-12 13:44 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dpdk-dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Ray Kinsella,
	Neil Horman

> [...]
> > +INTERNAL {
> > +     global:
> > +
> > +     rte_mem_lock;
> > +     rte_mem_map;
> > +     rte_mem_page_size;
> > +     rte_mem_unmap;
> > +};
>
> Not sure why these functions are internal.
> They may be useful for DPDK applications.
> We would need to add the file in doxygen index.
>

Not sure if they are in DPDK scope, apart from rte_mem_lock, which
generalizes rte_mem_lock_page already in rte_memory.h. What may be typical
use cases for data-plane apps? I can see testpmd using mmap for allocating
external memory (because of possible use of hugepages), does it need these
functions exposed?


> If we want to keep them internal, we should add a doxygen marker
> @internal.
>

IIRC, it was you who proposed making them internal instead of
experimental. And internal symbols can always be exposed later.



> > +#include <rte_eal_memory.h>
>
> I think we should find a better file name for these wrappers.
> "EAL memory" means DPDK memory allocator in my mind.
> We need a file name which is about OS-independent wrappers,
> or libc wrappers.
> What about rte_libc_mem.h? rte_mem_os.h? something else?


See above, but anyway, "libc" is non-generic.

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers
  2020-06-12 13:44                 ` Dmitry Kozliuk
@ 2020-06-12 13:54                   ` Thomas Monjalon
  2020-06-12 20:24                     ` Dmitry Kozliuk
  0 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 13:54 UTC (permalink / raw)
  To: Dmitry Kozliuk
  Cc: dpdk-dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Ray Kinsella,
	Neil Horman

12/06/2020 15:44, Dmitry Kozliuk:
> > [...]
> > > +INTERNAL {
> > > +     global:
> > > +
> > > +     rte_mem_lock;
> > > +     rte_mem_map;
> > > +     rte_mem_page_size;
> > > +     rte_mem_unmap;
> > > +};
> >
> > Not sure why these functions are internal.
> > They may be useful for DPDK applications.
> > We would need to add the file in doxygen index.
> 
> Not sure if they are in DPDK scope, apart from rte_mem_lock, which
> generalizes rte_mem_lock_page already in rte_memory.h. What may be typical
> use cases for data-plane apps? I can see testpmd using mmap for allocating
> external memory (because of possible use of hugepages), does it need these
> functions exposed?

There is a chance the application needs such functions
for another part of its dataplane.

> > If we want to keep them internal, we should add a doxygen marker
> > @internal.
> 
> IIRC, it was you who proposed making them internal instead of
> experimental. And internal symbols can always be exposed later.

Then they can be exposed later.
I think it's good to start internal.
Please add the @internal tag in doxygen to make the status clear.
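
For example:

/**
 * @internal
 * <description of rte_mem_map() etc. kept as it is>
 */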

> > > +#include <rte_eal_memory.h>
> >
> > I think we should find a better file name for these wrappers.
> > "EAL memory" means DPDK memory allocator in my mind.
> > We need a file name which is about OS-independent wrappers,
> > or libc wrappers.
> > What about rte_libc_mem.h? rte_mem_os.h? something else?
> 
> See above, but anyway, "libc" is non-generic.

Why is libc not generic?

Which file name can it be?



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 04/11] eal/mem: extract common code for memseg list initialization
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-12 15:39               ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 15:39 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

10/06/2020 16:27, Dmitry Kozlyuk:
> +eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
> +		uint64_t page_sz, int n_segs, int socket_id, bool heap)
> +{
> +	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
> +			sizeof(struct rte_memseg))) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
> +			rte_strerror(rte_errno));
> +		return -1;
> +	}
> +
> +	msl->page_sz = page_sz;
> +	msl->socket_id = socket_id;
> +	msl->base_va = NULL;
> +	msl->heap = heap;
> +
> +	RTE_LOG(DEBUG, EAL,
> +		"Memseg list allocated at socket %i, page size 0x%zxkB\n",
> +		socket_id, (size_t)page_sz >> 10);

page_sz is uint64_t, so the right printf specifier is PRIx64.
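
For example (untested, needs <inttypes.h>):

RTE_LOG(DEBUG, EAL,
	"Memseg list allocated at socket %i, page size 0x%"PRIx64"kB\n",
	socket_id, page_sz >> 10);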

[...]
> +#ifndef RTE_EXEC_ENV_WINDOWS
> +		/* The hint would be misleading on Windows, but this function

Would be better to explain the reason of misleading
in this comment.

> +		 * is called from many places, including common code,
> +		 * so don't duplicate the message.
> +		 */
> +		if (rte_errno == EADDRNOTAVAIL)
> +			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
> +				"please use '--" OPT_BASE_VIRTADDR "' option\n",
> +				(unsigned long long)mem_sz, msl->base_va);
> +#endif

[...]
> +			if (memseg_list_alloc(msl) < 0) {
> +				RTE_LOG(ERR, EAL,
> +					"Cannot preallocate %zukB hugepages\n",
> +					page_sz >> 10);

page_sz is uint64_t, so the right printf specifier is PRIx64.




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers
  2020-06-12 13:54                   ` Thomas Monjalon
@ 2020-06-12 20:24                     ` Dmitry Kozliuk
  2020-06-12 21:37                       ` Thomas Monjalon
  0 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozliuk @ 2020-06-12 20:24 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dpdk-dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Ray Kinsella,
	Neil Horman

> > Not sure if they are in DPDK scope, apart from rte_mem_lock, which
> > generalizes rte_mem_lock_page already in rte_memory.h. What may be typical
> > use cases for data-plane apps? I can see testpmd using mmap for allocating
> > external memory (because of possible use of hugepages), does it need these
> > functions exposed?
>
> There is a chance the application needs such functions
> for another part of its dataplane.


Such reasoning can justify any API. DPDK needs compatibility layers, but
collecting and providing them is not the goal of the kit. I can only think
of using mmap in the data plane as high-performance IPC with a non-DPDK
management app.
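
For illustration, such an app could do something like the sketch below
(fd is assumed to be a descriptor of a file shared with the manager):

static void *
map_ctrl_region(int fd)
{
	size_t len = rte_mem_page_size();

	/* Returns NULL on failure with rte_errno set. */
	return rte_mem_map(NULL, len, RTE_PROT_READ | RTE_PROT_WRITE,
			RTE_MAP_SHARED, fd, 0);
}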

> Please add the @internal tag in doxygen to make the status clear.

OK.


> > > > +#include <rte_eal_memory.h>
> > >
> > > I think we should find a better file name for these wrappers.
> > > "EAL memory" means DPDK memory allocator in my mind.
> > > We need a file name which is about OS-independent wrappers,
> > > or libc wrappers.
> > > What about rte_libc_mem.h? rte_mem_os.h? something else?
> >
> > See above, but anyway, "libc" is non-generic.
>
> Why is libc not generic?
>

It means nothing on Windows. Also, mmap has little to do with libc.

> Which file name can it be?

Your rte_mem_os.h sounds good, except internal header better be
rte_eal_mem_os.h. Alternative: rte_eal_paging.h.


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers
  2020-06-12 20:24                     ` Dmitry Kozliuk
@ 2020-06-12 21:37                       ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 21:37 UTC (permalink / raw)
  To: Dmitry Kozliuk
  Cc: dpdk-dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Ray Kinsella,
	Neil Horman

12/06/2020 22:24, Dmitry Kozliuk:
> > > > > +#include <rte_eal_memory.h>
> > > >
> > > > I think we should find a better file name for these wrappers.
> > > > "EAL memory" means DPDK memory allocator in my mind.
> > > > We need a file name which is about OS-independent wrappers,
> > > > or libc wrappers.
> > > > What about rte_libc_mem.h? rte_mem_os.h? something else?
> > >
> > > See above, but anyway, "libc" is non-generic.
> >
> > Why is libc not generic?
> >
> 
> It means nothing on Windows. Also, mmap has little to do with libc.
> 
> > Which file name can it be?
> 
> Your rte_mem_os.h sounds good, except internal header better be
> rte_eal_mem_os.h. Alternative: rte_eal_paging.h.

I don't think we need to use the prefix "eal" if we already have "rte_mem".
I have no strong opinion.
rte_eal_paging also looks good.
Anyway, it is internal, so we could change it if needed.
Please choose one :)





^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-06-12 21:45               ` Thomas Monjalon
  2020-06-12 22:09               ` Thomas Monjalon
  1 sibling, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 21:45 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

10/06/2020 16:27, Dmitry Kozlyuk:
> 1. Map CPU cores to their respective NUMA nodes as reported by system.
> 2. Support systems with more than 64 cores (multiple processor groups).
> 3. Fix magic constants, styling issues, and compiler warnings.
> 4. Add EAL private function to map DPDK socket ID to NUMA node number.
> 
> Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")

Not sure we should consider it as a fix.
This tag is usually required for backporting,
but I don't see any need for backport here.
If you think backport is required, you should add
Cc: stable@dpdk.org.




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 10/11] eal/windows: initialize hugepage info
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-06-12 21:55               ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 21:55 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic

10/06/2020 16:27, Dmitry Kozlyuk:
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> +Windows memory allocation
> +M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> +F: lib/librte_eal/eal_hugepages.c

Is it a typo? You mean lib/librte_eal/windows/eal_hugepages.c ?

[...]
> --- a/doc/guides/windows_gsg/build_dpdk.rst
> +++ b/doc/guides/windows_gsg/build_dpdk.rst
> -Run the helloworld example
> -==========================
> -
> -Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
> -
> -.. code-block:: console
> -
> -    cd C:\Users\me\dpdk\build\examples
> -    dpdk-helloworld.exe
> -    hello from core 1
> -    hello from core 3
> -    hello from core 0
> -    hello from core 2
> -
> -Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
> -by default. To run the example, either add toolchain executables directory
> -to the PATH or copy the library to the working directory.
> -Alternatively, static linking may be used (mind the LGPLv2.1 license).
[...]
> --- /dev/null
> +++ b/doc/guides/windows_gsg/run_apps.rst
> +Run the ``helloworld`` Example
> +------------------------------
> +
> +Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
> +
> +.. code-block:: console
> +
> +    cd C:\Users\me\dpdk\build\examples
> +    dpdk-helloworld.exe
> +    hello from core 1
> +    hello from core 3
> +    hello from core 0
> +    hello from core 2
> +
> +Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
> +by default. To run the example, either add toolchain executables directory
> +to the PATH or copy the library to the working directory.
> +Alternatively, static linking may be used (mind the LGPLv2.1 license).

I tend to think that such a move is better understood
in a separate prior patch.



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 00/11] Windows basic memory management
  2020-06-11 17:29             ` [dpdk-dev] [PATCH v8 00/11] Windows " Thomas Monjalon
@ 2020-06-12 22:00               ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 22:00 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, bruce.richardson, anatoly.burakov,
	david.marchand

11/06/2020 19:29, Thomas Monjalon:
> 10/06/2020 16:27, Dmitry Kozlyuk:
> > This patchset implements basic MM with the following features:
> 
> There are some compilation issues on FreeBSD and 32-bit Linux:
> http://mails.dpdk.org/archives/test-report/2020-June/135764.html

I did more comments about typos, naming, patch splitting, etc.

As soon as these comments are addressed in a v9, I think I can merge.
I did not see any public formal approval, but there is no objection,
so it looks good to go, and there are a lot of patches in the backlog
which depend on this series.

Thanks for all your work Dmitry.



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
  2020-06-12 21:45               ` Thomas Monjalon
@ 2020-06-12 22:09               ` Thomas Monjalon
  1 sibling, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 22:09 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, david.marchand, bruce.richardson

10/06/2020 16:27, Dmitry Kozlyuk:
> 1. Map CPU cores to their respective NUMA nodes as reported by system.
> 2. Support systems with more than 64 cores (multiple processor groups).
> 3. Fix magic constants, styling issues, and compiler warnings.
> 4. Add EAL private function to map DPDK socket ID to NUMA node number.
> 
> Fixes: 53ffd9f080fc ("eal/windows: add minimum viable code")
> 
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
> +eal_create_cpu_map(void)
> +	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
> +	DWORD infos_size;
> +	bool full = false;
> +
> +	infos_size = 0;
> +	if (!GetLogicalProcessorInformationEx(
> +			RelationNumaNode, NULL, &infos_size)) {
> +		DWORD error = GetLastError();
> +		if (error != ERROR_INSUFFICIENT_BUFFER) {
> +			rte_panic("cannot get NUMA node info size, error %lu",
> +				GetLastError());
> +		}
> +	}
> +
> +	infos = malloc(infos_size);
> +	if (infos == NULL) {
> +		rte_panic("cannot allocate memory for NUMA node information");
> +		return;
> +	}
> +
> +	if (!GetLogicalProcessorInformationEx(
> +			RelationNumaNode, infos, &infos_size)) {
> +		rte_panic("cannot get NUMA node information, error %lu",
> +			GetLastError());
> +	}

rte_panic addition is forbidden in the libraries.
An application may want to manage the error and shutdown
the DPDK part gracefully.
Please can you try to return an error to rte_eal_init()?
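
Something along these lines, perhaps (untested sketch to be placed in
eal_lcore.c; the helper name is made up):

static int
eal_query_numa_nodes(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX **infos)
{
	DWORD size = 0;

	if (!GetLogicalProcessorInformationEx(RelationNumaNode, NULL, &size) &&
			(GetLastError() != ERROR_INSUFFICIENT_BUFFER)) {
		RTE_LOG(ERR, EAL, "cannot get NUMA node info size, error %lu\n",
			GetLastError());
		return -1;
	}
	*infos = malloc(size);
	if (*infos == NULL) {
		RTE_LOG(ERR, EAL, "cannot allocate memory for NUMA node info\n");
		return -1;
	}
	if (!GetLogicalProcessorInformationEx(RelationNumaNode, *infos, &size)) {
		RTE_LOG(ERR, EAL, "cannot get NUMA node information, error %lu\n",
			GetLastError());
		free(*infos);
		return -1;
	}
	return 0;
}

Then eal_create_cpu_map() can propagate -1 up to rte_eal_init()
instead of panicking.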



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v8 11/11] eal/windows: implement basic memory management
  2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-06-12 22:12               ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-12 22:12 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon, John McNamara, Marko Kovacevic,
	Anatoly Burakov

2 typos below found with checkpatch.

10/06/2020 16:27, Dmitry Kozlyuk:
> +		/* May occcur when committing regular memory. */

Typo: occcur

> +	/* No need to join reserved regions adjascent to the freed one:

Typo: adjascent



^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 00/12] Windows basic memory management
  2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
                               ` (11 preceding siblings ...)
  2020-06-11 17:29             ` [dpdk-dev] [PATCH v8 00/11] Windows " Thomas Monjalon
@ 2020-06-15  0:43             ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 01/12] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
                                 ` (13 more replies)
  12 siblings, 14 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk

Note for v9:
rte_eal_memory.h renamed, dependent patchsets have to be updated.

This patchset implements basic MM with the following features:

* Hugepages are dynamically allocated in user-mode.
* Only 2MB hugepages are supported.
* IOVA is always PA, obtained through kernel-mode driver.
* No 32-bit support (presumably not demanded).
* No multi-process support (it is forcefully disabled).
* No-huge mode for testing without IOVA is available.

Testing revealed that Windows Server 2019 does not allow allocating
hugepage memory at a reserved address, despite the advertised API. So the
allocator has to temporarily free the region to be allocated. This creates
an inherent race condition. This issue is being discussed with Microsoft
privately.
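
In pseudo-code, the workaround looks roughly like this (illustration
only; error handling and exact flags are omitted):

/* The region was reserved earlier to keep its address stable. */
VirtualFreeEx(GetCurrentProcess(), addr, 0, MEM_RELEASE);
/* Race window: nothing holds [addr, addr + size) at this point. */
addr = VirtualAlloc2(GetCurrentProcess(), addr, size,
	MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE,
	NULL, 0);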

New EAL public functions for memory mapping are introduced to mitigate
OS differences in DPDK libraries and applications: rte_mem_map,
rte_mem_unmap, rte_mem_lock, rte_mem_page_size.

To support common MM routines, internal wrappers for low-level memory
reservation and file management are introduced. These changes affect
Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
(suggested by Thomas).

To avoid code duplication between Linux and Windows EAL, common code
for EALs supporting dynamic memory allocation is extracted
(discussed with Anatoly Burakov in v4 thread). This is a separate
patch to ease the review, but it can be merged with the previous one.

EAL tracepoints save size_t values as long, which is invalid on Windows.
New size_t emitter for tracepoints is introduced (suggested by Jerin
Jacob to Fady Bader, see [1]). Also, to avoid a workaround in every file
using the tracepoints, stubs are added to Windows EAL.
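
For example, a tracepoint that used to emit a size_t value as long can
now use the dedicated emitter (name assumed to follow the existing scheme):

rte_trace_point_emit_size_t(size);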

Entire <sys/queue.h> is imported from FreeBSD, replacing existing
partial import. There is already a license exception for this file.
The file is imported as-is, so it causes a bunch of checkpatch warnings.

[1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html

---

v9:
    * Fix build on 32-bit and FreeBSD.
    * Rename rte_eal_memory.h to rte_eal_paging.h.
    * Do not use rte_panic() in library code.
    * Fix typos, comments, string formatting.
    * Split documentation commits.

v8:
    * Log eal_memseg_list_alloc() failure at caller sites (Anatoly Burakov).

v7:
    * Change EAL internal file management API (Neil Horman).

v6:
    * Fix 32-bit build on x86 (CI).
    * Fix Makefile build (Anatoly Burakov, Thomas Monjalon).
    * Restore 32-bit common code (Anatoly Burakov).
    * Fix error reporting in memory management (Anatoly Burakov).
    * Add Doxygen comment for size_t tracepoint emitter (Jerin Jacob).
    * Update MAINTAINERS for new files and new code (Thomas Monjalon).
    * Rename rte_get_page_size to rte_mem_page_size.
    * Mark DPDK-only wrappers internal, move them to separate file.
    * Get rid of warnings in enabled common code with Clang on Windows.

v5:
    * Fix allocation and deallocation on Windows Server (Fady Bader).
    * Replace remaining VirtualFree with VirtualFreeEx (Ranjit Menon).
    * Fix errors in eal_get_virtual_area (Anatoly Burakov).
    * Fix error handling and documentation for rte_mem_lock (Anatoly Burakov).
    * Extract common code for EALs w/dynamic allocation (Anatoly Burakov).
    * Use POSIX value for rte_errno after rte_mem_unmap() on Windows.
    * Add stubs to use tracing functions without workarounds.

v4:
    * Rebase on ToT, drop patches merged into master.
    * Rearrange patches to split Windows code (Jerin).
    * Fix Linux and FreeBSD build with make (Ophir).
    * Use int instead of enum to hold a set of flags (Anatoly).
    * Rename eal_mem_reserve items and fix their description (Anatoly).
    * Add eal_mem_set_dump() wrapper around madvise (Anatoly).
    * Don't claim Windows Server 2016 support due to lack of API (Tal).
    * Replace enum rte_page_sizes with a set of #defines (Jerin).
    * Fix documentation, SPDX tags, logging (Thomas).

v3:
    * Fix Linux build on aarch64 and 32-bit x86 (reported by CI).
    * Fix logic and error handling while allocating segments.
    * Fix Unix rte_mem_map(): return NULL on failure.
    * Fix some checkpatch.sh issues:
        * Do not return positive errno, use DWORD for GetLastError().
        * Make dpdk-kmods source files non-executable.
    * Improve GSG for Windows Server (suggested by Ranjit Menon).

v2:
    * Rebase on ToT. Move all new code shared between Linux and FreeBSD
      to /unix/ subdirectory, also factor out some existing code there.
    * Improve description of Clang issue with rte_page_sizes on Windows.
      Restore -fstrict-enum for EAL. Check running, not target compiler.
    * Use EAL prefix for private facilities instead of RTE.
    * Improve documentation comments for new functions.
    * Remove co-installer for virt2phys. Add a typecast for clarity.
    * Document virt2phys in user guide, improve its own README.
    * Explicitly and forcefully disable multi-process.

Dmitry Kozlyuk (12):
  eal: replace rte_page_sizes with a set of constants
  eal: introduce internal wrappers for file operations
  eal: introduce memory management wrappers
  eal/mem: extract common code for memseg list initialization
  eal/mem: extract common code for dynamic memory allocation
  trace: add size_t field emitter
  eal/windows: add tracing support stubs
  eal/windows: replace sys/queue.h with a complete one from FreeBSD
  eal/windows: improve CPU and NUMA node detection
  doc/windows: split build and run instructions
  eal/windows: initialize hugepage info
  eal/windows: implement basic memory management

 MAINTAINERS                                   |   7 +
 config/meson.build                            |  12 +-
 doc/guides/rel_notes/release_20_08.rst        |   2 +
 doc/guides/windows_gsg/build_dpdk.rst         |  20 -
 doc/guides/windows_gsg/index.rst              |   1 +
 doc/guides/windows_gsg/run_apps.rst           |  95 +++
 lib/librte_eal/common/eal_common_dynmem.c     | 521 +++++++++++++
 lib/librte_eal/common/eal_common_fbarray.c    |  71 +-
 lib/librte_eal/common/eal_common_memory.c     | 157 +++-
 lib/librte_eal/common/eal_common_thread.c     |   5 +-
 lib/librte_eal/common/eal_private.h           | 254 ++++++-
 lib/librte_eal/common/meson.build             |  16 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/freebsd/Makefile               |   5 +
 lib/librte_eal/freebsd/eal_memory.c           | 102 +--
 lib/librte_eal/include/rte_eal_paging.h       |  98 +++
 lib/librte_eal/include/rte_eal_trace.h        |   8 +-
 lib/librte_eal/include/rte_memory.h           |  23 +-
 lib/librte_eal/include/rte_trace_point.h      |   3 +
 lib/librte_eal/linux/Makefile                 |   6 +
 lib/librte_eal/linux/eal_memalloc.c           |   5 +-
 lib/librte_eal/linux/eal_memory.c             | 617 +--------------
 lib/librte_eal/meson.build                    |   4 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/rte_eal_version.map            |   9 +
 lib/librte_eal/unix/eal_file.c                |  80 ++
 lib/librte_eal/unix/eal_unix_memory.c         | 152 ++++
 lib/librte_eal/unix/meson.build               |   7 +
 lib/librte_eal/windows/eal.c                  | 114 ++-
 lib/librte_eal/windows/eal_file.c             | 125 +++
 lib/librte_eal/windows/eal_hugepages.c        | 108 +++
 lib/librte_eal/windows/eal_lcore.c            | 205 +++--
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  90 ++-
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/sys/queue.h    | 663 ++++++++++++++--
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   7 +
 lib/librte_mempool/rte_mempool_trace.h        |  10 +-
 44 files changed, 4088 insertions(+), 945 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c
 create mode 100644 lib/librte_eal/include/rte_eal_paging.h
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
 create mode 100644 lib/librte_eal/unix/meson.build
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 01/12] eal: replace rte_page_sizes with a set of constants
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 02/12] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
                                 ` (12 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, John McNamara,
	Marko Kovacevic, Anatoly Burakov

Clang on Windows follows MS ABI where enum values are limited to 2^31-1.
Enum rte_page_sizes has members valued above this limit, which get
wrapped to zero, resulting in compilation error (duplicate values in
enum). Using MS ABI is mandatory for Windows EAL to call Win32 APIs.

Remove rte_page_sizes and replace its values with #define's.
This enumeration is not used in public API, so there's no ABI breakage.
Announce API changes for 20.08 in documentation.

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
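
To illustrate the wrap-around (not part of the change): under the MS ABI
every enumerator must fit in int, so

enum rte_page_sizes {
	RTE_PGSIZE_4G  = 1ULL << 32, /* truncated to 0 */
	RTE_PGSIZE_16G = 1ULL << 34  /* also 0, hence "duplicate value" */
};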
 doc/guides/rel_notes/release_20_08.rst |  2 ++
 lib/librte_eal/include/rte_memory.h    | 23 ++++++++++-------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index dee4ccbb5..86d240213 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -91,6 +91,8 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* ``rte_page_sizes`` enumeration is replaced with ``RTE_PGSIZE_xxx`` defines.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_eal/include/rte_memory.h b/lib/librte_eal/include/rte_memory.h
index 3d8d0bd69..65374d53a 100644
--- a/lib/librte_eal/include/rte_memory.h
+++ b/lib/librte_eal/include/rte_memory.h
@@ -24,19 +24,16 @@ extern "C" {
 #include <rte_config.h>
 #include <rte_fbarray.h>
 
-__extension__
-enum rte_page_sizes {
-	RTE_PGSIZE_4K    = 1ULL << 12,
-	RTE_PGSIZE_64K   = 1ULL << 16,
-	RTE_PGSIZE_256K  = 1ULL << 18,
-	RTE_PGSIZE_2M    = 1ULL << 21,
-	RTE_PGSIZE_16M   = 1ULL << 24,
-	RTE_PGSIZE_256M  = 1ULL << 28,
-	RTE_PGSIZE_512M  = 1ULL << 29,
-	RTE_PGSIZE_1G    = 1ULL << 30,
-	RTE_PGSIZE_4G    = 1ULL << 32,
-	RTE_PGSIZE_16G   = 1ULL << 34,
-};
+#define RTE_PGSIZE_4K   (1ULL << 12)
+#define RTE_PGSIZE_64K  (1ULL << 16)
+#define RTE_PGSIZE_256K (1ULL << 18)
+#define RTE_PGSIZE_2M   (1ULL << 21)
+#define RTE_PGSIZE_16M  (1ULL << 24)
+#define RTE_PGSIZE_256M (1ULL << 28)
+#define RTE_PGSIZE_512M (1ULL << 29)
+#define RTE_PGSIZE_1G   (1ULL << 30)
+#define RTE_PGSIZE_4G   (1ULL << 32)
+#define RTE_PGSIZE_16G  (1ULL << 34)
 
 #define SOCKET_ID_ANY -1                    /**< Any NUMA socket. */
 
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 02/12] eal: introduce internal wrappers for file operations
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 01/12] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers Dmitry Kozlyuk
                                 ` (11 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Introduce OS-independent wrappers in order to support common EAL code
on Unix and Windows:

* eal_file_open: open or create a file.
* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.

Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
which is intended for common code between the two. These thin wrappers
require no special maintenance.

Common code supporting multi-process doesn't use the new wrappers,
because it is inherently Unix-specific and would impose excessive
requirements on the wrappers.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
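
Usage sketch of the new wrappers (illustrative only, "path" and "size"
are placeholders; the fbarray changes below are the real call sites):

int fd = eal_file_open(path, EAL_OPEN_CREATE | EAL_OPEN_READWRITE);
if (fd < 0)
	return -1; /* rte_errno is set by the wrapper */
/* Take an exclusive lock, failing at once instead of waiting. */
if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN) < 0) {
	close(fd);
	return -1;
}
/* Extend the file to the requested size. */
if (eal_file_truncate(fd, size) < 0) {
	close(fd);
	return -1;
}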
 MAINTAINERS                                |  1 +
 lib/librte_eal/common/eal_common_fbarray.c | 31 ++++-----
 lib/librte_eal/common/eal_private.h        | 73 ++++++++++++++++++++
 lib/librte_eal/freebsd/Makefile            |  4 ++
 lib/librte_eal/linux/Makefile              |  4 ++
 lib/librte_eal/meson.build                 |  4 ++
 lib/librte_eal/unix/eal_file.c             | 80 ++++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |  6 ++
 8 files changed, 184 insertions(+), 19 deletions(-)
 create mode 100644 lib/librte_eal/unix/eal_file.c
 create mode 100644 lib/librte_eal/unix/meson.build

diff --git a/MAINTAINERS b/MAINTAINERS
index e739b87ea..4d162efd6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -170,6 +170,7 @@ EAL API and common code
 F: lib/librte_eal/common/
 F: lib/librte_eal/include/
 F: lib/librte_eal/rte_eal_version.map
+F: lib/librte_eal/unix/
 F: doc/guides/prog_guide/env_abstraction_layer.rst
 F: app/test/test_alarm.c
 F: app/test/test_atomic.c
diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 4f8f1af73..c52ddb967 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -8,8 +8,8 @@
 #include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
-#include <sys/file.h>
 #include <string.h>
+#include <unistd.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -85,10 +85,8 @@ resize_and_map(int fd, void *addr, size_t len)
 	char path[PATH_MAX];
 	void *map_addr;
 
-	if (ftruncate(fd, len)) {
+	if (eal_file_truncate(fd, len)) {
 		RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path);
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 
@@ -772,15 +770,15 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * and see if we succeed. If we don't, someone else is using it
 		 * already.
 		 */
-		fd = open(path, O_CREAT | O_RDWR, 0600);
+		fd = eal_file_open(path, EAL_OPEN_CREATE | EAL_OPEN_READWRITE);
 		if (fd < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't open %s: %s\n",
-					__func__, path, strerror(errno));
-			rte_errno = errno;
+				__func__, path, rte_strerror(rte_errno));
 			goto fail;
-		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
+		} else if (eal_file_lock(
+				fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
-					__func__, path, strerror(errno));
+				__func__, path, rte_strerror(rte_errno));
 			rte_errno = EBUSY;
 			goto fail;
 		}
@@ -789,10 +787,8 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		 * still attach to it, but no other process could reinitialize
 		 * it.
 		 */
-		if (flock(fd, LOCK_SH | LOCK_NB)) {
-			rte_errno = errno;
+		if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 			goto fail;
-		}
 
 		if (resize_and_map(fd, data, mmap_len))
 			goto fail;
@@ -888,17 +884,14 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 
 	eal_get_fbarray_path(path, sizeof(path), arr->name);
 
-	fd = open(path, O_RDWR);
+	fd = eal_file_open(path, EAL_OPEN_READWRITE);
 	if (fd < 0) {
-		rte_errno = errno;
 		goto fail;
 	}
 
 	/* lock the file, to let others know we're using it */
-	if (flock(fd, LOCK_SH | LOCK_NB)) {
-		rte_errno = errno;
+	if (eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN))
 		goto fail;
-	}
 
 	if (resize_and_map(fd, data, mmap_len))
 		goto fail;
@@ -1025,7 +1018,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		 * has been detached by all other processes
 		 */
 		fd = tmp->fd;
-		if (flock(fd, LOCK_EX | LOCK_NB)) {
+		if (eal_file_lock(fd, EAL_FLOCK_EXCLUSIVE, EAL_FLOCK_RETURN)) {
 			RTE_LOG(DEBUG, EAL, "Cannot destroy fbarray - another process is using it\n");
 			rte_errno = EBUSY;
 			ret = -1;
@@ -1042,7 +1035,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 			 * we're still holding an exclusive lock, so drop it to
 			 * shared.
 			 */
-			flock(fd, LOCK_SH | LOCK_NB);
+			eal_file_lock(fd, EAL_FLOCK_SHARED, EAL_FLOCK_RETURN);
 
 			ret = -1;
 			goto out;
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 869ce183a..6733a2321 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -420,4 +420,77 @@ eal_malloc_no_trace(const char *type, size_t size, unsigned int align);
 
 void eal_free_no_trace(void *addr);
 
+/** Options for eal_file_open(). */
+enum eal_open_flags {
+	/** Open file for reading. */
+	EAL_OPEN_READONLY = 0x00,
+	/** Open file for reading and writing. */
+	EAL_OPEN_READWRITE = 0x02,
+	/**
+	 * Create the file if it doesn't exist.
+	 * New files are only accessible to the owner (0600 equivalent).
+	 */
+	EAL_OPEN_CREATE = 0x04
+};
+
+/**
+ * Open or create a file.
+ *
+ * @param path
+ *  Path to the file.
+ * @param flags
+ *  A combination of eal_open_flags controlling operation and FD behavior.
+ * @return
+ *  Open file descriptor on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_file_open(const char *path, int flags);
+
+/** File locking operation. */
+enum eal_flock_op {
+	EAL_FLOCK_SHARED,    /**< Acquire a shared lock. */
+	EAL_FLOCK_EXCLUSIVE, /**< Acquire an exclusive lock. */
+	EAL_FLOCK_UNLOCK     /**< Release a previously taken lock. */
+};
+
+/** Behavior on file locking conflict. */
+enum eal_flock_mode {
+	EAL_FLOCK_WAIT,  /**< Wait until the file gets unlocked to lock it. */
+	EAL_FLOCK_RETURN /**< Return immediately if the file is locked. */
+};
+
+/**
+ * Lock or unlock the file.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX flock(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param op
+ *  Operation to perform.
+ * @param mode
+ *  Behavior on conflict.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
+
+/**
+ * Truncate or extend the file to the specified size.
+ *
+ * On failure @code rte_errno @endcode is set to the error code
+ * specified by POSIX ftruncate(3) description.
+ *
+ * @param fd
+ *  Opened file descriptor.
+ * @param size
+ *  Desired file size.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_file_truncate(int fd, ssize_t size);
+
 #endif /* _EAL_PRIVATE_H_ */
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index af95386d4..0f8741d96 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -74,6 +75,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_hypervisor.c
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 48cc34844..331489f99 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -7,6 +7,7 @@ LIB = librte_eal.a
 
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/unix
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
@@ -81,6 +82,9 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_random.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
+# from unix dir
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_hypervisor.c
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index e301f4558..8d492897d 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -6,6 +6,10 @@ subdir('include')
 
 subdir('common')
 
+if not is_windows
+	subdir('unix')
+endif
+
 dpdk_conf.set('RTE_EXEC_ENV_' + exec_env.to_upper(), 1)
 subdir(exec_env)
 
diff --git a/lib/librte_eal/unix/eal_file.c b/lib/librte_eal/unix/eal_file.c
new file mode 100644
index 000000000..1b26475ba
--- /dev/null
+++ b/lib/librte_eal/unix/eal_file.c
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <sys/file.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_errno.h>
+
+#include "eal_private.h"
+
+int
+eal_file_open(const char *path, int flags)
+{
+	static const int MODE_MASK = EAL_OPEN_READONLY | EAL_OPEN_READWRITE;
+
+	int ret, sys_flags;
+
+	switch (flags & MODE_MASK) {
+	case EAL_OPEN_READONLY:
+		sys_flags = O_RDONLY;
+		break;
+	case EAL_OPEN_READWRITE:
+		sys_flags = O_RDWR;
+		break;
+	default:
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (flags & EAL_OPEN_CREATE)
+		sys_flags |= O_CREAT;
+
+	ret = open(path, sys_flags, 0600);
+	if (ret < 0)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	int ret;
+
+	ret = ftruncate(fd, size);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	int sys_flags = 0;
+	int ret;
+
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCK_NB;
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+		sys_flags |= LOCK_EX;
+		break;
+	case EAL_FLOCK_SHARED:
+		sys_flags |= LOCK_SH;
+		break;
+	case EAL_FLOCK_UNLOCK:
+		sys_flags |= LOCK_UN;
+		break;
+	}
+
+	ret = flock(fd, sys_flags);
+	if (ret)
+		rte_errno = errno;
+
+	return ret;
+}
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
new file mode 100644
index 000000000..21029ba1a
--- /dev/null
+++ b/lib/librte_eal/unix/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2020 Dmitry Kozlyuk
+
+sources += files(
+	'eal_file.c',
+)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 01/12] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 02/12] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  6:03                 ` Kinsella, Ray
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 04/12] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
                                 ` (10 subsequent siblings)
  13 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson, Ray Kinsella, Neil Horman

Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:

* rte_mem_map()
* rte_mem_unmap()
* rte_mem_page_size()
* rte_mem_lock()

Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be safer and more
expressive. New symbols are internal. Being thin wrappers, they require
no special maintenance.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---

Not adding rte_eal_paging.h to the Doxygen index because, to my
understanding, the index only covers public API, and it was decided to
keep rte_eal_paging.h functions internal.
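
Usage sketch (illustrative only, fd is an opened file; real call sites
are in the fbarray and common memory changes below):

size_t sz = rte_mem_page_size();
/* Reserve address space, not yet backed by actual memory. */
void *va = eal_mem_reserve(NULL, sz, 0);
if (va == NULL)
	return -1; /* rte_errno is set */
/* Map the file over the reservation at the exact address. */
void *mem = rte_mem_map(va, sz, RTE_PROT_READ | RTE_PROT_WRITE,
		RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
if (mem == NULL) {
	eal_mem_free(va, sz);
	return -1;
}
/* ... use the mapping, then: */
rte_mem_unmap(mem, sz);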

 lib/librte_eal/common/eal_common_fbarray.c |  40 +++---
 lib/librte_eal/common/eal_common_memory.c  |  61 ++++-----
 lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
 lib/librte_eal/freebsd/Makefile            |   1 +
 lib/librte_eal/include/rte_eal_paging.h    |  98 +++++++++++++
 lib/librte_eal/linux/Makefile              |   1 +
 lib/librte_eal/linux/eal_memalloc.c        |   5 +-
 lib/librte_eal/rte_eal_version.map         |   9 ++
 lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
 lib/librte_eal/unix/meson.build            |   1 +
 10 files changed, 381 insertions(+), 65 deletions(-)
 create mode 100644 lib/librte_eal/include/rte_eal_paging.h
 create mode 100644 lib/librte_eal/unix/eal_unix_memory.c

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index c52ddb967..fd0292a64 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -5,15 +5,16 @@
 #include <fcntl.h>
 #include <inttypes.h>
 #include <limits.h>
-#include <sys/mman.h>
 #include <stdint.h>
 #include <errno.h>
 #include <string.h>
 #include <unistd.h>
 
 #include <rte_common.h>
-#include <rte_log.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
 
@@ -90,12 +91,9 @@ resize_and_map(int fd, void *addr, size_t len)
 		return -1;
 	}
 
-	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_FIXED, fd, 0);
+	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
+			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
 	if (map_addr != addr) {
-		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
-		/* pass errno up the chain */
-		rte_errno = errno;
 		return -1;
 	}
 	return 0;
@@ -733,7 +731,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -754,11 +752,13 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 
 	if (internal_config.no_shconf) {
 		/* remap virtual area as writable */
-		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
-		if (new_data == MAP_FAILED) {
+		static const int flags = RTE_MAP_FORCE_ADDRESS |
+			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
+		void *new_data = rte_mem_map(data, mmap_len,
+			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
+		if (new_data == NULL) {
 			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
-					__func__, strerror(errno));
+					__func__, rte_strerror(rte_errno));
 			goto fail;
 		}
 	} else {
@@ -820,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -858,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 		return -1;
 	}
 
-	page_sz = sysconf(_SC_PAGESIZE);
+	page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1) {
 		free(ma);
 		return -1;
@@ -909,7 +909,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	return 0;
 fail:
 	if (data)
-		munmap(data, mmap_len);
+		rte_mem_unmap(data, mmap_len);
 	if (fd >= 0)
 		close(fd);
 	free(ma);
@@ -937,8 +937,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -957,7 +956,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
 		goto out;
 	}
 
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, close fd and remove the tailq entry */
 	if (tmp->fd >= 0)
@@ -992,8 +991,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	 * really do anything about it, things will blow up either way.
 	 */
 
-	size_t page_sz = sysconf(_SC_PAGESIZE);
-
+	size_t page_sz = rte_mem_page_size();
 	if (page_sz == (size_t)-1)
 		return -1;
 
@@ -1042,7 +1040,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 		}
 		close(fd);
 	}
-	munmap(arr->data, mmap_len);
+	rte_mem_unmap(arr->data, mmap_len);
 
 	/* area is unmapped, remove the tailq entry */
 	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4c897a13f..aa377990f 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -11,13 +11,13 @@
 #include <string.h>
 #include <unistd.h>
 #include <inttypes.h>
-#include <sys/mman.h>
 #include <sys/queue.h>
 
 #include <rte_fbarray.h>
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_eal_memconfig.h>
+#include <rte_eal_paging.h>
 #include <rte_errno.h>
 #include <rte_log.h>
 
@@ -40,18 +40,10 @@
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
-#ifdef RTE_EXEC_ENV_LINUX
-#define RTE_DONTDUMP MADV_DONTDUMP
-#elif defined RTE_EXEC_ENV_FREEBSD
-#define RTE_DONTDUMP MADV_NOCORE
-#else
-#error "madvise doesn't support this OS"
-#endif
-
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags)
+	size_t page_sz, int flags, int reserve_flags)
 {
 	bool addr_is_hint, allow_shrink, unmap, no_align;
 	uint64_t map_sz;
@@ -59,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	uint8_t try = 0;
 
 	if (system_page_sz == 0)
-		system_page_sz = sysconf(_SC_PAGESIZE);
-
-	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+		system_page_sz = rte_mem_page_size();
 
 	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
 
@@ -105,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 			return NULL;
 		}
 
-		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
-				mmap_flags, -1, 0);
-		if (mapped_addr == MAP_FAILED && allow_shrink)
+		mapped_addr = eal_mem_reserve(
+			requested_addr, (size_t)map_sz, reserve_flags);
+		if ((mapped_addr == NULL) && allow_shrink)
 			*size -= page_sz;
 
-		if (mapped_addr != MAP_FAILED && addr_is_hint &&
-		    mapped_addr != requested_addr) {
+		if ((mapped_addr != NULL) && addr_is_hint &&
+				(mapped_addr != requested_addr)) {
 			try++;
 			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
 			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
 				/* hint was not used. Try with another offset */
-				munmap(mapped_addr, map_sz);
-				mapped_addr = MAP_FAILED;
+				eal_mem_free(mapped_addr, map_sz);
+				mapped_addr = NULL;
 				requested_addr = next_baseaddr;
 			}
 		}
 	} while ((allow_shrink || addr_is_hint) &&
-		 mapped_addr == MAP_FAILED && *size > 0);
+		(mapped_addr == NULL) && (*size > 0));
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.
@@ -132,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 
 	if (*size == 0) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
-			strerror(errno));
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
-	} else if (mapped_addr == MAP_FAILED) {
+	} else if (mapped_addr == NULL) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
-			strerror(errno));
-		/* pass errno up the call chain */
-		rte_errno = errno;
+			rte_strerror(rte_errno));
 		return NULL;
 	} else if (requested_addr != NULL && !addr_is_hint &&
 			aligned_addr != requested_addr) {
 		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
 			requested_addr, aligned_addr);
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 		rte_errno = EADDRNOTAVAIL;
 		return NULL;
 	} else if (requested_addr != NULL && addr_is_hint &&
@@ -161,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		aligned_addr, *size);
 
 	if (unmap) {
-		munmap(mapped_addr, map_sz);
+		eal_mem_free(mapped_addr, map_sz);
 	} else if (!no_align) {
 		void *map_end, *aligned_end;
 		size_t before_len, after_len;
@@ -179,19 +166,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 		/* unmap space before aligned mmap address */
 		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
 		if (before_len > 0)
-			munmap(mapped_addr, before_len);
+			eal_mem_free(mapped_addr, before_len);
 
 		/* unmap space after aligned end mmap address */
 		after_len = RTE_PTR_DIFF(map_end, aligned_end);
 		if (after_len > 0)
-			munmap(aligned_end, after_len);
+			eal_mem_free(aligned_end, after_len);
 	}
 
 	if (!unmap) {
 		/* Exclude these pages from a core dump. */
-		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
-			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
-				strerror(errno));
+		eal_mem_set_dump(aligned_addr, *size, false);
 	}
 
 	return aligned_addr;
@@ -547,10 +532,10 @@ rte_eal_memdevice_init(void)
 int
 rte_mem_lock_page(const void *virt)
 {
-	unsigned long virtual = (unsigned long)virt;
-	int page_size = getpagesize();
-	unsigned long aligned = (virtual & ~(page_size - 1));
-	return mlock((void *)aligned, page_size);
+	uintptr_t virtual = (uintptr_t)virt;
+	size_t page_size = rte_mem_page_size();
+	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
+	return rte_mem_lock((void *)aligned, page_size);
 }
 
 int
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 6733a2321..1696345c2 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -11,6 +11,7 @@
 
 #include <rte_dev.h>
 #include <rte_lcore.h>
+#include <rte_memory.h>
 
 /**
  * Structure storing internal configuration (per-lcore)
@@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
  */
 int rte_eal_check_module(const char *module_name);
 
+/**
+ * Memory reservation flags.
+ */
+enum eal_mem_reserve_flags {
+	/**
+	 * Reserve hugepages. May be unsupported by some platforms.
+	 */
+	EAL_RESERVE_HUGEPAGES = 1 << 0,
+	/**
+	 * Force reserving memory at the requested address.
+	 * This can be a destructive action depending on the implementation.
+	 *
+	 * @see RTE_MAP_FORCE_ADDRESS for description of possible consequences
+	 *      (although implementations are not required to use it).
+	 */
+	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
+};
+
 /**
  * Get virtual area of specified size from the OS.
  *
@@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
  *   Page size on which to align requested virtual area.
  * @param flags
  *   EAL_VIRTUAL_AREA_* flags.
- * @param mmap_flags
- *   Extra flags passed directly to mmap().
+ * @param reserve_flags
+ *   Extra flags passed directly to eal_mem_reserve().
  *
  * @return
  *   Virtual area address if successful.
@@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
 /**< immediately unmap reserved virtual area. */
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
-		size_t page_sz, int flags, int mmap_flags);
+		size_t page_sz, int flags, int reserve_flags);
 
 /**
  * Get cpu core_id.
@@ -493,4 +512,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
 int
 eal_file_truncate(int fd, ssize_t size);
 
+/**
+ * Reserve a region of virtual memory.
+ *
+ * Use eal_mem_free() to free reserved memory.
+ *
+ * @param requested_addr
+ *  A desired reservation address which must be page-aligned.
+ *  The system might not respect it.
+ *  NULL means the address will be chosen by the system.
+ * @param size
+ *  Reservation size. Must be a multiple of system page size.
+ * @param flags
+ *  Reservation options, a combination of eal_mem_reserve_flags.
+ * @return
+ *  Starting address of the reserved area on success, NULL on failure.
+ *  Callers must not access this memory until it is remapped.
+ */
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags);
+
+/**
+ * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ *
+ * If *virt* and *size* describe a part of the reserved region,
+ * only this part of the region is freed (accurately up to the system
+ * page size). If *virt* points to allocated memory, *size* must match
+ * the size specified on allocation. The behavior is undefined
+ * if the memory pointed to by *virt* was obtained from a source
+ * other than those listed above.
+ *
+ * @param virt
+ *  A virtual address in a region previously reserved.
+ * @param size
+ *  Number of bytes to unreserve.
+ */
+void
+eal_mem_free(void *virt, size_t size);
+
+/**
+ * Configure memory region inclusion into dumps.
+ *
+ * @param virt
+ *  Starting address of the region.
+ * @param size
+ *  Size of the region.
+ * @param dump
+ *  True to include memory into dumps, false to exclude.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump);
+
 #endif /* _EAL_PRIVATE_H_ */
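
For clarity, a minimal usage sketch of the reservation API declared
above (hypothetical caller code, not part of the patch; the length is
illustrative, and rte_mem_page_size() comes from rte_eal_paging.h,
introduced later in this patch):

	static int
	reserve_and_release(void)
	{
		size_t len = 4 * rte_mem_page_size();
		void *va = eal_mem_reserve(NULL, len, 0);

		if (va == NULL)
			return -1; /* rte_errno holds the OS error */

		/* ... map real memory into the reserved range here ... */

		eal_mem_free(va, len);
		return 0;
	}

The reservation carries no access rights until something is mapped over
it, which is why callers must not touch the area before remapping.
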
diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
index 0f8741d96..2374ba0b7 100644
--- a/lib/librte_eal/freebsd/Makefile
+++ b/lib/librte_eal/freebsd/Makefile
@@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
diff --git a/lib/librte_eal/include/rte_eal_paging.h b/lib/librte_eal/include/rte_eal_paging.h
new file mode 100644
index 000000000..ed98e70e9
--- /dev/null
+++ b/lib/librte_eal/include/rte_eal_paging.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+/**
+ * @file
+ * @internal
+ *
+ * Wrappers for OS facilities related to memory paging, used across DPDK.
+ */
+
+/** Memory protection flags. */
+enum rte_mem_prot {
+	RTE_PROT_READ = 1 << 0,   /**< Read access. */
+	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
+	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
+};
+
+/** Additional flags for memory mapping. */
+enum rte_map_flags {
+	/** Changes to the mapped memory are visible to other processes. */
+	RTE_MAP_SHARED = 1 << 0,
+	/** Mapping is not backed by a regular file. */
+	RTE_MAP_ANONYMOUS = 1 << 1,
+	/** Copy-on-write mapping, changes are invisible to other processes. */
+	RTE_MAP_PRIVATE = 1 << 2,
+	/**
+	 * Force mapping to the requested address. This flag should be used
+	 * with caution, because to fulfill the request the implementation
+	 * may remove all other mappings in the requested region. However,
+	 * it is not required to do so, thus mapping with this flag may fail.
+	 */
+	RTE_MAP_FORCE_ADDRESS = 1 << 3
+};
+
+/**
+ * Map a portion of an opened file or the page file into memory.
+ *
+ * This function is similar to POSIX mmap(3) with common MAP_ANONYMOUS
+ * extension, except for the return value.
+ *
+ * @param requested_addr
+ *  Desired virtual address for mapping. Can be NULL to let OS choose.
+ * @param size
+ *  Size of the mapping in bytes.
+ * @param prot
+ *  Protection flags, a combination of rte_mem_prot values.
+ * @param flags
+ *  Additional mapping flags, a combination of rte_map_flags.
+ * @param fd
+ *  Mapped file descriptor. Can be negative for anonymous mapping.
+ * @param offset
+ *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
+ * @return
+ *  Mapped address or NULL on failure and rte_errno is set to OS error.
+ */
+__rte_internal
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset);
+
+/**
+ * OS-independent implementation of POSIX munmap(3).
+ */
+__rte_internal
+int
+rte_mem_unmap(void *virt, size_t size);
+
+/**
+ * Get system page size. This function never fails.
+ *
+ * @return
+ *   Page size in bytes.
+ */
+__rte_internal
+size_t
+rte_mem_page_size(void);
+
+/**
+ * Lock in physical memory all pages crossed by the address region.
+ *
+ * @param virt
+ *   Base virtual address of the region.
+ * @param size
+ *   Size of the region.
+ * @return
+ *   0 on success, negative on error.
+ *
+ * @see rte_mem_page_size() to retrieve the page size.
+ * @see rte_mem_lock_page() to lock an entire single page.
+ */
+__rte_internal
+int
+rte_mem_lock(const void *virt, size_t size);
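
A short usage sketch for these wrappers (hypothetical EAL-internal
code; an anonymous mapping is shown, the file-backed case differs only
in passing a valid fd and offset):

	static int
	map_and_lock_one_page(void)
	{
		size_t sz = rte_mem_page_size();
		void *va = rte_mem_map(NULL, sz,
				RTE_PROT_READ | RTE_PROT_WRITE,
				RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS, -1, 0);

		if (va == NULL)
			return -1; /* rte_errno is set to the OS error */

		if (rte_mem_lock(va, sz) < 0) {
			rte_mem_unmap(va, sz);
			return -1;
		}

		/* ... use the locked page ... */

		return rte_mem_unmap(va, sz);
	}
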
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 331489f99..8febf2212 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
 
 # from unix dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
 
 # from arch dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index 2c717f8bd..bf29b83c6 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 mapped:
 	munmap(addr, alloc_sz);
 unmapped:
-	flags = MAP_FIXED;
+	flags = EAL_RESERVE_FORCE_ADDRESS;
 	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
 	if (new_addr != addr) {
 		if (new_addr != NULL)
@@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
-		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
+	eal_mem_set_dump(ms->addr, ms->len, false);
 
 	exit_early = false;
 
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d8038749a..196eef5af 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -387,3 +387,12 @@ EXPERIMENTAL {
 	rte_trace_regexp;
 	rte_trace_save;
 };
+
+INTERNAL {
+	global:
+
+	rte_mem_lock;
+	rte_mem_map;
+	rte_mem_page_size;
+	rte_mem_unmap;
+};
diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
new file mode 100644
index 000000000..ec7156df9
--- /dev/null
+++ b/lib/librte_eal/unix/eal_unix_memory.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+
+#include "eal_private.h"
+
+#ifdef RTE_EXEC_ENV_LINUX
+#define EAL_DONTDUMP MADV_DONTDUMP
+#define EAL_DODUMP   MADV_DODUMP
+#elif defined RTE_EXEC_ENV_FREEBSD
+#define EAL_DONTDUMP MADV_NOCORE
+#define EAL_DODUMP   MADV_CORE
+#else
+#error "madvise is not supported on this OS"
+#endif
+
+static void *
+mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
+	if (virt == MAP_FAILED) {
+		RTE_LOG(DEBUG, EAL,
+			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
+			requested_addr, size, prot, flags, fd, offset,
+			strerror(errno));
+		rte_errno = errno;
+		return NULL;
+	}
+	return virt;
+}
+
+static int
+mem_unmap(void *virt, size_t size)
+{
+	int ret = munmap(virt, size);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
+			virt, size, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+#ifdef MAP_HUGETLB
+		sys_flags |= MAP_HUGETLB;
+#else
+		rte_errno = ENOTSUP;
+		return NULL;
+#endif
+	}
+
+	if (flags & EAL_RESERVE_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_unmap(virt, size);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
+	int ret = madvise(virt, size, flags);
+	if (ret) {
+		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
+				virt, size, flags, strerror(errno));
+		rte_errno = errno;
+	}
+	return ret;
+}
+
+static int
+mem_rte_to_sys_prot(int prot)
+{
+	int sys_prot = PROT_NONE;
+
+	if (prot & RTE_PROT_READ)
+		sys_prot |= PROT_READ;
+	if (prot & RTE_PROT_WRITE)
+		sys_prot |= PROT_WRITE;
+	if (prot & RTE_PROT_EXECUTE)
+		sys_prot |= PROT_EXEC;
+
+	return sys_prot;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	int sys_flags = 0;
+	int sys_prot;
+
+	sys_prot = mem_rte_to_sys_prot(prot);
+
+	if (flags & RTE_MAP_SHARED)
+		sys_flags |= MAP_SHARED;
+	if (flags & RTE_MAP_ANONYMOUS)
+		sys_flags |= MAP_ANONYMOUS;
+	if (flags & RTE_MAP_PRIVATE)
+		sys_flags |= MAP_PRIVATE;
+	if (flags & RTE_MAP_FORCE_ADDRESS)
+		sys_flags |= MAP_FIXED;
+
+	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	return mem_unmap(virt, size);
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static size_t page_size;
+
+	if (!page_size)
+		page_size = sysconf(_SC_PAGESIZE);
+
+	return page_size;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	int ret = mlock(virt, size);
+	if (ret)
+		rte_errno = errno;
+	return ret;
+}
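
Note how EAL_RESERVE_FORCE_ADDRESS becomes MAP_FIXED in this
implementation: that is what lets the alloc_seg() failure path in
eal_memalloc.c (see the hunk earlier in this patch) restore a PROT_NONE
reservation over a mapping that could not be completed. A simplified
sketch of that pattern (addr and alloc_sz are assumed to describe the
failed mapping; the real code goes through eal_get_virtual_area()):

	/* Put the reservation back at the exact same address. */
	void *new_addr = eal_mem_reserve(addr, alloc_sz,
			EAL_RESERVE_FORCE_ADDRESS);
	if (new_addr != addr) {
		if (new_addr != NULL)
			eal_mem_free(new_addr, alloc_sz);
		/* A hole in VA space makes the segment unusable. */
	}
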
diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
index 21029ba1a..e733910a1 100644
--- a/lib/librte_eal/unix/meson.build
+++ b/lib/librte_eal/unix/meson.build
@@ -3,4 +3,5 @@
 
 sources += files(
 	'eal_file.c',
+	'eal_unix_memory.c',
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 04/12] eal/mem: extract common code for memseg list initialization
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (2 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15 13:13                 ` Thomas Monjalon
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 05/12] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
                                 ` (9 subsequent siblings)
  13 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Anatoly Burakov,
	Bruce Richardson

All supported OSes create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
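
For reference, the extracted helpers are meant to be called in this
order (a minimal sketch; msl, page_sz, n_segs, socket_id and
type_msl_idx are assumed to be prepared by the caller, and error
logging is trimmed):

	/* 1. Create the fbarray backing the segment list. */
	if (eal_memseg_list_init(msl, page_sz, n_segs, socket_id,
			type_msl_idx, true) < 0)
		return -1;

	/* 2. Reserve VA space for the whole list. */
	if (eal_memseg_list_alloc(msl, 0) < 0)
		return -1;

	/* 3. In no-huge mode, once memory is actually mapped at
	 *    msl->base_va, mark each page-sized segment as used.
	 */
	eal_memseg_list_populate(msl, msl->base_va, n_segs);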

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_memory.c |  96 ++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  62 ++++++++++++
 lib/librte_eal/freebsd/eal_memory.c       |  98 ++++--------------
 lib/librte_eal/linux/eal_memory.c         | 118 +++++-----------------
 4 files changed, 204 insertions(+), 170 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index aa377990f..1414460c7 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -25,6 +25,7 @@
 #include "eal_private.h"
 #include "eal_internal_cfg.h"
 #include "eal_memcfg.h"
+#include "eal_options.h"
 #include "malloc_heap.h"
 
 /*
@@ -182,6 +183,101 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
 	return aligned_addr;
 }
 
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+		uint64_t page_sz, int n_segs, int socket_id, bool heap)
+{
+	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
+			sizeof(struct rte_memseg))) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
+			rte_strerror(rte_errno));
+		return -1;
+	}
+
+	msl->page_sz = page_sz;
+	msl->socket_id = socket_id;
+	msl->base_va = NULL;
+	msl->heap = heap;
+
+	RTE_LOG(DEBUG, EAL,
+		"Memseg list allocated at socket %i, page size 0x%zxkB\n",
+		socket_id, (size_t)page_sz >> 10);
+
+	return 0;
+}
+
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+		int n_segs, int socket_id, int type_msl_idx, bool heap)
+{
+	char name[RTE_FBARRAY_NAME_LEN];
+
+	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
+		 type_msl_idx);
+
+	return eal_memseg_list_init_named(
+		msl, name, page_sz, n_segs, socket_id, heap);
+}
+
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags)
+{
+	size_t page_sz, mem_sz;
+	void *addr;
+
+	page_sz = msl->page_sz;
+	mem_sz = page_sz * msl->memseg_arr.len;
+
+	addr = eal_get_virtual_area(
+		msl->base_va, &mem_sz, page_sz, 0, reserve_flags);
+	if (addr == NULL) {
+#ifndef RTE_EXEC_ENV_WINDOWS
+		/* The hint would be misleading on Windows, because address
+		 * is by default system-selected (base VA = 0).
+		 * However, this function is called from many places,
+		 * including common code, so don't duplicate the message.
+		 */
+		if (rte_errno == EADDRNOTAVAIL)
+			RTE_LOG(ERR, EAL, "Cannot reserve %llu bytes at [%p] - "
+				"please use '--" OPT_BASE_VIRTADDR "' option\n",
+				(unsigned long long)mem_sz, msl->base_va);
+#endif
+		return -1;
+	}
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	RTE_LOG(DEBUG, EAL, "VA reserved for memseg list at %p, size %zx\n",
+			addr, mem_sz);
+
+	return 0;
+}
+
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs)
+{
+	size_t page_sz = msl->page_sz;
+	int i;
+
+	for (i = 0; i < n_segs; i++) {
+		struct rte_fbarray *arr = &msl->memseg_arr;
+		struct rte_memseg *ms = rte_fbarray_get(arr, i);
+
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			ms->iova = (uintptr_t)addr;
+		else
+			ms->iova = RTE_BAD_IOVA;
+		ms->addr = addr;
+		ms->hugepage_sz = page_sz;
+		ms->socket_id = 0;
+		ms->len = page_sz;
+
+		rte_fbarray_set_used(arr, i);
+
+		addr = RTE_PTR_ADD(addr, page_sz);
+	}
+}
+
 static struct rte_memseg *
 virt2memseg(const void *addr, const struct rte_memseg_list *msl)
 {
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 1696345c2..75521d086 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -254,6 +254,68 @@ void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int reserve_flags);
 
+/**
+ * Initialize a memory segment list and create its backing storage.
+ *
+ * @param msl
+ *  Memory segment list to be filled.
+ * @param name
+ *  Name for the backing storage.
+ * @param page_sz
+ *  Size of segment pages in the MSL.
+ * @param n_segs
+ *  Number of segments.
+ * @param socket_id
+ *  Socket ID. Must not be SOCKET_ID_ANY.
+ * @param heap
+ *  Mark MSL as pointing to a heap.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_init_named(struct rte_memseg_list *msl, const char *name,
+	uint64_t page_sz, int n_segs, int socket_id, bool heap);
+
+/**
+ * Initialize memory segment list and create its backing storage
+ * with a name corresponding to MSL parameters.
+ *
+ * @param type_msl_idx
+ *  Index of the MSL among other MSLs of the same socket and page size.
+ *
+ * @see eal_memseg_list_init_named for remaining parameters description.
+ */
+int
+eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
+	int n_segs, int socket_id, int type_msl_idx, bool heap);
+
+/**
+ * Reserve VA space for a memory segment list
+ * previously initialized with eal_memseg_list_init().
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param reserve_flags
+ *  Extra memory reservation flags. Can be 0 if unnecessary.
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
+ */
+int
+eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
+
+/**
+ * Populate MSL, each segment is one page long.
+ *
+ * @param msl
+ *  Initialized memory segment list with page size defined.
+ * @param addr
+ *  Starting address of list segments.
+ * @param n_segs
+ *  Number of segments to populate.
+ */
+void
+eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
+
 /**
  * Get cpu core_id.
  *
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 5bc2da160..2eb70c2fe 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -64,55 +64,34 @@ rte_eal_hugepage_init(void)
 	/* for debug purposes, hugetlbfs can be disabled */
 	if (internal_config.no_hugetlbfs) {
 		struct rte_memseg_list *msl;
-		struct rte_fbarray *arr;
-		struct rte_memseg *ms;
-		uint64_t page_sz;
-		int n_segs, cur_seg;
+		uint64_t mem_sz, page_sz;
+		int n_segs;
 
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-				sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
-		addr = mmap(NULL, internal_config.memory,
-				PROT_READ | PROT_WRITE,
+		addr = mmap(NULL, mem_sz, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (addr == MAP_FAILED) {
 			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
 					strerror(errno));
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->len = internal_config.memory;
-		msl->socket_id = 0;
-		msl->heap = 1;
-
-		/* populate memsegs. each memseg is 1 page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->len = page_sz;
-			ms->socket_id = 0;
+		msl->base_va = addr;
+		msl->len = mem_sz;
 
-			rte_fbarray_set_used(arr, cur_seg);
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			addr = RTE_PTR_ADD(addr, page_sz);
-		}
 		return 0;
 	}
 
@@ -336,64 +315,25 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
 	int flags = 0;
 
 #ifdef RTE_ARCH_PPC_64
-	flags |= MAP_HUGETLB;
+	flags |= EAL_RESERVE_HUGEPAGES;
 #endif
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, flags);
 }
 
-
 static int
 memseg_primary_init(void)
 {
@@ -479,7 +419,7 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+			if (memseg_list_init(msl, hugepage_sz, n_segs,
 					0, type_msl_idx))
 				return -1;
 
@@ -487,7 +427,7 @@ memseg_primary_init(void)
 			total_type_mem = total_segs * hugepage_sz;
 			type_msl_idx++;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				return -1;
 			}
@@ -518,7 +458,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 7a9c97ff8..9cc39e6fb 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -802,7 +802,7 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 }
 
 static int
-free_memseg_list(struct rte_memseg_list *msl)
+memseg_list_free(struct rte_memseg_list *msl)
 {
 	if (rte_fbarray_destroy(&msl->memseg_arr)) {
 		RTE_LOG(ERR, EAL, "Cannot destroy memseg list\n");
@@ -812,58 +812,18 @@ free_memseg_list(struct rte_memseg_list *msl)
 	return 0;
 }
 
-#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
 static int
-alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz,
+memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
 		int n_segs, int socket_id, int type_msl_idx)
 {
-	char name[RTE_FBARRAY_NAME_LEN];
-
-	snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id,
-		 type_msl_idx);
-	if (rte_fbarray_init(&msl->memseg_arr, name, n_segs,
-			sizeof(struct rte_memseg))) {
-		RTE_LOG(ERR, EAL, "Cannot allocate memseg list: %s\n",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-
-	msl->page_sz = page_sz;
-	msl->socket_id = socket_id;
-	msl->base_va = NULL;
-	msl->heap = 1; /* mark it as a heap segment */
-
-	RTE_LOG(DEBUG, EAL, "Memseg list allocated: 0x%zxkB at socket %i\n",
-			(size_t)page_sz >> 10, socket_id);
-
-	return 0;
+	return eal_memseg_list_init(
+		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
 }
 
 static int
-alloc_va_space(struct rte_memseg_list *msl)
+memseg_list_alloc(struct rte_memseg_list *msl)
 {
-	uint64_t page_sz;
-	size_t mem_sz;
-	void *addr;
-	int flags = 0;
-
-	page_sz = msl->page_sz;
-	mem_sz = page_sz * msl->memseg_arr.len;
-
-	addr = eal_get_virtual_area(msl->base_va, &mem_sz, page_sz, 0, flags);
-	if (addr == NULL) {
-		if (rte_errno == EADDRNOTAVAIL)
-			RTE_LOG(ERR, EAL, "Could not mmap %llu bytes at [%p] - "
-				"please use '--" OPT_BASE_VIRTADDR "' option\n",
-				(unsigned long long)mem_sz, msl->base_va);
-		else
-			RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
-		return -1;
-	}
-	msl->base_va = addr;
-	msl->len = mem_sz;
-
-	return 0;
+	return eal_memseg_list_alloc(msl, 0);
 }
 
 /*
@@ -1009,13 +969,16 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
+			if (memseg_list_init(msl, page_sz, n_segs, socket,
 						msl_idx) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (alloc_va_space(msl) < 0)
+			if (memseg_list_alloc(msl) < 0) {
+				RTE_LOG(ERR, EAL, "Cannot preallocate 0x%"PRIx64"kB hugepages\n",
+					page_sz >> 10);
 				return -1;
+			}
 		}
 	}
 	return 0;
@@ -1323,8 +1286,6 @@ eal_legacy_hugepage_init(void)
 	struct rte_mem_config *mcfg;
 	struct hugepage_file *hugepage = NULL, *tmp_hp = NULL;
 	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	struct rte_fbarray *arr;
-	struct rte_memseg *ms;
 
 	uint64_t memory[RTE_MAX_NUMA_NODES];
 
@@ -1343,7 +1304,7 @@ eal_legacy_hugepage_init(void)
 		void *prealloc_addr;
 		size_t mem_sz;
 		struct rte_memseg_list *msl;
-		int n_segs, cur_seg, fd, flags;
+		int n_segs, fd, flags;
 #ifdef MEMFD_SUPPORTED
 		int memfd;
 #endif
@@ -1358,12 +1319,12 @@ eal_legacy_hugepage_init(void)
 		/* create a memseg list */
 		msl = &mcfg->memsegs[0];
 
+		mem_sz = internal_config.memory;
 		page_sz = RTE_PGSIZE_4K;
-		n_segs = internal_config.memory / page_sz;
+		n_segs = mem_sz / page_sz;
 
-		if (rte_fbarray_init(&msl->memseg_arr, "nohugemem", n_segs,
-					sizeof(struct rte_memseg))) {
-			RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n");
+		if (eal_memseg_list_init_named(
+				msl, "nohugemem", page_sz, n_segs, 0, true)) {
 			return -1;
 		}
 
@@ -1400,16 +1361,12 @@ eal_legacy_hugepage_init(void)
 		/* preallocate address space for the memory, so that it can be
 		 * fit into the DMA mask.
 		 */
-		mem_sz = internal_config.memory;
-		prealloc_addr = eal_get_virtual_area(
-				NULL, &mem_sz, page_sz, 0, 0);
-		if (prealloc_addr == NULL) {
-			RTE_LOG(ERR, EAL,
-					"%s: reserving memory area failed: "
-					"%s\n",
-					__func__, strerror(errno));
+		if (eal_memseg_list_alloc(msl, 0)) {
+			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
+
+		prealloc_addr = msl->base_va;
 		addr = mmap(prealloc_addr, mem_sz, PROT_READ | PROT_WRITE,
 				flags | MAP_FIXED, fd, 0);
 		if (addr == MAP_FAILED || addr != prealloc_addr) {
@@ -1418,11 +1375,6 @@ eal_legacy_hugepage_init(void)
 			munmap(prealloc_addr, mem_sz);
 			return -1;
 		}
-		msl->base_va = addr;
-		msl->page_sz = page_sz;
-		msl->socket_id = 0;
-		msl->len = mem_sz;
-		msl->heap = 1;
 
 		/* we're in single-file segments mode, so only the segment list
 		 * fd needs to be set up.
@@ -1434,24 +1386,8 @@ eal_legacy_hugepage_init(void)
 			}
 		}
 
-		/* populate memsegs. each memseg is one page long */
-		for (cur_seg = 0; cur_seg < n_segs; cur_seg++) {
-			arr = &msl->memseg_arr;
+		eal_memseg_list_populate(msl, addr, n_segs);
 
-			ms = rte_fbarray_get(arr, cur_seg);
-			if (rte_eal_iova_mode() == RTE_IOVA_VA)
-				ms->iova = (uintptr_t)addr;
-			else
-				ms->iova = RTE_BAD_IOVA;
-			ms->addr = addr;
-			ms->hugepage_sz = page_sz;
-			ms->socket_id = 0;
-			ms->len = page_sz;
-
-			rte_fbarray_set_used(arr, cur_seg);
-
-			addr = RTE_PTR_ADD(addr, (size_t)page_sz);
-		}
 		if (mcfg->dma_maskbits &&
 		    rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
 			RTE_LOG(ERR, EAL,
@@ -2191,7 +2127,7 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
+				if (memseg_list_init(msl, hugepage_sz, n_segs,
 						socket_id, type_msl_idx)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
@@ -2200,13 +2136,13 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (alloc_va_space(msl)) {
+				if (memseg_list_alloc(msl)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
 					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list, retrying with different page size\n");
 					/* deallocate memseg list */
-					if (free_memseg_list(msl))
+					if (memseg_list_free(msl))
 						return -1;
 					break;
 				}
@@ -2395,11 +2331,11 @@ memseg_primary_init(void)
 			}
 			msl = &mcfg->memsegs[msl_idx++];
 
-			if (alloc_memseg_list(msl, pagesz, n_segs,
+			if (memseg_list_init(msl, pagesz, n_segs,
 					socket_id, cur_seglist))
 				goto out;
 
-			if (alloc_va_space(msl)) {
+			if (memseg_list_alloc(msl)) {
 				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
 				goto out;
 			}
@@ -2433,7 +2369,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (alloc_va_space(msl)) {
+		if (memseg_list_alloc(msl)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 05/12] eal/mem: extract common code for dynamic memory allocation
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (3 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 04/12] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 06/12] trace: add size_t field emitter Dmitry Kozlyuk
                                 ` (8 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Anatoly Burakov, Bruce Richardson

Code in Linux EAL that supports dynamic memory allocation (as opposed to
static allocation used by FreeBSD) is not OS-dependent and can be reused
by Windows EAL. Move such code to a file compiled only for the OSes that
require it. Keep Anatoly Burakov as maintainer of the extracted code.
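
With this split, a platform EAL that supports dynamic allocation needs
only two calls into the common code (a sketch of the intended call
sites; both functions take no arguments and return 0 on success,
-1 on failure):

	/* While creating memseg lists: */
	if (eal_dynmem_memseg_lists_init() < 0)
		return -1;

	/* While preallocating hugepages (non-legacy mode): */
	if (eal_dynmem_hugepage_init() < 0)
		return -1;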

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 MAINTAINERS                               |   1 +
 lib/librte_eal/common/eal_common_dynmem.c | 521 +++++++++++++++++++++
 lib/librte_eal/common/eal_private.h       |  43 +-
 lib/librte_eal/common/meson.build         |   4 +
 lib/librte_eal/freebsd/eal_memory.c       |  12 +-
 lib/librte_eal/linux/Makefile             |   1 +
 lib/librte_eal/linux/eal_memory.c         | 523 +---------------------
 7 files changed, 582 insertions(+), 523 deletions(-)
 create mode 100644 lib/librte_eal/common/eal_common_dynmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4d162efd6..241dbc3d7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -209,6 +209,7 @@ F: lib/librte_eal/include/rte_fbarray.h
 F: lib/librte_eal/include/rte_mem*
 F: lib/librte_eal/include/rte_malloc.h
 F: lib/librte_eal/common/*malloc*
+F: lib/librte_eal/common/eal_common_dynmem.c
 F: lib/librte_eal/common/eal_common_fbarray.c
 F: lib/librte_eal/common/eal_common_mem*
 F: lib/librte_eal/common/eal_hugepages.h
diff --git a/lib/librte_eal/common/eal_common_dynmem.c b/lib/librte_eal/common/eal_common_dynmem.c
new file mode 100644
index 000000000..6b07672d0
--- /dev/null
+++ b/lib/librte_eal/common/eal_common_dynmem.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation.
+ * Copyright(c) 2013 6WIND S.A.
+ */
+
+#include <inttypes.h>
+#include <string.h>
+
+#include <rte_log.h>
+#include <rte_string_fns.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+
+/** @file Functions common to EALs that support dynamic memory allocation. */
+
+int
+eal_dynmem_memseg_lists_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct memtype {
+		uint64_t page_sz;
+		int socket_id;
+	} *memtypes = NULL;
+	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
+	struct rte_memseg_list *msl;
+	uint64_t max_mem, max_mem_per_type;
+	unsigned int max_seglists_per_type;
+	unsigned int n_memtypes, cur_type;
+
+	/* no-huge does not need this at all */
+	if (internal_config.no_hugetlbfs)
+		return 0;
+
+	/*
+	 * figuring out amount of memory we're going to have is a long and very
+	 * involved process. the basic element we're operating with is a memory
+	 * type, defined as a combination of NUMA node ID and page size (so that
+	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
+	 *
+	 * deciding amount of memory going towards each memory type is a
+	 * balancing act between maximum segments per type, maximum memory per
+	 * type, and number of detected NUMA nodes. the goal is to make sure
+	 * each memory type gets at least one memseg list.
+	 *
+	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
+	 *
+	 * the total amount of memory per type is limited by either
+	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
+	 * of detected NUMA nodes. additionally, maximum number of segments per
+	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
+	 * smaller page sizes, it can take hundreds of thousands of segments to
+	 * reach the above specified per-type memory limits.
+	 *
+	 * additionally, each type may have multiple memseg lists associated
+	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
+	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
+	 *
+	 * the number of memseg lists per type is decided based on the above
+	 * limits, and also taking number of detected NUMA nodes, to make sure
+	 * that we don't run out of memseg lists before we populate all NUMA
+	 * nodes with memory.
+	 *
+	 * we do this in three stages. first, we collect the number of types.
+	 * then, we figure out memory constraints and populate the list of
+	 * would-be memseg lists. then, we go ahead and allocate the memseg
+	 * lists.
+	 */
+
+	/* create space for mem types */
+	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
+	memtypes = calloc(n_memtypes, sizeof(*memtypes));
+	if (memtypes == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
+		return -1;
+	}
+
+	/* populate mem types */
+	cur_type = 0;
+	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
+			hpi_idx++) {
+		struct hugepage_info *hpi;
+		uint64_t hugepage_sz;
+
+		hpi = &internal_config.hugepage_info[hpi_idx];
+		hugepage_sz = hpi->hugepage_sz;
+
+		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
+			int socket_id = rte_socket_id_by_idx(i);
+
+#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
+			/* we can still sort pages by socket in legacy mode */
+			if (!internal_config.legacy_mem && socket_id > 0)
+				break;
+#endif
+			memtypes[cur_type].page_sz = hugepage_sz;
+			memtypes[cur_type].socket_id = socket_id;
+
+			RTE_LOG(DEBUG, EAL, "Detected memory type: "
+				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
+				socket_id, hugepage_sz);
+		}
+	}
+	/* number of memtypes could have been lower due to no NUMA support */
+	n_memtypes = cur_type;
+
+	/* set up limits for types */
+	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
+	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
+			max_mem / n_memtypes);
+	/*
+	 * limit maximum number of segment lists per type to ensure there's
+	 * space for memseg lists for all NUMA nodes with all page sizes
+	 */
+	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
+
+	if (max_seglists_per_type == 0) {
+		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
+			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+		goto out;
+	}
+
+	/* go through all mem types and create segment lists */
+	msl_idx = 0;
+	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
+		unsigned int cur_seglist, n_seglists, n_segs;
+		unsigned int max_segs_per_type, max_segs_per_list;
+		struct memtype *type = &memtypes[cur_type];
+		uint64_t max_mem_per_list, pagesz;
+		int socket_id;
+
+		pagesz = type->page_sz;
+		socket_id = type->socket_id;
+
+		/*
+		 * we need to create segment lists for this type. we must take
+		 * into account the following things:
+		 *
+		 * 1. total amount of memory we can use for this memory type
+		 * 2. total amount of memory per memseg list allowed
+		 * 3. number of segments needed to fit the amount of memory
+		 * 4. number of segments allowed per type
+		 * 5. number of segments allowed per memseg list
+		 * 6. number of memseg lists we are allowed to take up
+		 */
+
+		/* calculate how much segments we will need in total */
+		max_segs_per_type = max_mem_per_type / pagesz;
+		/* limit number of segments to maximum allowed per type */
+		max_segs_per_type = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
+		/* limit number of segments to maximum allowed per list */
+		max_segs_per_list = RTE_MIN(max_segs_per_type,
+				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
+
+		/* calculate how much memory we can have per segment list */
+		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
+				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
+
+		/* calculate how many segments each segment list will have */
+		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
+
+		/* calculate how many segment lists we can have */
+		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
+				max_mem_per_type / max_mem_per_list);
+
+		/* limit number of segment lists according to our maximum */
+		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
+
+		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
+				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
+			n_seglists, n_segs, socket_id, pagesz);
+
+		/* create all segment lists */
+		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
+			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+				RTE_LOG(ERR, EAL,
+					"No more space in memseg lists, please increase %s\n",
+					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
+				goto out;
+			}
+			msl = &mcfg->memsegs[msl_idx++];
+
+			if (eal_memseg_list_init(msl, pagesz, n_segs,
+					socket_id, cur_seglist, true))
+				goto out;
+
+			if (eal_memseg_list_alloc(msl, 0)) {
+				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
+				goto out;
+			}
+		}
+	}
+	/* we're successful */
+	ret = 0;
+out:
+	free(memtypes);
+	return ret;
+}
+
+static int __rte_unused
+hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct hugepage_info *hpi = arg;
+
+	if (msl->page_sz != hpi->hugepage_sz)
+		return 0;
+
+	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
+	return 0;
+}
+
+static int
+limits_callback(int socket_id, size_t cur_limit, size_t new_len)
+{
+	RTE_SET_USED(socket_id);
+	RTE_SET_USED(cur_limit);
+	RTE_SET_USED(new_len);
+	return -1;
+}
+
+int
+eal_dynmem_hugepage_init(void)
+{
+	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
+	uint64_t memory[RTE_MAX_NUMA_NODES];
+	int hp_sz_idx, socket_id;
+
+	memset(used_hp, 0, sizeof(used_hp));
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+#ifndef RTE_ARCH_64
+		struct hugepage_info dummy;
+		unsigned int i;
+#endif
+		/* also initialize used_hp hugepage sizes in used_hp */
+		struct hugepage_info *hpi;
+		hpi = &internal_config.hugepage_info[hp_sz_idx];
+		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
+
+#ifndef RTE_ARCH_64
+		/* for 32-bit, limit number of pages on socket to whatever we've
+		 * preallocated, as we cannot allocate more.
+		 */
+		memset(&dummy, 0, sizeof(dummy));
+		dummy.hugepage_sz = hpi->hugepage_sz;
+		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
+			return -1;
+
+		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
+			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
+					dummy.num_pages[i]);
+		}
+#endif
+	}
+
+	/* make a copy of socket_mem, needed for balanced allocation. */
+	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
+		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
+
+	/* calculate final number of pages */
+	if (eal_dynmem_calc_num_pages_per_socket(memory,
+			internal_config.hugepage_info, used_hp,
+			internal_config.num_hugepage_sizes) < 0)
+		return -1;
+
+	for (hp_sz_idx = 0;
+			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
+			hp_sz_idx++) {
+		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
+				socket_id++) {
+			struct rte_memseg **pages;
+			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
+			unsigned int num_pages = hpi->num_pages[socket_id];
+			unsigned int num_pages_alloc;
+
+			if (num_pages == 0)
+				continue;
+
+			RTE_LOG(DEBUG, EAL,
+				"Allocating %u pages of size %" PRIu64 "M "
+				"on socket %i\n",
+				num_pages, hpi->hugepage_sz >> 20, socket_id);
+
+			/* we may not be able to allocate all pages in one go,
+			 * because we break up our memory map into multiple
+			 * memseg lists. therefore, try allocating multiple
+			 * times and see if we can get the desired number of
+			 * pages from multiple allocations.
+			 */
+
+			num_pages_alloc = 0;
+			do {
+				int i, cur_pages, needed;
+
+				needed = num_pages - num_pages_alloc;
+
+				pages = malloc(sizeof(*pages) * needed);
+
+				/* do not request exact number of pages */
+				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
+						needed, hpi->hugepage_sz,
+						socket_id, false);
+				if (cur_pages <= 0) {
+					free(pages);
+					return -1;
+				}
+
+				/* mark preallocated pages as unfreeable */
+				for (i = 0; i < cur_pages; i++) {
+					struct rte_memseg *ms = pages[i];
+					ms->flags |=
+						RTE_MEMSEG_FLAG_DO_NOT_FREE;
+				}
+				free(pages);
+
+				num_pages_alloc += cur_pages;
+			} while (num_pages_alloc != num_pages);
+		}
+	}
+
+	/* if socket limits were specified, set them */
+	if (internal_config.force_socket_limits) {
+		unsigned int i;
+		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
+			uint64_t limit = internal_config.socket_limit[i];
+			if (limit == 0)
+				continue;
+			if (rte_mem_alloc_validator_register("socket-limit",
+					limits_callback, i, limit))
+				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
+		}
+	}
+	return 0;
+}
+
+__rte_unused /* function is unused on 32-bit builds */
+static inline uint64_t
+get_socket_mem_size(int socket)
+{
+	uint64_t size = 0;
+	unsigned int i;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		size += hpi->hugepage_sz * hpi->num_pages[socket];
+	}
+
+	return size;
+}
+
+int
+eal_dynmem_calc_num_pages_per_socket(
+	uint64_t *memory, struct hugepage_info *hp_info,
+	struct hugepage_info *hp_used, unsigned int num_hp_info)
+{
+	unsigned int socket, j, i = 0;
+	unsigned int requested, available;
+	int total_num_pages = 0;
+	uint64_t remaining_mem, cur_mem;
+	uint64_t total_mem = internal_config.memory;
+
+	if (num_hp_info == 0)
+		return -1;
+
+	/* if specific memory amounts per socket weren't requested */
+	if (internal_config.force_sockets == 0) {
+		size_t total_size;
+#ifdef RTE_ARCH_64
+		int cpu_per_socket[RTE_MAX_NUMA_NODES];
+		size_t default_size;
+		unsigned int lcore_id;
+
+		/* Compute number of cores per socket */
+		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
+		RTE_LCORE_FOREACH(lcore_id) {
+			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
+		}
+
+		/*
+		 * Automatically spread requested memory amongst detected
+		 * sockets according to number of cores from CPU mask present
+		 * on each socket.
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+
+			/* Set memory amount per socket */
+			default_size = internal_config.memory *
+				cpu_per_socket[socket] / rte_lcore_count();
+
+			/* Limit to maximum available memory on socket */
+			default_size = RTE_MIN(
+				default_size, get_socket_mem_size(socket));
+
+			/* Update sizes */
+			memory[socket] = default_size;
+			total_size -= default_size;
+		}
+
+		/*
+		 * If some memory is remaining, try to allocate it by getting
+		 * all available memory from sockets, one after the other.
+		 */
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			/* take whatever is available */
+			default_size = RTE_MIN(
+				get_socket_mem_size(socket) - memory[socket],
+				total_size);
+
+			/* Update sizes */
+			memory[socket] += default_size;
+			total_size -= default_size;
+		}
+#else
+		/* in 32-bit mode, allocate all of the memory only on master
+		 * lcore socket
+		 */
+		total_size = internal_config.memory;
+		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
+				socket++) {
+			struct rte_config *cfg = rte_eal_get_configuration();
+			unsigned int master_lcore_socket;
+
+			master_lcore_socket =
+				rte_lcore_to_socket_id(cfg->master_lcore);
+
+			if (master_lcore_socket != socket)
+				continue;
+
+			/* Update sizes */
+			memory[socket] = total_size;
+			break;
+		}
+#endif
+	}
+
+	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0;
+			socket++) {
+		/* skips if the memory on specific socket wasn't requested */
+		for (i = 0; i < num_hp_info && memory[socket] != 0; i++) {
+			rte_strscpy(hp_used[i].hugedir, hp_info[i].hugedir,
+				sizeof(hp_used[i].hugedir));
+			hp_used[i].num_pages[socket] = RTE_MIN(
+					memory[socket] / hp_info[i].hugepage_sz,
+					hp_info[i].num_pages[socket]);
+
+			cur_mem = hp_used[i].num_pages[socket] *
+					hp_used[i].hugepage_sz;
+
+			memory[socket] -= cur_mem;
+			total_mem -= cur_mem;
+
+			total_num_pages += hp_used[i].num_pages[socket];
+
+			/* check if we have met all memory requests */
+			if (memory[socket] == 0)
+				break;
+
+			/* Check if we have any more pages left at this size,
+			 * if so, move on to next size.
+			 */
+			if (hp_used[i].num_pages[socket] ==
+					hp_info[i].num_pages[socket])
+				continue;
+			/* At this point we know that there are more pages
+			 * available that are bigger than the memory we want,
+			 * so lets see if we can get enough from other page
+			 * sizes.
+			 */
+			remaining_mem = 0;
+			for (j = i+1; j < num_hp_info; j++)
+				remaining_mem += hp_info[j].hugepage_sz *
+				hp_info[j].num_pages[socket];
+
+			/* Is there enough other memory?
+			 * If not, allocate another page and quit.
+			 */
+			if (remaining_mem < memory[socket]) {
+				cur_mem = RTE_MIN(
+					memory[socket], hp_info[i].hugepage_sz);
+				memory[socket] -= cur_mem;
+				total_mem -= cur_mem;
+				hp_used[i].num_pages[socket]++;
+				total_num_pages++;
+				break; /* we are done with this socket */
+			}
+		}
+
+		/* if we didn't satisfy all memory requirements per socket */
+		if (memory[socket] > 0 &&
+				internal_config.socket_mem[socket] != 0) {
+			/* to prevent icc errors */
+			requested = (unsigned int)(
+				internal_config.socket_mem[socket] / 0x100000);
+			available = requested -
+				((unsigned int)(memory[socket] / 0x100000));
+			RTE_LOG(ERR, EAL, "Not enough memory available on "
+				"socket %u! Requested: %uMB, available: %uMB\n",
+				socket, requested, available);
+			return -1;
+		}
+	}
+
+	/* if we didn't satisfy total memory requirements */
+	if (total_mem > 0) {
+		requested = (unsigned int)(internal_config.memory / 0x100000);
+		available = requested - (unsigned int)(total_mem / 0x100000);
+		RTE_LOG(ERR, EAL, "Not enough memory available! "
+			"Requested: %uMB, available: %uMB\n",
+			requested, available);
+		return -1;
+	}
+	return total_num_pages;
+}
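
A worked example of the balancing above, assuming hugepage sizes are
listed largest first: requesting 1536 MB on a socket offering
2 x 1 GB pages and 512 x 2 MB pages first takes
min(1536 MB / 1024 MB, 2) = 1 gigabyte page, leaving 512 MB to satisfy.
Since the remaining smaller pages can cover that (512 x 2 MB =
1024 MB), the loop moves on and takes min(512 MB / 2 MB, 512) = 256 of
them, 257 pages in total. Had the smaller sizes been insufficient, the
code would instead take one extra 1 GB page and stop for that socket.
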
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 75521d086..0592fcd69 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -13,6 +13,8 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 
+#include "eal_internal_cfg.h"
+
 /**
  * Structure storing internal configuration (per-lcore)
  */
@@ -316,6 +318,45 @@ eal_memseg_list_alloc(struct rte_memseg_list *msl, int reserve_flags);
 void
 eal_memseg_list_populate(struct rte_memseg_list *msl, void *addr, int n_segs);
 
+/**
+ * Distribute available memory between MSLs.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_memseg_lists_init(void);
+
+/**
+ * Preallocate hugepages for dynamic allocation.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_hugepage_init(void);
+
+/**
+ * Given the list of hugepage sizes and the number of pages thereof,
+ * calculate the best number of pages of each size to fulfill the request
+ * for RAM on each NUMA node.
+ *
+ * @param memory
+ *  Amounts of memory requested for each NUMA node of RTE_MAX_NUMA_NODES.
+ * @param hp_info
+ *  Information about hugepages of different size.
+ * @param hp_used
+ *  Receives information about used hugepages of each size.
+ * @param num_hp_info
+ *  Number of elements in hp_info and hp_used.
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int
+eal_dynmem_calc_num_pages_per_socket(
+		uint64_t *memory, struct hugepage_info *hp_info,
+		struct hugepage_info *hp_used, unsigned int num_hp_info);
+
 /**
  * Get cpu core_id.
  *
@@ -595,7 +636,7 @@ void *
 eal_mem_reserve(void *requested_addr, size_t size, int flags);
 
 /**
- * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
+ * Free memory obtained by eal_mem_reserve() and possibly allocated.
  *
  * If *virt* and *size* describe a part of the reserved region,
  * only this part of the region is freed (accurately up to the system
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 55aaeb18e..d91c22220 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -56,3 +56,7 @@ sources += files(
 	'rte_reciprocal.c',
 	'rte_service.c',
 )
+
+if is_linux
+	sources += files('eal_common_dynmem.c')
+endif
diff --git a/lib/librte_eal/freebsd/eal_memory.c b/lib/librte_eal/freebsd/eal_memory.c
index 2eb70c2fe..72a30f21a 100644
--- a/lib/librte_eal/freebsd/eal_memory.c
+++ b/lib/librte_eal/freebsd/eal_memory.c
@@ -315,14 +315,6 @@ get_mem_amount(uint64_t page_sz, uint64_t max_mem)
 	return RTE_ALIGN(area_sz, page_sz);
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, false);
-}
-
 static int
 memseg_list_alloc(struct rte_memseg_list *msl)
 {
@@ -419,8 +411,8 @@ memseg_primary_init(void)
 					cur_max_mem);
 			n_segs = cur_mem / hugepage_sz;
 
-			if (memseg_list_init(msl, hugepage_sz, n_segs,
-					0, type_msl_idx))
+			if (eal_memseg_list_init(msl, hugepage_sz, n_segs,
+					0, type_msl_idx, false))
 				return -1;
 
 			total_segs += msl->memseg_arr.len;
diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
index 8febf2212..07ce643ba 100644
--- a/lib/librte_eal/linux/Makefile
+++ b/lib/librte_eal/linux/Makefile
@@ -50,6 +50,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_timer.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memzone.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_launch.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_dynmem.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_mcfg.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memalloc.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_common_memory.c
diff --git a/lib/librte_eal/linux/eal_memory.c b/lib/librte_eal/linux/eal_memory.c
index 9cc39e6fb..5986dab23 100644
--- a/lib/librte_eal/linux/eal_memory.c
+++ b/lib/librte_eal/linux/eal_memory.c
@@ -812,20 +812,6 @@ memseg_list_free(struct rte_memseg_list *msl)
 	return 0;
 }
 
-static int
-memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz,
-		int n_segs, int socket_id, int type_msl_idx)
-{
-	return eal_memseg_list_init(
-		msl, page_sz, n_segs, socket_id, type_msl_idx, true);
-}
-
-static int
-memseg_list_alloc(struct rte_memseg_list *msl)
-{
-	return eal_memseg_list_alloc(msl, 0);
-}
-
 /*
  * Our VA space is not preallocated yet, so preallocate it here. We need to know
  * how many segments there are in order to map all pages into one address space,
@@ -969,12 +955,12 @@ prealloc_segments(struct hugepage_file *hugepages, int n_pages)
 			}
 
 			/* now, allocate fbarray itself */
-			if (memseg_list_init(msl, page_sz, n_segs, socket,
-						msl_idx) < 0)
+			if (eal_memseg_list_init(msl, page_sz, n_segs,
+					socket, msl_idx, true) < 0)
 				return -1;
 
 			/* finally, allocate VA space */
-			if (memseg_list_alloc(msl) < 0) {
+			if (eal_memseg_list_alloc(msl, 0) < 0) {
 				RTE_LOG(ERR, EAL, "Cannot preallocate 0x%"PRIx64"kB hugepages\n",
 					page_sz >> 10);
 				return -1;
@@ -1048,182 +1034,6 @@ remap_needed_hugepages(struct hugepage_file *hugepages, int n_pages)
 	return 0;
 }
 
-__rte_unused /* function is unused on 32-bit builds */
-static inline uint64_t
-get_socket_mem_size(int socket)
-{
-	uint64_t size = 0;
-	unsigned i;
-
-	for (i = 0; i < internal_config.num_hugepage_sizes; i++){
-		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		size += hpi->hugepage_sz * hpi->num_pages[socket];
-	}
-
-	return size;
-}
-
-/*
- * This function is a NUMA-aware equivalent of calc_num_pages.
- * It takes in the list of hugepage sizes and the
- * number of pages thereof, and calculates the best number of
- * pages of each size to fulfill the request for <memory> ram
- */
-static int
-calc_num_pages_per_socket(uint64_t * memory,
-		struct hugepage_info *hp_info,
-		struct hugepage_info *hp_used,
-		unsigned num_hp_info)
-{
-	unsigned socket, j, i = 0;
-	unsigned requested, available;
-	int total_num_pages = 0;
-	uint64_t remaining_mem, cur_mem;
-	uint64_t total_mem = internal_config.memory;
-
-	if (num_hp_info == 0)
-		return -1;
-
-	/* if specific memory amounts per socket weren't requested */
-	if (internal_config.force_sockets == 0) {
-		size_t total_size;
-#ifdef RTE_ARCH_64
-		int cpu_per_socket[RTE_MAX_NUMA_NODES];
-		size_t default_size;
-		unsigned lcore_id;
-
-		/* Compute number of cores per socket */
-		memset(cpu_per_socket, 0, sizeof(cpu_per_socket));
-		RTE_LCORE_FOREACH(lcore_id) {
-			cpu_per_socket[rte_lcore_to_socket_id(lcore_id)]++;
-		}
-
-		/*
-		 * Automatically spread requested memory amongst detected sockets according
-		 * to number of cores from cpu mask present on each socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-
-			/* Set memory amount per socket */
-			default_size = (internal_config.memory * cpu_per_socket[socket])
-					/ rte_lcore_count();
-
-			/* Limit to maximum available memory on socket */
-			default_size = RTE_MIN(default_size, get_socket_mem_size(socket));
-
-			/* Update sizes */
-			memory[socket] = default_size;
-			total_size -= default_size;
-		}
-
-		/*
-		 * If some memory is remaining, try to allocate it by getting all
-		 * available memory from sockets, one after the other
-		 */
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0; socket++) {
-			/* take whatever is available */
-			default_size = RTE_MIN(get_socket_mem_size(socket) - memory[socket],
-					       total_size);
-
-			/* Update sizes */
-			memory[socket] += default_size;
-			total_size -= default_size;
-		}
-#else
-		/* in 32-bit mode, allocate all of the memory only on master
-		 * lcore socket
-		 */
-		total_size = internal_config.memory;
-		for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_size != 0;
-				socket++) {
-			struct rte_config *cfg = rte_eal_get_configuration();
-			unsigned int master_lcore_socket;
-
-			master_lcore_socket =
-				rte_lcore_to_socket_id(cfg->master_lcore);
-
-			if (master_lcore_socket != socket)
-				continue;
-
-			/* Update sizes */
-			memory[socket] = total_size;
-			break;
-		}
-#endif
-	}
-
-	for (socket = 0; socket < RTE_MAX_NUMA_NODES && total_mem != 0; socket++) {
-		/* skips if the memory on specific socket wasn't requested */
-		for (i = 0; i < num_hp_info && memory[socket] != 0; i++){
-			strlcpy(hp_used[i].hugedir, hp_info[i].hugedir,
-				sizeof(hp_used[i].hugedir));
-			hp_used[i].num_pages[socket] = RTE_MIN(
-					memory[socket] / hp_info[i].hugepage_sz,
-					hp_info[i].num_pages[socket]);
-
-			cur_mem = hp_used[i].num_pages[socket] *
-					hp_used[i].hugepage_sz;
-
-			memory[socket] -= cur_mem;
-			total_mem -= cur_mem;
-
-			total_num_pages += hp_used[i].num_pages[socket];
-
-			/* check if we have met all memory requests */
-			if (memory[socket] == 0)
-				break;
-
-			/* check if we have any more pages left at this size, if so
-			 * move on to next size */
-			if (hp_used[i].num_pages[socket] == hp_info[i].num_pages[socket])
-				continue;
-			/* At this point we know that there are more pages available that are
-			 * bigger than the memory we want, so lets see if we can get enough
-			 * from other page sizes.
-			 */
-			remaining_mem = 0;
-			for (j = i+1; j < num_hp_info; j++)
-				remaining_mem += hp_info[j].hugepage_sz *
-				hp_info[j].num_pages[socket];
-
-			/* is there enough other memory, if not allocate another page and quit */
-			if (remaining_mem < memory[socket]){
-				cur_mem = RTE_MIN(memory[socket],
-						hp_info[i].hugepage_sz);
-				memory[socket] -= cur_mem;
-				total_mem -= cur_mem;
-				hp_used[i].num_pages[socket]++;
-				total_num_pages++;
-				break; /* we are done with this socket*/
-			}
-		}
-		/* if we didn't satisfy all memory requirements per socket */
-		if (memory[socket] > 0 &&
-				internal_config.socket_mem[socket] != 0) {
-			/* to prevent icc errors */
-			requested = (unsigned) (internal_config.socket_mem[socket] /
-					0x100000);
-			available = requested -
-					((unsigned) (memory[socket] / 0x100000));
-			RTE_LOG(ERR, EAL, "Not enough memory available on socket %u! "
-					"Requested: %uMB, available: %uMB\n", socket,
-					requested, available);
-			return -1;
-		}
-	}
-
-	/* if we didn't satisfy total memory requirements */
-	if (total_mem > 0) {
-		requested = (unsigned) (internal_config.memory / 0x100000);
-		available = requested - (unsigned) (total_mem / 0x100000);
-		RTE_LOG(ERR, EAL, "Not enough memory available! Requested: %uMB,"
-				" available: %uMB\n", requested, available);
-		return -1;
-	}
-	return total_num_pages;
-}
-
 static inline size_t
 eal_get_hugepage_mem_size(void)
 {
@@ -1529,7 +1339,7 @@ eal_legacy_hugepage_init(void)
 		memory[i] = internal_config.socket_mem[i];
 
 	/* calculate final number of pages */
-	nr_hugepages = calc_num_pages_per_socket(memory,
+	nr_hugepages = eal_dynmem_calc_num_pages_per_socket(memory,
 			internal_config.hugepage_info, used_hp,
 			internal_config.num_hugepage_sizes);
 
@@ -1656,140 +1466,6 @@ eal_legacy_hugepage_init(void)
 	return -1;
 }
 
-static int __rte_unused
-hugepage_count_walk(const struct rte_memseg_list *msl, void *arg)
-{
-	struct hugepage_info *hpi = arg;
-
-	if (msl->page_sz != hpi->hugepage_sz)
-		return 0;
-
-	hpi->num_pages[msl->socket_id] += msl->memseg_arr.len;
-	return 0;
-}
-
-static int
-limits_callback(int socket_id, size_t cur_limit, size_t new_len)
-{
-	RTE_SET_USED(socket_id);
-	RTE_SET_USED(cur_limit);
-	RTE_SET_USED(new_len);
-	return -1;
-}
-
-static int
-eal_hugepage_init(void)
-{
-	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
-	uint64_t memory[RTE_MAX_NUMA_NODES];
-	int hp_sz_idx, socket_id;
-
-	memset(used_hp, 0, sizeof(used_hp));
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int) internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-#ifndef RTE_ARCH_64
-		struct hugepage_info dummy;
-		unsigned int i;
-#endif
-		/* also initialize used_hp hugepage sizes in used_hp */
-		struct hugepage_info *hpi;
-		hpi = &internal_config.hugepage_info[hp_sz_idx];
-		used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz;
-
-#ifndef RTE_ARCH_64
-		/* for 32-bit, limit number of pages on socket to whatever we've
-		 * preallocated, as we cannot allocate more.
-		 */
-		memset(&dummy, 0, sizeof(dummy));
-		dummy.hugepage_sz = hpi->hugepage_sz;
-		if (rte_memseg_list_walk(hugepage_count_walk, &dummy) < 0)
-			return -1;
-
-		for (i = 0; i < RTE_DIM(dummy.num_pages); i++) {
-			hpi->num_pages[i] = RTE_MIN(hpi->num_pages[i],
-					dummy.num_pages[i]);
-		}
-#endif
-	}
-
-	/* make a copy of socket_mem, needed for balanced allocation. */
-	for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++)
-		memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx];
-
-	/* calculate final number of pages */
-	if (calc_num_pages_per_socket(memory,
-			internal_config.hugepage_info, used_hp,
-			internal_config.num_hugepage_sizes) < 0)
-		return -1;
-
-	for (hp_sz_idx = 0;
-			hp_sz_idx < (int)internal_config.num_hugepage_sizes;
-			hp_sz_idx++) {
-		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
-				socket_id++) {
-			struct rte_memseg **pages;
-			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
-			unsigned int num_pages = hpi->num_pages[socket_id];
-			unsigned int num_pages_alloc;
-
-			if (num_pages == 0)
-				continue;
-
-			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %" PRIu64 "M on socket %i\n",
-				num_pages, hpi->hugepage_sz >> 20, socket_id);
-
-			/* we may not be able to allocate all pages in one go,
-			 * because we break up our memory map into multiple
-			 * memseg lists. therefore, try allocating multiple
-			 * times and see if we can get the desired number of
-			 * pages from multiple allocations.
-			 */
-
-			num_pages_alloc = 0;
-			do {
-				int i, cur_pages, needed;
-
-				needed = num_pages - num_pages_alloc;
-
-				pages = malloc(sizeof(*pages) * needed);
-
-				/* do not request exact number of pages */
-				cur_pages = eal_memalloc_alloc_seg_bulk(pages,
-						needed, hpi->hugepage_sz,
-						socket_id, false);
-				if (cur_pages <= 0) {
-					free(pages);
-					return -1;
-				}
-
-				/* mark preallocated pages as unfreeable */
-				for (i = 0; i < cur_pages; i++) {
-					struct rte_memseg *ms = pages[i];
-					ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
-				}
-				free(pages);
-
-				num_pages_alloc += cur_pages;
-			} while (num_pages_alloc != num_pages);
-		}
-	}
-	/* if socket limits were specified, set them */
-	if (internal_config.force_socket_limits) {
-		unsigned int i;
-		for (i = 0; i < RTE_MAX_NUMA_NODES; i++) {
-			uint64_t limit = internal_config.socket_limit[i];
-			if (limit == 0)
-				continue;
-			if (rte_mem_alloc_validator_register("socket-limit",
-					limits_callback, i, limit))
-				RTE_LOG(ERR, EAL, "Failed to register socket limits validator callback\n");
-		}
-	}
-	return 0;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -1948,7 +1624,7 @@ rte_eal_hugepage_init(void)
 {
 	return internal_config.legacy_mem ?
 			eal_legacy_hugepage_init() :
-			eal_hugepage_init();
+			eal_dynmem_hugepage_init();
 }
 
 int
@@ -2127,8 +1803,9 @@ memseg_primary_init_32(void)
 						max_pagesz_mem);
 				n_segs = cur_mem / hugepage_sz;
 
-				if (memseg_list_init(msl, hugepage_sz, n_segs,
-						socket_id, type_msl_idx)) {
+				if (eal_memseg_list_init(msl, hugepage_sz,
+						n_segs, socket_id, type_msl_idx,
+						true)) {
 					/* failing to allocate a memseg list is
 					 * a serious error.
 					 */
@@ -2136,7 +1813,7 @@ memseg_primary_init_32(void)
 					return -1;
 				}
 
-				if (memseg_list_alloc(msl)) {
+				if (eal_memseg_list_alloc(msl, 0)) {
 					/* if we couldn't allocate VA space, we
 					 * can try with smaller page sizes.
 					 */
@@ -2167,185 +1844,7 @@ memseg_primary_init_32(void)
 static int __rte_unused
 memseg_primary_init(void)
 {
-	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
-	struct memtype {
-		uint64_t page_sz;
-		int socket_id;
-	} *memtypes = NULL;
-	int i, hpi_idx, msl_idx, ret = -1; /* fail unless told to succeed */
-	struct rte_memseg_list *msl;
-	uint64_t max_mem, max_mem_per_type;
-	unsigned int max_seglists_per_type;
-	unsigned int n_memtypes, cur_type;
-
-	/* no-huge does not need this at all */
-	if (internal_config.no_hugetlbfs)
-		return 0;
-
-	/*
-	 * figuring out amount of memory we're going to have is a long and very
-	 * involved process. the basic element we're operating with is a memory
-	 * type, defined as a combination of NUMA node ID and page size (so that
-	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
-	 *
-	 * deciding amount of memory going towards each memory type is a
-	 * balancing act between maximum segments per type, maximum memory per
-	 * type, and number of detected NUMA nodes. the goal is to make sure
-	 * each memory type gets at least one memseg list.
-	 *
-	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
-	 *
-	 * the total amount of memory per type is limited by either
-	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
-	 * of detected NUMA nodes. additionally, maximum number of segments per
-	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
-	 * smaller page sizes, it can take hundreds of thousands of segments to
-	 * reach the above specified per-type memory limits.
-	 *
-	 * additionally, each type may have multiple memseg lists associated
-	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
-	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
-	 *
-	 * the number of memseg lists per type is decided based on the above
-	 * limits, and also taking number of detected NUMA nodes, to make sure
-	 * that we don't run out of memseg lists before we populate all NUMA
-	 * nodes with memory.
-	 *
-	 * we do this in three stages. first, we collect the number of types.
-	 * then, we figure out memory constraints and populate the list of
-	 * would-be memseg lists. then, we go ahead and allocate the memseg
-	 * lists.
-	 */
-
-	/* create space for mem types */
-	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
-	memtypes = calloc(n_memtypes, sizeof(*memtypes));
-	if (memtypes == NULL) {
-		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
-		return -1;
-	}
-
-	/* populate mem types */
-	cur_type = 0;
-	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
-			hpi_idx++) {
-		struct hugepage_info *hpi;
-		uint64_t hugepage_sz;
-
-		hpi = &internal_config.hugepage_info[hpi_idx];
-		hugepage_sz = hpi->hugepage_sz;
-
-		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
-			int socket_id = rte_socket_id_by_idx(i);
-
-#ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
-			/* we can still sort pages by socket in legacy mode */
-			if (!internal_config.legacy_mem && socket_id > 0)
-				break;
-#endif
-			memtypes[cur_type].page_sz = hugepage_sz;
-			memtypes[cur_type].socket_id = socket_id;
-
-			RTE_LOG(DEBUG, EAL, "Detected memory type: "
-				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
-				socket_id, hugepage_sz);
-		}
-	}
-	/* number of memtypes could have been lower due to no NUMA support */
-	n_memtypes = cur_type;
-
-	/* set up limits for types */
-	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
-	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
-			max_mem / n_memtypes);
-	/*
-	 * limit maximum number of segment lists per type to ensure there's
-	 * space for memseg lists for all NUMA nodes with all page sizes
-	 */
-	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
-
-	if (max_seglists_per_type == 0) {
-		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
-			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-		goto out;
-	}
-
-	/* go through all mem types and create segment lists */
-	msl_idx = 0;
-	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
-		unsigned int cur_seglist, n_seglists, n_segs;
-		unsigned int max_segs_per_type, max_segs_per_list;
-		struct memtype *type = &memtypes[cur_type];
-		uint64_t max_mem_per_list, pagesz;
-		int socket_id;
-
-		pagesz = type->page_sz;
-		socket_id = type->socket_id;
-
-		/*
-		 * we need to create segment lists for this type. we must take
-		 * into account the following things:
-		 *
-		 * 1. total amount of memory we can use for this memory type
-		 * 2. total amount of memory per memseg list allowed
-		 * 3. number of segments needed to fit the amount of memory
-		 * 4. number of segments allowed per type
-		 * 5. number of segments allowed per memseg list
-		 * 6. number of memseg lists we are allowed to take up
-		 */
-
-		/* calculate how much segments we will need in total */
-		max_segs_per_type = max_mem_per_type / pagesz;
-		/* limit number of segments to maximum allowed per type */
-		max_segs_per_type = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
-		/* limit number of segments to maximum allowed per list */
-		max_segs_per_list = RTE_MIN(max_segs_per_type,
-				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
-
-		/* calculate how much memory we can have per segment list */
-		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
-				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
-
-		/* calculate how many segments each segment list will have */
-		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
-
-		/* calculate how many segment lists we can have */
-		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
-				max_mem_per_type / max_mem_per_list);
-
-		/* limit number of segment lists according to our maximum */
-		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
-
-		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
-				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
-			n_seglists, n_segs, socket_id, pagesz);
-
-		/* create all segment lists */
-		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
-			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
-				RTE_LOG(ERR, EAL,
-					"No more space in memseg lists, please increase %s\n",
-					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
-				goto out;
-			}
-			msl = &mcfg->memsegs[msl_idx++];
-
-			if (memseg_list_init(msl, pagesz, n_segs,
-					socket_id, cur_seglist))
-				goto out;
-
-			if (memseg_list_alloc(msl)) {
-				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
-				goto out;
-			}
-		}
-	}
-	/* we're successful */
-	ret = 0;
-out:
-	free(memtypes);
-	return ret;
+	return eal_dynmem_memseg_lists_init();
 }
 
 static int
@@ -2369,7 +1868,7 @@ memseg_secondary_init(void)
 		}
 
 		/* preallocate VA space */
-		if (memseg_list_alloc(msl)) {
+		if (eal_memseg_list_alloc(msl, 0)) {
 			RTE_LOG(ERR, EAL, "Cannot preallocate VA space for hugepage memory\n");
 			return -1;
 		}
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 06/12] trace: add size_t field emitter
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (4 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 05/12] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 07/12] eal/windows: add tracing support stubs Dmitry Kozlyuk
                                 ` (7 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Jerin Jacob, Sunil Kumar Kori,
	Olivier Matz, Andrew Rybchenko

It is not guaranteed that sizeof(long) == sizeof(size_t). On Windows,
sizeof(long) == 4 and sizeof(size_t) == 8 for 64-bit programs.
Tracepoints using the "long" field emitter are therefore invalid there.
Add a dedicated field emitter for size_t and use it to store size_t
values in all existing tracepoints.
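
For illustration, a tracepoint carrying a size_t value uses the new
emitter the same way as the existing ones below (a minimal sketch;
app_trace_resize is a hypothetical tracepoint name, not part of this
patch):

	RTE_TRACE_POINT(
		app_trace_resize,
		RTE_TRACE_POINT_ARGS(size_t size),
		rte_trace_point_emit_size_t(size);
	)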

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/include/rte_eal_trace.h   |  8 ++++----
 lib/librte_eal/include/rte_trace_point.h |  3 +++
 lib/librte_mempool/rte_mempool_trace.h   | 10 +++++-----
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/include/rte_eal_trace.h b/lib/librte_eal/include/rte_eal_trace.h
index 1ebb2905a..bcfef0cfa 100644
--- a/lib/librte_eal/include/rte_eal_trace.h
+++ b/lib/librte_eal/include/rte_eal_trace.h
@@ -143,7 +143,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -154,7 +154,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(const char *type, size_t size, unsigned int align,
 		int socket, void *ptr),
 	rte_trace_point_emit_string(type);
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -164,7 +164,7 @@ RTE_TRACE_POINT(
 	rte_eal_trace_mem_realloc,
 	RTE_TRACE_POINT_ARGS(size_t size, unsigned int align, int socket,
 		void *ptr),
-	rte_trace_point_emit_long(size);
+	rte_trace_point_emit_size_t(size);
 	rte_trace_point_emit_u32(align);
 	rte_trace_point_emit_int(socket);
 	rte_trace_point_emit_ptr(ptr);
@@ -183,7 +183,7 @@ RTE_TRACE_POINT(
 		unsigned int flags, unsigned int align, unsigned int bound,
 		const void *mz),
 	rte_trace_point_emit_string(name);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_int(socket_id);
 	rte_trace_point_emit_u32(flags);
 	rte_trace_point_emit_u32(align);
diff --git a/lib/librte_eal/include/rte_trace_point.h b/lib/librte_eal/include/rte_trace_point.h
index b45171275..377c2414a 100644
--- a/lib/librte_eal/include/rte_trace_point.h
+++ b/lib/librte_eal/include/rte_trace_point.h
@@ -138,6 +138,8 @@ _tp _args \
 #define rte_trace_point_emit_int(val)
 /** Tracepoint function payload for long datatype */
 #define rte_trace_point_emit_long(val)
+/** Tracepoint function payload for size_t datatype */
+#define rte_trace_point_emit_size_t(val)
 /** Tracepoint function payload for float datatype */
 #define rte_trace_point_emit_float(val)
 /** Tracepoint function payload for double datatype */
@@ -395,6 +397,7 @@ do { \
 #define rte_trace_point_emit_i8(in) __rte_trace_point_emit(in, int8_t)
 #define rte_trace_point_emit_int(in) __rte_trace_point_emit(in, int32_t)
 #define rte_trace_point_emit_long(in) __rte_trace_point_emit(in, long)
+#define rte_trace_point_emit_size_t(in) __rte_trace_point_emit(in, size_t)
 #define rte_trace_point_emit_float(in) __rte_trace_point_emit(in, float)
 #define rte_trace_point_emit_double(in) __rte_trace_point_emit(in, double)
 #define rte_trace_point_emit_ptr(in) __rte_trace_point_emit(in, uintptr_t)
diff --git a/lib/librte_mempool/rte_mempool_trace.h b/lib/librte_mempool/rte_mempool_trace.h
index e776df0a6..087c913c8 100644
--- a/lib/librte_mempool/rte_mempool_trace.h
+++ b/lib/librte_mempool/rte_mempool_trace.h
@@ -72,7 +72,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -84,8 +84,8 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
 	rte_trace_point_emit_ptr(addr);
-	rte_trace_point_emit_long(len);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(len);
+	rte_trace_point_emit_size_t(pg_sz);
 	rte_trace_point_emit_ptr(free_cb);
 	rte_trace_point_emit_ptr(opaque);
 )
@@ -126,7 +126,7 @@ RTE_TRACE_POINT(
 	RTE_TRACE_POINT_ARGS(struct rte_mempool *mempool, size_t pg_sz),
 	rte_trace_point_emit_ptr(mempool);
 	rte_trace_point_emit_string(mempool->name);
-	rte_trace_point_emit_long(pg_sz);
+	rte_trace_point_emit_size_t(pg_sz);
 )
 
 RTE_TRACE_POINT(
@@ -139,7 +139,7 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_u32(max_objs);
 	rte_trace_point_emit_ptr(vaddr);
 	rte_trace_point_emit_u64(iova);
-	rte_trace_point_emit_long(len);
+	rte_trace_point_emit_size_t(len);
 	rte_trace_point_emit_ptr(obj_cb);
 	rte_trace_point_emit_ptr(obj_cb_arg);
 )
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 07/12] eal/windows: add tracing support stubs
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (5 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 06/12] trace: add size_t field emitter Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 08/12] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
                                 ` (6 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

EAL common code depends on tracepoint calls, but the generic
implementation cannot be enabled on Windows due to missing standard
library facilities.
Add stub functions to support tracepoint compilation, so that common
code does not have to conditionally include tracepoints until proper
support is added.
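
As a sketch of what the stubs guarantee to callers (illustrative code,
not part of this patch; these are internal EAL functions):

	#include <stdio.h>
	#include <rte_trace_point.h>

	static rte_trace_point_t handle;

	static void
	register_fn(void)
	{
		/* Would describe tracepoint fields on supported platforms. */
	}

	static void
	probe_tracing(void)
	{
		/* Safe no-op with the Windows stub. */
		__rte_trace_mem_per_thread_alloc();

		/* The stub always reports lack of support. */
		if (__rte_trace_point_register(&handle, "app.example",
				register_fn) < 0)
			printf("tracing not supported on this platform\n");
	}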

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/common/eal_common_thread.c |  5 +---
 lib/librte_eal/common/meson.build         |  1 +
 lib/librte_eal/windows/eal.c              | 34 ++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_thread.c b/lib/librte_eal/common/eal_common_thread.c
index f9f588c17..370bb1b63 100644
--- a/lib/librte_eal/common/eal_common_thread.c
+++ b/lib/librte_eal/common/eal_common_thread.c
@@ -15,9 +15,7 @@
 #include <rte_lcore.h>
 #include <rte_memory.h>
 #include <rte_log.h>
-#ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_trace_point.h>
-#endif
 
 #include "eal_internal_cfg.h"
 #include "eal_private.h"
@@ -169,9 +167,8 @@ static void *rte_thread_init(void *arg)
 		free(params);
 	}
 
-#ifndef RTE_EXEC_ENV_WINDOWS
 	__rte_trace_mem_per_thread_alloc();
-#endif
+
 	return start_routine(routine_arg);
 }
 
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index d91c22220..4e9208129 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -14,6 +14,7 @@ if is_windows
 		'eal_common_log.c',
 		'eal_common_options.c',
 		'eal_common_thread.c',
+		'eal_common_trace_points.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index d084606a6..e7461f731 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -17,6 +17,7 @@
 #include <eal_filesystem.h>
 #include <eal_options.h>
 #include <eal_private.h>
+#include <rte_trace_point.h>
 
 #include "eal_windows.h"
 
@@ -221,7 +222,38 @@ rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
- /* Launch threads, called at application init(). */
+/* Stubs to enable EAL trace point compilation
+ * until eal_common_trace.c can be compiled.
+ */
+
+RTE_DEFINE_PER_LCORE(volatile int, trace_point_sz);
+RTE_DEFINE_PER_LCORE(void *, trace_mem);
+
+void
+__rte_trace_mem_per_thread_alloc(void)
+{
+}
+
+void
+__rte_trace_point_emit_field(size_t sz, const char *field,
+	const char *type)
+{
+	RTE_SET_USED(sz);
+	RTE_SET_USED(field);
+	RTE_SET_USED(type);
+}
+
+int
+__rte_trace_point_register(rte_trace_point_t *trace, const char *name,
+	void (*register_fn)(void))
+{
+	RTE_SET_USED(trace);
+	RTE_SET_USED(name);
+	RTE_SET_USED(register_fn);
+	return -ENOTSUP;
+}
+
+/* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 08/12] eal/windows: replace sys/queue.h with a complete one from FreeBSD
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (6 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 07/12] eal/windows: add tracing support stubs Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
                                 ` (5 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

The limited version imported previously lacks at least the SLIST macros.
Import the complete file from FreeBSD, since its license exception has
already been approved by the Technical Board.
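
For context, the previously missing SLIST macros enable code like the
following sketch (struct entry and the helper are illustrative, not
part of this patch):

	#include <sys/queue.h>

	struct entry {
		int value;
		SLIST_ENTRY(entry) link;
	};

	static SLIST_HEAD(entry_list, entry) head =
		SLIST_HEAD_INITIALIZER(head);

	static void
	bump_all(struct entry *e)
	{
		struct entry *it;

		SLIST_INSERT_HEAD(&head, e, link);
		SLIST_FOREACH(it, &head, link)
			it->value++;
	}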

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/include/sys/queue.h | 663 +++++++++++++++++++--
 1 file changed, 601 insertions(+), 62 deletions(-)

diff --git a/lib/librte_eal/windows/include/sys/queue.h b/lib/librte_eal/windows/include/sys/queue.h
index a65949a78..9756bee6f 100644
--- a/lib/librte_eal/windows/include/sys/queue.h
+++ b/lib/librte_eal/windows/include/sys/queue.h
@@ -8,7 +8,36 @@
 #define	_SYS_QUEUE_H_
 
 /*
- * This file defines tail queues.
+ * This file defines four types of data structures: singly-linked lists,
+ * singly-linked tail queues, lists and tail queues.
+ *
+ * A singly-linked list is headed by a single forward pointer. The elements
+ * are singly linked for minimum space and pointer manipulation overhead at
+ * the expense of O(n) removal for arbitrary elements. New elements can be
+ * added to the list after an existing element or at the head of the list.
+ * Elements being removed from the head of the list should use the explicit
+ * macro for this purpose for optimum efficiency. A singly-linked list may
+ * only be traversed in the forward direction.  Singly-linked lists are ideal
+ * for applications with large datasets and few or no removals or for
+ * implementing a LIFO queue.
+ *
+ * A singly-linked tail queue is headed by a pair of pointers, one to the
+ * head of the list and the other to the tail of the list. The elements are
+ * singly linked for minimum space and pointer manipulation overhead at the
+ * expense of O(n) removal for arbitrary elements. New elements can be added
+ * to the list after an existing element, at the head of the list, or at the
+ * end of the list. Elements being removed from the head of the tail queue
+ * should use the explicit macro for this purpose for optimum efficiency.
+ * A singly-linked tail queue may only be traversed in the forward direction.
+ * Singly-linked tail queues are ideal for applications with large datasets
+ * and few or no removals or for implementing a FIFO queue.
+ *
+ * A list is headed by a single forward pointer (or an array of forward
+ * pointers for a hash table header). The elements are doubly linked
+ * so that an arbitrary element can be removed without a need to
+ * traverse the list. New elements can be added to the list before
+ * or after an existing element or at the head of the list. A list
+ * may be traversed in either direction.
  *
  * A tail queue is headed by a pair of pointers, one to the head of the
  * list and the other to the tail of the list. The elements are doubly
@@ -17,65 +46,93 @@
  * after an existing element, at the head of the list, or at the end of
  * the list. A tail queue may be traversed in either direction.
  *
+ * For details on the use of these macros, see the queue(3) manual page.
+ *
  * Below is a summary of implemented functions where:
  *  +  means the macro is available
  *  -  means the macro is not available
  *  s  means the macro is available but is slow (runs in O(n) time)
  *
- *				TAILQ
- * _HEAD			+
- * _CLASS_HEAD			+
- * _HEAD_INITIALIZER		+
- * _ENTRY			+
- * _CLASS_ENTRY			+
- * _INIT			+
- * _EMPTY			+
- * _FIRST			+
- * _NEXT			+
- * _PREV			+
- * _LAST			+
- * _LAST_FAST			+
- * _FOREACH			+
- * _FOREACH_FROM		+
- * _FOREACH_SAFE		+
- * _FOREACH_FROM_SAFE		+
- * _FOREACH_REVERSE		+
- * _FOREACH_REVERSE_FROM	+
- * _FOREACH_REVERSE_SAFE	+
- * _FOREACH_REVERSE_FROM_SAFE	+
- * _INSERT_HEAD			+
- * _INSERT_BEFORE		+
- * _INSERT_AFTER		+
- * _INSERT_TAIL			+
- * _CONCAT			+
- * _REMOVE_AFTER		-
- * _REMOVE_HEAD			-
- * _REMOVE			+
- * _SWAP			+
+ *				SLIST	LIST	STAILQ	TAILQ
+ * _HEAD			+	+	+	+
+ * _CLASS_HEAD			+	+	+	+
+ * _HEAD_INITIALIZER		+	+	+	+
+ * _ENTRY			+	+	+	+
+ * _CLASS_ENTRY			+	+	+	+
+ * _INIT			+	+	+	+
+ * _EMPTY			+	+	+	+
+ * _FIRST			+	+	+	+
+ * _NEXT			+	+	+	+
+ * _PREV			-	+	-	+
+ * _LAST			-	-	+	+
+ * _LAST_FAST			-	-	-	+
+ * _FOREACH			+	+	+	+
+ * _FOREACH_FROM		+	+	+	+
+ * _FOREACH_SAFE		+	+	+	+
+ * _FOREACH_FROM_SAFE		+	+	+	+
+ * _FOREACH_REVERSE		-	-	-	+
+ * _FOREACH_REVERSE_FROM	-	-	-	+
+ * _FOREACH_REVERSE_SAFE	-	-	-	+
+ * _FOREACH_REVERSE_FROM_SAFE	-	-	-	+
+ * _INSERT_HEAD			+	+	+	+
+ * _INSERT_BEFORE		-	+	-	+
+ * _INSERT_AFTER		+	+	+	+
+ * _INSERT_TAIL			-	-	+	+
+ * _CONCAT			s	s	+	+
+ * _REMOVE_AFTER		+	-	+	-
+ * _REMOVE_HEAD			+	-	+	-
+ * _REMOVE			s	+	s	+
+ * _SWAP			+	+	+	+
  *
  */
-
-#ifdef __cplusplus
-extern "C" {
+#ifdef QUEUE_MACRO_DEBUG
+#warn Use QUEUE_MACRO_DEBUG_TRACE and/or QUEUE_MACRO_DEBUG_TRASH
+#define	QUEUE_MACRO_DEBUG_TRACE
+#define	QUEUE_MACRO_DEBUG_TRASH
 #endif
 
-/*
- * List definitions.
- */
-#define	LIST_HEAD(name, type)						\
-struct name {								\
-	struct type *lh_first;	/* first element */			\
-}
+#ifdef QUEUE_MACRO_DEBUG_TRACE
+/* Store the last 2 places the queue element or head was altered */
+struct qm_trace {
+	unsigned long	 lastline;
+	unsigned long	 prevline;
+	const char	*lastfile;
+	const char	*prevfile;
+};
+
+#define	TRACEBUF	struct qm_trace trace;
+#define	TRACEBUF_INITIALIZER	{ __LINE__, 0, __FILE__, NULL } ,
+
+#define	QMD_TRACE_HEAD(head) do {					\
+	(head)->trace.prevline = (head)->trace.lastline;		\
+	(head)->trace.prevfile = (head)->trace.lastfile;		\
+	(head)->trace.lastline = __LINE__;				\
+	(head)->trace.lastfile = __FILE__;				\
+} while (0)
 
+#define	QMD_TRACE_ELEM(elem) do {					\
+	(elem)->trace.prevline = (elem)->trace.lastline;		\
+	(elem)->trace.prevfile = (elem)->trace.lastfile;		\
+	(elem)->trace.lastline = __LINE__;				\
+	(elem)->trace.lastfile = __FILE__;				\
+} while (0)
+
+#else	/* !QUEUE_MACRO_DEBUG_TRACE */
 #define	QMD_TRACE_ELEM(elem)
 #define	QMD_TRACE_HEAD(head)
 #define	TRACEBUF
 #define	TRACEBUF_INITIALIZER
+#endif	/* QUEUE_MACRO_DEBUG_TRACE */
 
+#ifdef QUEUE_MACRO_DEBUG_TRASH
+#define	QMD_SAVELINK(name, link)	void **name = (void *)&(link)
+#define	TRASHIT(x)		do {(x) = (void *)-1;} while (0)
+#define	QMD_IS_TRASHED(x)	((x) == (void *)(intptr_t)-1)
+#else	/* !QUEUE_MACRO_DEBUG_TRASH */
+#define	QMD_SAVELINK(name, link)
 #define	TRASHIT(x)
 #define	QMD_IS_TRASHED(x)	0
-
-#define	QMD_SAVELINK(name, link)
+#endif	/* QUEUE_MACRO_DEBUG_TRASH */
 
 #ifdef __cplusplus
 /*
@@ -86,6 +143,445 @@ struct name {								\
 #define	QUEUE_TYPEOF(type) struct type
 #endif
 
+/*
+ * Singly-linked List declarations.
+ */
+#define	SLIST_HEAD(name, type)						\
+struct name {								\
+	struct type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *slh_first;	/* first element */			\
+}
+
+#define	SLIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	SLIST_ENTRY(type)						\
+struct {								\
+	struct type *sle_next;	/* next element */			\
+}
+
+#define	SLIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *sle_next;		/* next element */		\
+}
+
+/*
+ * Singly-linked List functions.
+ */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm) do {			\
+	if (*(prevp) != (elm))						\
+		panic("Bad prevptr *(%p) == %p != %p",			\
+		    (prevp), *(prevp), (elm));				\
+} while (0)
+#else
+#define	QMD_SLIST_CHECK_PREVPTR(prevp, elm)
+#endif
+
+#define SLIST_CONCAT(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head1);		\
+	if (curelm == NULL) {						\
+		if ((SLIST_FIRST(head1) = SLIST_FIRST(head2)) != NULL)	\
+			SLIST_INIT(head2);				\
+	} else if (SLIST_FIRST(head2) != NULL) {			\
+		while (SLIST_NEXT(curelm, field) != NULL)		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_NEXT(curelm, field) = SLIST_FIRST(head2);		\
+		SLIST_INIT(head2);					\
+	}								\
+} while (0)
+
+#define	SLIST_EMPTY(head)	((head)->slh_first == NULL)
+
+#define	SLIST_FIRST(head)	((head)->slh_first)
+
+#define	SLIST_FOREACH(var, head, field)					\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = SLIST_NEXT((var), field))
+
+#define	SLIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = SLIST_FIRST((head));				\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : SLIST_FIRST((head)));		\
+	    (var) && ((tvar) = SLIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	SLIST_FOREACH_PREVPTR(var, varp, head, field)			\
+	for ((varp) = &SLIST_FIRST((head));				\
+	    ((var) = *(varp)) != NULL;					\
+	    (varp) = &SLIST_NEXT((var), field))
+
+#define	SLIST_INIT(head) do {						\
+	SLIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	SLIST_INSERT_AFTER(slistelm, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_NEXT((slistelm), field);	\
+	SLIST_NEXT((slistelm), field) = (elm);				\
+} while (0)
+
+#define	SLIST_INSERT_HEAD(head, elm, field) do {			\
+	SLIST_NEXT((elm), field) = SLIST_FIRST((head));			\
+	SLIST_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	SLIST_NEXT(elm, field)	((elm)->field.sle_next)
+
+#define	SLIST_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.sle_next);			\
+	if (SLIST_FIRST((head)) == (elm)) {				\
+		SLIST_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = SLIST_FIRST(head);		\
+		while (SLIST_NEXT(curelm, field) != (elm))		\
+			curelm = SLIST_NEXT(curelm, field);		\
+		SLIST_REMOVE_AFTER(curelm, field);			\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define SLIST_REMOVE_AFTER(elm, field) do {				\
+	SLIST_NEXT(elm, field) =					\
+	    SLIST_NEXT(SLIST_NEXT(elm, field), field);			\
+} while (0)
+
+#define	SLIST_REMOVE_HEAD(head, field) do {				\
+	SLIST_FIRST((head)) = SLIST_NEXT(SLIST_FIRST((head)), field);	\
+} while (0)
+
+#define	SLIST_REMOVE_PREVPTR(prevp, elm, field) do {			\
+	QMD_SLIST_CHECK_PREVPTR(prevp, elm);				\
+	*(prevp) = SLIST_NEXT(elm, field);				\
+	TRASHIT((elm)->field.sle_next);					\
+} while (0)
+
+#define SLIST_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = SLIST_FIRST(head1);		\
+	SLIST_FIRST(head1) = SLIST_FIRST(head2);			\
+	SLIST_FIRST(head2) = swap_first;				\
+} while (0)
+
+/*
+ * Singly-linked Tail queue declarations.
+ */
+#define	STAILQ_HEAD(name, type)						\
+struct name {								\
+	struct type *stqh_first;/* first element */			\
+	struct type **stqh_last;/* addr of last next element */		\
+}
+
+#define	STAILQ_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *stqh_first;	/* first element */			\
+	class type **stqh_last;	/* addr of last next element */		\
+}
+
+#define	STAILQ_HEAD_INITIALIZER(head)					\
+	{ NULL, &(head).stqh_first }
+
+#define	STAILQ_ENTRY(type)						\
+struct {								\
+	struct type *stqe_next;	/* next element */			\
+}
+
+#define	STAILQ_CLASS_ENTRY(type)					\
+struct {								\
+	class type *stqe_next;	/* next element */			\
+}
+
+/*
+ * Singly-linked Tail queue functions.
+ */
+#define	STAILQ_CONCAT(head1, head2) do {				\
+	if (!STAILQ_EMPTY((head2))) {					\
+		*(head1)->stqh_last = (head2)->stqh_first;		\
+		(head1)->stqh_last = (head2)->stqh_last;		\
+		STAILQ_INIT((head2));					\
+	}								\
+} while (0)
+
+#define	STAILQ_EMPTY(head)	((head)->stqh_first == NULL)
+
+#define	STAILQ_FIRST(head)	((head)->stqh_first)
+
+#define	STAILQ_FOREACH(var, head, field)				\
+	for((var) = STAILQ_FIRST((head));				\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	   (var);							\
+	   (var) = STAILQ_NEXT((var), field))
+
+#define	STAILQ_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = STAILQ_FIRST((head));				\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_FOREACH_FROM_SAFE(var, head, field, tvar)		\
+	for ((var) = ((var) ? (var) : STAILQ_FIRST((head)));		\
+	    (var) && ((tvar) = STAILQ_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	STAILQ_INIT(head) do {						\
+	STAILQ_FIRST((head)) = NULL;					\
+	(head)->stqh_last = &STAILQ_FIRST((head));			\
+} while (0)
+
+#define	STAILQ_INSERT_AFTER(head, tqelm, elm, field) do {		\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_NEXT((tqelm), field)) == NULL)\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_NEXT((tqelm), field) = (elm);				\
+} while (0)
+
+#define	STAILQ_INSERT_HEAD(head, elm, field) do {			\
+	if ((STAILQ_NEXT((elm), field) = STAILQ_FIRST((head))) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+	STAILQ_FIRST((head)) = (elm);					\
+} while (0)
+
+#define	STAILQ_INSERT_TAIL(head, elm, field) do {			\
+	STAILQ_NEXT((elm), field) = NULL;				\
+	*(head)->stqh_last = (elm);					\
+	(head)->stqh_last = &STAILQ_NEXT((elm), field);			\
+} while (0)
+
+#define	STAILQ_LAST(head, type, field)				\
+	(STAILQ_EMPTY((head)) ? NULL :				\
+	    __containerof((head)->stqh_last,			\
+	    QUEUE_TYPEOF(type), field.stqe_next))
+
+#define	STAILQ_NEXT(elm, field)	((elm)->field.stqe_next)
+
+#define	STAILQ_REMOVE(head, elm, type, field) do {			\
+	QMD_SAVELINK(oldnext, (elm)->field.stqe_next);			\
+	if (STAILQ_FIRST((head)) == (elm)) {				\
+		STAILQ_REMOVE_HEAD((head), field);			\
+	}								\
+	else {								\
+		QUEUE_TYPEOF(type) *curelm = STAILQ_FIRST(head);	\
+		while (STAILQ_NEXT(curelm, field) != (elm))		\
+			curelm = STAILQ_NEXT(curelm, field);		\
+		STAILQ_REMOVE_AFTER(head, curelm, field);		\
+	}								\
+	TRASHIT(*oldnext);						\
+} while (0)
+
+#define STAILQ_REMOVE_AFTER(head, elm, field) do {			\
+	if ((STAILQ_NEXT(elm, field) =					\
+	     STAILQ_NEXT(STAILQ_NEXT(elm, field), field)) == NULL)	\
+		(head)->stqh_last = &STAILQ_NEXT((elm), field);		\
+} while (0)
+
+#define	STAILQ_REMOVE_HEAD(head, field) do {				\
+	if ((STAILQ_FIRST((head)) =					\
+	     STAILQ_NEXT(STAILQ_FIRST((head)), field)) == NULL)		\
+		(head)->stqh_last = &STAILQ_FIRST((head));		\
+} while (0)
+
+#define STAILQ_SWAP(head1, head2, type) do {				\
+	QUEUE_TYPEOF(type) *swap_first = STAILQ_FIRST(head1);		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->stqh_last;		\
+	STAILQ_FIRST(head1) = STAILQ_FIRST(head2);			\
+	(head1)->stqh_last = (head2)->stqh_last;			\
+	STAILQ_FIRST(head2) = swap_first;				\
+	(head2)->stqh_last = swap_last;					\
+	if (STAILQ_EMPTY(head1))					\
+		(head1)->stqh_last = &STAILQ_FIRST(head1);		\
+	if (STAILQ_EMPTY(head2))					\
+		(head2)->stqh_last = &STAILQ_FIRST(head2);		\
+} while (0)
+
+
+/*
+ * List declarations.
+ */
+#define	LIST_HEAD(name, type)						\
+struct name {								\
+	struct type *lh_first;	/* first element */			\
+}
+
+#define	LIST_CLASS_HEAD(name, type)					\
+struct name {								\
+	class type *lh_first;	/* first element */			\
+}
+
+#define	LIST_HEAD_INITIALIZER(head)					\
+	{ NULL }
+
+#define	LIST_ENTRY(type)						\
+struct {								\
+	struct type *le_next;	/* next element */			\
+	struct type **le_prev;	/* address of previous next element */	\
+}
+
+#define	LIST_CLASS_ENTRY(type)						\
+struct {								\
+	class type *le_next;	/* next element */			\
+	class type **le_prev;	/* address of previous next element */	\
+}
+
+/*
+ * List functions.
+ */
+
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_LIST_CHECK_HEAD(LIST_HEAD *head, LIST_ENTRY NAME)
+ *
+ * If the list is non-empty, validates that the first element of the list
+ * points back at 'head.'
+ */
+#define	QMD_LIST_CHECK_HEAD(head, field) do {				\
+	if (LIST_FIRST((head)) != NULL &&				\
+	    LIST_FIRST((head))->field.le_prev !=			\
+	     &LIST_FIRST((head)))					\
+		panic("Bad list head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_NEXT(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the list, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_LIST_CHECK_NEXT(elm, field) do {				\
+	if (LIST_NEXT((elm), field) != NULL &&				\
+	    LIST_NEXT((elm), field)->field.le_prev !=			\
+	     &((elm)->field.le_next))					\
+	     	panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_LIST_CHECK_PREV(TYPE *elm, LIST_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the list) points to 'elm.'
+ */
+#define	QMD_LIST_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.le_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
+#define	QMD_LIST_CHECK_HEAD(head, field)
+#define	QMD_LIST_CHECK_NEXT(elm, field)
+#define	QMD_LIST_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
+
+#define LIST_CONCAT(head1, head2, type, field) do {			      \
+	QUEUE_TYPEOF(type) *curelm = LIST_FIRST(head1);			      \
+	if (curelm == NULL) {						      \
+		if ((LIST_FIRST(head1) = LIST_FIRST(head2)) != NULL) {	      \
+			LIST_FIRST(head2)->field.le_prev =		      \
+			    &LIST_FIRST((head1));			      \
+			LIST_INIT(head2);				      \
+		}							      \
+	} else if (LIST_FIRST(head2) != NULL) {				      \
+		while (LIST_NEXT(curelm, field) != NULL)		      \
+			curelm = LIST_NEXT(curelm, field);		      \
+		LIST_NEXT(curelm, field) = LIST_FIRST(head2);		      \
+		LIST_FIRST(head2)->field.le_prev = &LIST_NEXT(curelm, field); \
+		LIST_INIT(head2);					      \
+	}								      \
+} while (0)
+
+#define	LIST_EMPTY(head)	((head)->lh_first == NULL)
+
+#define	LIST_FIRST(head)	((head)->lh_first)
+
+#define	LIST_FOREACH(var, head, field)					\
+	for ((var) = LIST_FIRST((head));				\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_FROM(var, head, field)				\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var);							\
+	    (var) = LIST_NEXT((var), field))
+
+#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = LIST_FIRST((head));				\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_FOREACH_FROM_SAFE(var, head, field, tvar)			\
+	for ((var) = ((var) ? (var) : LIST_FIRST((head)));		\
+	    (var) && ((tvar) = LIST_NEXT((var), field), 1);		\
+	    (var) = (tvar))
+
+#define	LIST_INIT(head) do {						\
+	LIST_FIRST((head)) = NULL;					\
+} while (0)
+
+#define	LIST_INSERT_AFTER(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_NEXT(listelm, field);				\
+	if ((LIST_NEXT((elm), field) = LIST_NEXT((listelm), field)) != NULL)\
+		LIST_NEXT((listelm), field)->field.le_prev =		\
+		    &LIST_NEXT((elm), field);				\
+	LIST_NEXT((listelm), field) = (elm);				\
+	(elm)->field.le_prev = &LIST_NEXT((listelm), field);		\
+} while (0)
+
+#define	LIST_INSERT_BEFORE(listelm, elm, field) do {			\
+	QMD_LIST_CHECK_PREV(listelm, field);				\
+	(elm)->field.le_prev = (listelm)->field.le_prev;		\
+	LIST_NEXT((elm), field) = (listelm);				\
+	*(listelm)->field.le_prev = (elm);				\
+	(listelm)->field.le_prev = &LIST_NEXT((elm), field);		\
+} while (0)
+
+#define	LIST_INSERT_HEAD(head, elm, field) do {				\
+	QMD_LIST_CHECK_HEAD((head), field);				\
+	if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL)	\
+		LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
+	LIST_FIRST((head)) = (elm);					\
+	(elm)->field.le_prev = &LIST_FIRST((head));			\
+} while (0)
+
+#define	LIST_NEXT(elm, field)	((elm)->field.le_next)
+
+#define	LIST_PREV(elm, head, type, field)			\
+	((elm)->field.le_prev == &LIST_FIRST((head)) ? NULL :	\
+	    __containerof((elm)->field.le_prev,			\
+	    QUEUE_TYPEOF(type), field.le_next))
+
+#define	LIST_REMOVE(elm, field) do {					\
+	QMD_SAVELINK(oldnext, (elm)->field.le_next);			\
+	QMD_SAVELINK(oldprev, (elm)->field.le_prev);			\
+	QMD_LIST_CHECK_NEXT(elm, field);				\
+	QMD_LIST_CHECK_PREV(elm, field);				\
+	if (LIST_NEXT((elm), field) != NULL)				\
+		LIST_NEXT((elm), field)->field.le_prev = 		\
+		    (elm)->field.le_prev;				\
+	*(elm)->field.le_prev = LIST_NEXT((elm), field);		\
+	TRASHIT(*oldnext);						\
+	TRASHIT(*oldprev);						\
+} while (0)
+
+#define LIST_SWAP(head1, head2, type, field) do {			\
+	QUEUE_TYPEOF(type) *swap_tmp = LIST_FIRST(head1);		\
+	LIST_FIRST((head1)) = LIST_FIRST((head2));			\
+	LIST_FIRST((head2)) = swap_tmp;					\
+	if ((swap_tmp = LIST_FIRST((head1))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head1));		\
+	if ((swap_tmp = LIST_FIRST((head2))) != NULL)			\
+		swap_tmp->field.le_prev = &LIST_FIRST((head2));		\
+} while (0)
+
 /*
  * Tail queue declarations.
  */
@@ -123,10 +619,58 @@ struct {								\
 /*
  * Tail queue functions.
  */
+#if (defined(_KERNEL) && defined(INVARIANTS))
+/*
+ * QMD_TAILQ_CHECK_HEAD(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * If the tailq is non-empty, validates that the first element of the tailq
+ * points back at 'head.'
+ */
+#define	QMD_TAILQ_CHECK_HEAD(head, field) do {				\
+	if (!TAILQ_EMPTY(head) &&					\
+	    TAILQ_FIRST((head))->field.tqe_prev !=			\
+	     &TAILQ_FIRST((head)))					\
+		panic("Bad tailq head %p first->prev != head", (head));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_TAIL(TAILQ_HEAD *head, TAILQ_ENTRY NAME)
+ *
+ * Validates that the tail of the tailq is a pointer to pointer to NULL.
+ */
+#define	QMD_TAILQ_CHECK_TAIL(head, field) do {				\
+	if (*(head)->tqh_last != NULL)					\
+	    	panic("Bad tailq NEXT(%p->tqh_last) != NULL", (head)); 	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_NEXT(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * If an element follows 'elm' in the tailq, validates that the next element
+ * points back at 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_NEXT(elm, field) do {				\
+	if (TAILQ_NEXT((elm), field) != NULL &&				\
+	    TAILQ_NEXT((elm), field)->field.tqe_prev !=			\
+	     &((elm)->field.tqe_next))					\
+		panic("Bad link elm %p next->prev != elm", (elm));	\
+} while (0)
+
+/*
+ * QMD_TAILQ_CHECK_PREV(TYPE *elm, TAILQ_ENTRY NAME)
+ *
+ * Validates that the previous element (or head of the tailq) points to 'elm.'
+ */
+#define	QMD_TAILQ_CHECK_PREV(elm, field) do {				\
+	if (*(elm)->field.tqe_prev != (elm))				\
+		panic("Bad link elm %p prev->next != elm", (elm));	\
+} while (0)
+#else
 #define	QMD_TAILQ_CHECK_HEAD(head, field)
 #define	QMD_TAILQ_CHECK_TAIL(head, headname)
 #define	QMD_TAILQ_CHECK_NEXT(elm, field)
 #define	QMD_TAILQ_CHECK_PREV(elm, field)
+#endif /* (_KERNEL && INVARIANTS) */
 
 #define	TAILQ_CONCAT(head1, head2, field) do {				\
 	if (!TAILQ_EMPTY(head2)) {					\
@@ -191,9 +735,8 @@ struct {								\
 
 #define	TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
 	QMD_TAILQ_CHECK_NEXT(listelm, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field);	\
-	if (TAILQ_NEXT((listelm), field) != NULL)			\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_NEXT((listelm), field)) != NULL)\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    &TAILQ_NEXT((elm), field);				\
 	else {								\
 		(head)->tqh_last = &TAILQ_NEXT((elm), field);		\
@@ -217,8 +760,7 @@ struct {								\
 
 #define	TAILQ_INSERT_HEAD(head, elm, field) do {			\
 	QMD_TAILQ_CHECK_HEAD(head, field);				\
-	TAILQ_NEXT((elm), field) = TAILQ_FIRST((head));			\
-	if (TAILQ_FIRST((head)) != NULL)				\
+	if ((TAILQ_NEXT((elm), field) = TAILQ_FIRST((head))) != NULL)	\
 		TAILQ_FIRST((head))->field.tqe_prev =			\
 		    &TAILQ_NEXT((elm), field);				\
 	else								\
@@ -250,21 +792,24 @@ struct {								\
  * you may want to prefetch the last data element.
  */
 #define	TAILQ_LAST_FAST(head, type, field)			\
-	(TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last,	\
-	QUEUE_TYPEOF(type), field.tqe_next))
+    (TAILQ_EMPTY(head) ? NULL : __containerof((head)->tqh_last, QUEUE_TYPEOF(type), field.tqe_next))
 
 #define	TAILQ_NEXT(elm, field) ((elm)->field.tqe_next)
 
 #define	TAILQ_PREV(elm, headname, field)				\
 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
 
+#define	TAILQ_PREV_FAST(elm, head, type, field)				\
+    ((elm)->field.tqe_prev == &(head)->tqh_first ? NULL :		\
+     __containerof((elm)->field.tqe_prev, QUEUE_TYPEOF(type), field.tqe_next))
+
 #define	TAILQ_REMOVE(head, elm, field) do {				\
 	QMD_SAVELINK(oldnext, (elm)->field.tqe_next);			\
 	QMD_SAVELINK(oldprev, (elm)->field.tqe_prev);			\
 	QMD_TAILQ_CHECK_NEXT(elm, field);				\
 	QMD_TAILQ_CHECK_PREV(elm, field);				\
 	if ((TAILQ_NEXT((elm), field)) != NULL)				\
-		TAILQ_NEXT((elm), field)->field.tqe_prev =		\
+		TAILQ_NEXT((elm), field)->field.tqe_prev = 		\
 		    (elm)->field.tqe_prev;				\
 	else {								\
 		(head)->tqh_last = (elm)->field.tqe_prev;		\
@@ -277,26 +822,20 @@ struct {								\
 } while (0)
 
 #define TAILQ_SWAP(head1, head2, type, field) do {			\
-	QUEUE_TYPEOF(type) * swap_first = (head1)->tqh_first;		\
-	QUEUE_TYPEOF(type) * *swap_last = (head1)->tqh_last;		\
+	QUEUE_TYPEOF(type) *swap_first = (head1)->tqh_first;		\
+	QUEUE_TYPEOF(type) **swap_last = (head1)->tqh_last;		\
 	(head1)->tqh_first = (head2)->tqh_first;			\
 	(head1)->tqh_last = (head2)->tqh_last;				\
 	(head2)->tqh_first = swap_first;				\
 	(head2)->tqh_last = swap_last;					\
-	swap_first = (head1)->tqh_first;				\
-	if (swap_first != NULL)						\
+	if ((swap_first = (head1)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head1)->tqh_first;	\
 	else								\
 		(head1)->tqh_last = &(head1)->tqh_first;		\
-	swap_first = (head2)->tqh_first;				\
-	if (swap_first != NULL)			\
+	if ((swap_first = (head2)->tqh_first) != NULL)			\
 		swap_first->field.tqe_prev = &(head2)->tqh_first;	\
 	else								\
 		(head2)->tqh_last = &(head2)->tqh_first;		\
 } while (0)
 
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* _SYS_QUEUE_H_ */
+#endif /* !_SYS_QUEUE_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (7 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 08/12] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15 15:21                 ` Thomas Monjalon
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 10/12] doc/windows: split build and run instructions Dmitry Kozlyuk
                                 ` (4 subsequent siblings)
  13 siblings, 1 reply; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

1. Map CPU cores to their respective NUMA nodes as reported by the system.
2. Support systems with more than 64 cores (multiple processor groups).
3. Fix magic constants, styling issues, and compiler warnings.
4. Add an EAL private function to map DPDK socket ID to NUMA node number
   (see the sketch below).
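
With the map in place, resolving a NUMA node from a DPDK socket ID is a
table lookup into data built at startup. A minimal sketch, assuming the
private helper is named eal_socket_numa_node() (the actual name in the
patch may differ):

	unsigned int
	eal_socket_numa_node(unsigned int socket_id)
	{
		/* cpu_map.sockets[] is filled by eal_create_cpu_map(). */
		return cpu_map.sockets[socket_id].node_id;
	}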

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 lib/librte_eal/windows/eal.c         |   7 +-
 lib/librte_eal/windows/eal_lcore.c   | 205 +++++++++++++++++----------
 lib/librte_eal/windows/eal_windows.h |  15 +-
 3 files changed, 152 insertions(+), 75 deletions(-)

diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index e7461f731..dfc10b494 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -263,8 +263,11 @@ rte_eal_init(int argc, char **argv)
 
 	eal_log_level_parse(argc, argv);
 
-	/* create a map of all processors in the system */
-	eal_create_cpu_map();
+	if (eal_create_cpu_map() < 0) {
+		rte_eal_init_alert("Cannot discover CPU and NUMA.");
+		/* rte_errno is set */
+		return -1;
+	}
 
 	if (rte_eal_cpu_init() < 0) {
 		rte_eal_init_alert("Cannot detect lcores.");
diff --git a/lib/librte_eal/windows/eal_lcore.c b/lib/librte_eal/windows/eal_lcore.c
index 82ee45413..d5ff721e0 100644
--- a/lib/librte_eal/windows/eal_lcore.c
+++ b/lib/librte_eal/windows/eal_lcore.c
@@ -3,103 +3,164 @@
  */
 
 #include <pthread.h>
+#include <stdbool.h>
 #include <stdint.h>
 
 #include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_lcore.h>
+#include <rte_os.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
 #include "eal_windows.h"
 
-/* global data structure that contains the CPU map */
-static struct _wcpu_map {
-	unsigned int total_procs;
-	unsigned int proc_sockets;
-	unsigned int proc_cores;
-	unsigned int reserved;
-	struct _win_lcore_map {
-		uint8_t socket_id;
-		uint8_t core_id;
-	} wlcore_map[RTE_MAX_LCORE];
-} wcpu_map = { 0 };
-
-/*
- * Create a map of all processors and associated cores on the system
- */
-void
-eal_create_cpu_map()
+/** Number of logical processors (cores) in a processor group (32 or 64). */
+#define EAL_PROCESSOR_GROUP_SIZE (sizeof(KAFFINITY) * CHAR_BIT)
+
+struct lcore_map {
+	uint8_t socket_id;
+	uint8_t core_id;
+};
+
+struct socket_map {
+	uint16_t node_id;
+};
+
+struct cpu_map {
+	unsigned int socket_count;
+	unsigned int lcore_count;
+	struct lcore_map lcores[RTE_MAX_LCORE];
+	struct socket_map sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct cpu_map cpu_map = { 0 };
+
+/* eal_create_cpu_map() is called before logging is initialized */
+static void
+log_early(const char *format, ...)
+{
+	va_list va;
+
+	va_start(va, format);
+	vfprintf(stderr, format, va);
+	va_end(va);
+}
+
+int
+eal_create_cpu_map(void)
 {
-	wcpu_map.total_procs =
-		GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
-
-	LOGICAL_PROCESSOR_RELATIONSHIP lprocRel;
-	DWORD lprocInfoSize = 0;
-	BOOL ht_enabled = FALSE;
-
-	/* First get the processor package information */
-	lprocRel = RelationProcessorPackage;
-	/* Determine the size of buffer we need (pass NULL) */
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_sockets = lprocInfoSize / 48;
-
-	lprocInfoSize = 0;
-	/* Next get the processor core information */
-	lprocRel = RelationProcessorCore;
-	GetLogicalProcessorInformationEx(lprocRel, NULL, &lprocInfoSize);
-	wcpu_map.proc_cores = lprocInfoSize / 48;
-
-	if (wcpu_map.total_procs > wcpu_map.proc_cores)
-		ht_enabled = TRUE;
-
-	/* Distribute the socket and core ids appropriately
-	 * across the logical cores. For now, split the cores
-	 * equally across the sockets.
-	 */
-	unsigned int lcore = 0;
-	for (unsigned int socket = 0; socket <
-			wcpu_map.proc_sockets; ++socket) {
-		for (unsigned int core = 0;
-			core < (wcpu_map.proc_cores / wcpu_map.proc_sockets);
-			++core) {
-			wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-			wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-			lcore++;
-			if (ht_enabled) {
-				wcpu_map.wlcore_map[lcore]
-					.socket_id = socket;
-				wcpu_map.wlcore_map[lcore]
-					.core_id = core;
-				lcore++;
+	SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *infos, *info;
+	DWORD infos_size;
+	bool full = false;
+
+	infos_size = 0;
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, NULL, &infos_size)) {
+		DWORD error = GetLastError();
+		if (error != ERROR_INSUFFICIENT_BUFFER) {
+			log_early("Cannot get NUMA node info size, error %lu\n",
+				error);
+			rte_errno = ENOMEM;
+			return -1;
+		}
+	}
+
+	infos = malloc(infos_size);
+	if (infos == NULL) {
+		log_early("Cannot allocate memory for NUMA node information\n");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (!GetLogicalProcessorInformationEx(
+			RelationNumaNode, infos, &infos_size)) {
+		log_early("Cannot get NUMA node information, error %lu\n",
+			GetLastError());
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	info = infos;
+	while ((uint8_t *)info - (uint8_t *)infos < infos_size) {
+		unsigned int node_id = info->NumaNode.NodeNumber;
+		GROUP_AFFINITY *cores = &info->NumaNode.GroupMask;
+		struct lcore_map *lcore;
+		unsigned int i, socket_id;
+
+		/* A NUMA node may be reported multiple times if it includes
+		 * cores from different processor groups: e.g. 80 cores
+		 * of a physical processor comprise one NUMA node but two
+		 * processor groups, because group size is limited to 32/64.
+		 */
+		for (socket_id = 0; socket_id < cpu_map.socket_count;
+		    socket_id++) {
+			if (cpu_map.sockets[socket_id].node_id == node_id)
+				break;
+		}
+
+		if (socket_id == cpu_map.socket_count) {
+			if (socket_id == RTE_DIM(cpu_map.sockets)) {
+				full = true;
+				goto exit;
+			}
+
+			cpu_map.sockets[socket_id].node_id = node_id;
+			cpu_map.socket_count++;
+		}
+
+		for (i = 0; i < EAL_PROCESSOR_GROUP_SIZE; i++) {
+			if ((cores->Mask & ((KAFFINITY)1 << i)) == 0)
+				continue;
+
+			if (cpu_map.lcore_count == RTE_DIM(cpu_map.lcores)) {
+				full = true;
+				goto exit;
 			}
+
+			lcore = &cpu_map.lcores[cpu_map.lcore_count];
+			lcore->socket_id = socket_id;
+			lcore->core_id =
+				cores->Group * EAL_PROCESSOR_GROUP_SIZE + i;
+			cpu_map.lcore_count++;
 		}
+
+		info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)(
+			(uint8_t *)info + info->Size);
 	}
+
+exit:
+	if (full) {
+		/* Not a fatal error, but important for troubleshooting. */
+		log_early("Enumerated maximum of %u NUMA nodes and %u cores\n",
+			cpu_map.socket_count, cpu_map.lcore_count);
+	}
+
+	free(infos);
+
+	return 0;
 }
 
-/*
- * Check if a cpu is present by the presence of the cpu information for it
- */
 int
 eal_cpu_detected(unsigned int lcore_id)
 {
-	return (lcore_id < wcpu_map.total_procs);
+	return lcore_id < cpu_map.lcore_count;
 }
 
-/*
- * Get CPU socket id for a logical core
- */
 unsigned
 eal_cpu_socket_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].socket_id;
+	return cpu_map.lcores[lcore_id].socket_id;
 }
 
-/*
- * Get CPU socket id (NUMA node) for a logical core
- */
 unsigned
 eal_cpu_core_id(unsigned int lcore_id)
 {
-	return wcpu_map.wlcore_map[lcore_id].core_id;
+	return cpu_map.lcores[lcore_id].core_id;
+}
+
+unsigned int
+eal_socket_numa_node(unsigned int socket_id)
+{
+	return cpu_map.sockets[socket_id].node_id;
 }
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index fadd676b2..f3ed8c37f 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -13,8 +13,11 @@
 
 /**
  * Create a map of processors and cores on the system.
+ *
+ * @return
+ *  0 on success, (-1) on failure and rte_errno is set.
  */
-void eal_create_cpu_map(void);
+int eal_create_cpu_map(void);
 
 /**
  * Create a thread.
@@ -26,4 +29,14 @@ void eal_create_cpu_map(void);
  */
 int eal_thread_create(pthread_t *thread);
 
+/**
+ * Get system NUMA node number for a socket ID.
+ *
+ * @param socket_id
+ *  Valid EAL socket ID.
+ * @return
+ *  NUMA node number to use with Win32 API.
+ */
+unsigned int eal_socket_numa_node(unsigned int socket_id);
+
 #endif /* _EAL_WINDOWS_H_ */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 10/12] doc/windows: split build and run instructions
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (8 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 11/12] eal/windows: initialize hugepage info Dmitry Kozlyuk
                                 ` (3 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon, John McNamara,
	Marko Kovacevic

With memory management implemented for Windows, the guide for running
sample applications is going to be extended with hugepage and driver
setup. Move run instructions to a separate file to give space for
the planned expansion.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
 doc/guides/windows_gsg/build_dpdk.rst | 20 --------------------
 doc/guides/windows_gsg/index.rst      |  1 +
 doc/guides/windows_gsg/run_apps.rst   | 24 ++++++++++++++++++++++++
 3 files changed, 25 insertions(+), 20 deletions(-)
 create mode 100644 doc/guides/windows_gsg/run_apps.rst

diff --git a/doc/guides/windows_gsg/build_dpdk.rst b/doc/guides/windows_gsg/build_dpdk.rst
index d46e84e3f..650483e3b 100644
--- a/doc/guides/windows_gsg/build_dpdk.rst
+++ b/doc/guides/windows_gsg/build_dpdk.rst
@@ -111,23 +111,3 @@ Depending on the distribution, paths in this file may need adjustments.
 
     meson --cross-file config/x86/meson_mingw.txt -Dexamples=helloworld build
     ninja -C build
-
-
-Run the helloworld example
-==========================
-
-Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
-
-.. code-block:: console
-
-    cd C:\Users\me\dpdk\build\examples
-    dpdk-helloworld.exe
-    hello from core 1
-    hello from core 3
-    hello from core 0
-    hello from core 2
-
-Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
-by default. To run the example, either add toolchain executables directory
-to the PATH or copy the library to the working directory.
-Alternatively, static linking may be used (mind the LGPLv2.1 license).
diff --git a/doc/guides/windows_gsg/index.rst b/doc/guides/windows_gsg/index.rst
index d9b7990a8..e94593572 100644
--- a/doc/guides/windows_gsg/index.rst
+++ b/doc/guides/windows_gsg/index.rst
@@ -12,3 +12,4 @@ Getting Started Guide for Windows
 
     intro
     build_dpdk
+    run_apps
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
new file mode 100644
index 000000000..ff4c4654f
--- /dev/null
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -0,0 +1,24 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2020 Dmitry Kozlyuk
+
+Running DPDK Applications
+=========================
+
+Run the ``helloworld`` Example
+------------------------------
+
+Navigate to the examples in the build directory and run `dpdk-helloworld.exe`.
+
+.. code-block:: console
+
+    cd C:\Users\me\dpdk\build\examples
+    dpdk-helloworld.exe
+    hello from core 1
+    hello from core 3
+    hello from core 0
+    hello from core 2
+
+Note for MinGW-w64: applications are linked to ``libwinpthread-1.dll``
+by default. To run the example, either add toolchain executables directory
+to the PATH or copy the library to the working directory.
+Alternatively, static linking may be used (mind the LGPLv2.1 license).
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 11/12] eal/windows: initialize hugepage info
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (9 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 10/12] doc/windows: split build and run instructions Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 12/12] eal/windows: implement basic memory management Dmitry Kozlyuk
                                 ` (2 subsequent siblings)
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic

Add hugepage discovery ("large pages" in Windows terminology)
and update documentation for the required privilege setup. Only 2MB
hugepages are supported, and their number is estimated roughly,
because suitable OS APIs are either missing or not yet stable.
Assign myself as maintainer for the implementation file.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
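Note: a rough sketch of the estimate described above (illustration only,
not part of the patch; estimate_hugepages is a hypothetical helper). The
per-node page count is simply the available NUMA memory divided by the
minimum large-page size:

    #include <windows.h>

    /* Estimate the hugepage count for one NUMA node the way
     * hugepage_info_init() does. Returns 0 if large pages are
     * unsupported or the query fails.
     */
    static unsigned long long
    estimate_hugepages(USHORT numa_node)
    {
        ULONGLONG bytes;
        SIZE_T page_sz = GetLargePageMinimum(); /* 2MB on x86-64 */

        if (page_sz == 0)
            return 0; /* large pages not supported */
        if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes))
            return 0;
        return bytes / page_sz;
    }
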
 MAINTAINERS                            |   4 +
 config/meson.build                     |   2 +
 doc/guides/windows_gsg/run_apps.rst    |  23 ++++++
 lib/librte_eal/windows/eal.c           |  14 ++++
 lib/librte_eal/windows/eal_hugepages.c | 108 +++++++++++++++++++++++++
 lib/librte_eal/windows/meson.build     |   1 +
 6 files changed, 152 insertions(+)
 create mode 100644 lib/librte_eal/windows/eal_hugepages.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 241dbc3d7..9d5dacc23 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -334,6 +334,10 @@ F: lib/librte_eal/windows/
 F: lib/librte_eal/rte_eal_exports.def
 F: doc/guides/windows_gsg/
 
+Windows memory allocation
+M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
+F: lib/librte_eal/windows/eal_hugepages.c
+
 
 Core Libraries
 --------------
diff --git a/config/meson.build b/config/meson.build
index 43ab11310..c1e80de4b 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -268,6 +268,8 @@ if is_windows
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
+
+	add_project_link_arguments('-ladvapi32', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index ff4c4654f..21ac7f6c1 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -4,6 +4,29 @@
 Running DPDK Applications
 =========================
 
+Grant *Lock pages in memory* Privilege
+--------------------------------------
+
+Use of hugepages ("large pages" in Windows terminolocy) requires
+``SeLockMemoryPrivilege`` for the user running an application.
+
+1. Open *Local Security Policy* snap in, either:
+
+   * Control Panel / Computer Management / Local Security Policy;
+   * or Win+R, type ``secpol``, press Enter.
+
+2. Open *Local Policies / User Rights Assignment / Lock pages in memory.*
+
+3. Add desired users or groups to the list of grantees.
+
+4. The privilege is applied upon next logon. In particular, if it has been
+   granted to the current user, a logoff is required before it takes effect.
+
+See `Large-Page Support`_ in MSDN for details.
+
+.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
 Run the ``helloworld`` Example
 ------------------------------
 
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index dfc10b494..759bf4be5 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -19,8 +19,11 @@
 #include <eal_private.h>
 #include <rte_trace_point.h>
 
+#include "eal_hugepages.h"
 #include "eal_windows.h"
 
+#define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
+
  /* Allow the application to print its usage message too if set */
 static rte_usage_hook_t	rte_application_usage_hook;
 
@@ -279,6 +282,17 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
+		rte_eal_init_alert("Cannot get hugepage information");
+		rte_errno = EACCES;
+		return -1;
+	}
+
+	if (internal_config.memory == 0 && !internal_config.force_sockets) {
+		if (internal_config.no_hugetlbfs)
+			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_hugepages.c b/lib/librte_eal/windows/eal_hugepages.c
new file mode 100644
index 000000000..61d0dcd3c
--- /dev/null
+++ b/lib/librte_eal/windows/eal_hugepages.c
@@ -0,0 +1,108 @@
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_os.h>
+
+#include "eal_filesystem.h"
+#include "eal_hugepages.h"
+#include "eal_internal_cfg.h"
+#include "eal_windows.h"
+
+static int
+hugepage_claim_privilege(void)
+{
+	static const wchar_t privilege[] = L"SeLockMemoryPrivilege";
+
+	HANDLE token;
+	LUID luid;
+	TOKEN_PRIVILEGES tp;
+	int ret = -1;
+
+	if (!OpenProcessToken(GetCurrentProcess(),
+			TOKEN_ADJUST_PRIVILEGES, &token)) {
+		RTE_LOG_WIN32_ERR("OpenProcessToken()");
+		return -1;
+	}
+
+	if (!LookupPrivilegeValueW(NULL, privilege, &luid)) {
+		RTE_LOG_WIN32_ERR("LookupPrivilegeValue(\"%S\")", privilege);
+		goto exit;
+	}
+
+	tp.PrivilegeCount = 1;
+	tp.Privileges[0].Luid = luid;
+	tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
+
+	if (!AdjustTokenPrivileges(
+			token, FALSE, &tp, sizeof(tp), NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("AdjustTokenPrivileges()");
+		goto exit;
+	}
+
+	ret = 0;
+
+exit:
+	CloseHandle(token);
+
+	return ret;
+}
+
+static int
+hugepage_info_init(void)
+{
+	struct hugepage_info *hpi;
+	unsigned int socket_id;
+	int ret = 0;
+
+	/* Only one hugepage size available on Windows. */
+	internal_config.num_hugepage_sizes = 1;
+	hpi = &internal_config.hugepage_info[0];
+
+	hpi->hugepage_sz = GetLargePageMinimum();
+	if (hpi->hugepage_sz == 0)
+		return -ENOTSUP;
+
+	/* Assume all memory on each NUMA node available for hugepages,
+	 * because Windows neither advertises additional limits,
+	 * nor provides an API to query them.
+	 */
+	for (socket_id = 0; socket_id < rte_socket_count(); socket_id++) {
+		ULONGLONG bytes;
+		unsigned int numa_node;
+
+		numa_node = eal_socket_numa_node(socket_id);
+		if (!GetNumaAvailableMemoryNodeEx(numa_node, &bytes)) {
+			RTE_LOG_WIN32_ERR("GetNumaAvailableMemoryNodeEx(%u)",
+				numa_node);
+			continue;
+		}
+
+		hpi->num_pages[socket_id] = bytes / hpi->hugepage_sz;
+		RTE_LOG(DEBUG, EAL,
+			"Found %u hugepages of %zu bytes on socket %u\n",
+			hpi->num_pages[socket_id], hpi->hugepage_sz, socket_id);
+	}
+
+	/* No hugepage filesystem on Windows. */
+	hpi->lock_descriptor = -1;
+	memset(hpi->hugedir, 0, sizeof(hpi->hugedir));
+
+	return ret;
+}
+
+int
+eal_hugepage_info_init(void)
+{
+	if (hugepage_claim_privilege() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot claim hugepage privilege\n");
+		return -1;
+	}
+
+	if (hugepage_info_init() < 0) {
+		RTE_LOG(ERR, EAL, "Cannot get hugepage information\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index adfc8b9b7..52978e9d7 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,6 +6,7 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
 	'eal_thread.c',
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* [dpdk-dev] [PATCH v9 12/12] eal/windows: implement basic memory management
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (10 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 11/12] eal/windows: initialize hugepage info Dmitry Kozlyuk
@ 2020-06-15  0:43               ` Dmitry Kozlyuk
  2020-06-15 17:34               ` [dpdk-dev] [PATCH v9 00/12] Windows " Thomas Monjalon
  2020-06-16  1:52               ` Ranjit Menon
  13 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  0:43 UTC (permalink / raw)
  To: dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Thomas Monjalon,
	Harini Ramakrishnan, Omar Cardona, Pallavi Kadam, Ranjit Menon,
	John McNamara, Marko Kovacevic, Anatoly Burakov

Basic memory management supports core libraries and PMDs operating in
IOVA as PA mode. It uses a kernel-mode driver, virt2phys, to obtain
IOVAs of hugepages allocated from user-mode. Multi-process mode is not
implemented and is forcefully disabled at startup. Assign myself as a
maintainer for the Windows file and memory management implementation.

Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
---
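Note: a minimal user-mode sketch of the virt2phys round trip that
rte_mem_virt2phy() performs (illustration only, not part of the patch;
translate is a hypothetical helper). IOCTL_VIRT2PHYS_TRANSLATE and the
device handle setup come from rte_virt2phys.h and eal_mem_virt2iova_init():

    #include <windows.h>
    #include <rte_virt2phys.h>

    /* Ask the virt2phys driver to translate a virtual address.
     * Returns all-ones (RTE_BAD_PHYS_ADDR in EAL terms) on failure.
     */
    static unsigned long long
    translate(HANDLE device, const void *virt)
    {
        LARGE_INTEGER phys;
        DWORD bytes_returned;

        if (!DeviceIoControl(device, IOCTL_VIRT2PHYS_TRANSLATE,
                &virt, sizeof(virt), &phys, sizeof(phys),
                &bytes_returned, NULL))
            return ~0ULL;
        return (unsigned long long)phys.QuadPart;
    }
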
 MAINTAINERS                                   |   1 +
 config/meson.build                            |  12 +-
 doc/guides/windows_gsg/run_apps.rst           |  54 +-
 lib/librte_eal/common/meson.build             |  11 +
 lib/librte_eal/common/rte_malloc.c            |   1 +
 lib/librte_eal/rte_eal_exports.def            | 119 +++
 lib/librte_eal/windows/eal.c                  |  63 +-
 lib/librte_eal/windows/eal_file.c             | 125 +++
 lib/librte_eal/windows/eal_memalloc.c         | 441 +++++++++++
 lib/librte_eal/windows/eal_memory.c           | 710 ++++++++++++++++++
 lib/librte_eal/windows/eal_mp.c               | 103 +++
 lib/librte_eal/windows/eal_windows.h          |  75 ++
 lib/librte_eal/windows/include/meson.build    |   1 +
 lib/librte_eal/windows/include/rte_os.h       |  17 +
 .../windows/include/rte_virt2phys.h           |  34 +
 lib/librte_eal/windows/include/rte_windows.h  |   2 +
 lib/librte_eal/windows/include/unistd.h       |   3 +
 lib/librte_eal/windows/meson.build            |   6 +
 18 files changed, 1771 insertions(+), 7 deletions(-)
 create mode 100644 lib/librte_eal/windows/eal_file.c
 create mode 100644 lib/librte_eal/windows/eal_memalloc.c
 create mode 100644 lib/librte_eal/windows/eal_memory.c
 create mode 100644 lib/librte_eal/windows/eal_mp.c
 create mode 100644 lib/librte_eal/windows/include/rte_virt2phys.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9d5dacc23..a80a3b904 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -337,6 +337,7 @@ F: doc/guides/windows_gsg/
 Windows memory allocation
 M: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
 F: lib/librte_eal/windows/eal_hugepages.c
+F: lib/librte_eal/windows/eal_mem*
 
 
 Core Libraries
diff --git a/config/meson.build b/config/meson.build
index c1e80de4b..d3f05f878 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -261,15 +261,21 @@ if is_freebsd
 endif
 
 if is_windows
-	# Minimum supported API is Windows 7.
-	add_project_arguments('-D_WIN32_WINNT=0x0601', language: 'c')
+	# VirtualAlloc2() is available since Windows 10 / Server 2016.
+	add_project_arguments('-D_WIN32_WINNT=0x0A00', language: 'c')
 
 	# Use MinGW-w64 stdio, because DPDK assumes ANSI-compliant formatting.
 	if cc.get_id() == 'gcc'
 		add_project_arguments('-D__USE_MINGW_ANSI_STDIO', language: 'c')
 	endif
 
-	add_project_link_arguments('-ladvapi32', language: 'c')
+	# Contrary to docs, VirtualAlloc2() is exported by mincore.lib
+	# in Windows SDK, while MinGW exports it by advapi32.a.
+	if is_ms_linker
+		add_project_link_arguments('-lmincore', language: 'c')
+	endif
+
+	add_project_link_arguments('-ladvapi32', '-lsetupapi', language: 'c')
 endif
 
 if get_option('b_lto')
diff --git a/doc/guides/windows_gsg/run_apps.rst b/doc/guides/windows_gsg/run_apps.rst
index 21ac7f6c1..78e5a614f 100644
--- a/doc/guides/windows_gsg/run_apps.rst
+++ b/doc/guides/windows_gsg/run_apps.rst
@@ -7,10 +7,10 @@ Running DPDK Applications
 Grant *Lock pages in memory* Privilege
 --------------------------------------
 
-Use of hugepages ("large pages" in Windows terminolocy) requires
+Use of hugepages ("large pages" in Windows terminology) requires
 ``SeLockMemoryPrivilege`` for the user running an application.
 
-1. Open *Local Security Policy* snap in, either:
+1. Open *Local Security Policy* snap-in, either:
 
    * Control Panel / Computer Management / Local Security Policy;
    * or Win+R, type ``secpol``, press Enter.
@@ -24,7 +24,55 @@ Use of hugepages ("large pages" in Windows terminolocy) requires
 
 See `Large-Page Support`_ in MSDN for details.
 
-.. _Large-page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+.. _Large-Page Support: https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support
+
+
+Load virt2phys Driver
+---------------------
+
+Access to physical addresses is provided by a kernel-mode driver, virt2phys.
+It is mandatory at least for using hardware PMDs, but may also be required
+for mempools.
+
+Refer to documentation in ``dpdk-kmods`` repository for details on system
+setup, driver build and installation. This driver is not signed, so signature
+checking must be disabled to load it.
+
+.. warning::
+
+    Disabling driver signature enforcement weakens OS security.
+    It is discouraged in production environments.
+
+The compiled package consists of ``virt2phys.inf``, ``virt2phys.cat``,
+and ``virt2phys.sys``. It can be installed as follows
+from an elevated command prompt:
+
+.. code-block:: console
+
+    pnputil /add-driver Z:\path\to\virt2phys.inf /install
+
+On Windows Server, additional steps are required:
+
+1. From Device Manager, Action menu, select "Add legacy hardware".
+2. It will launch the "Add Hardware Wizard". Click "Next".
+3. Select second option "Install the hardware that I manually select
+   from a list (Advanced)".
+4. On the next screen, "Kernel bypass" will be shown as a device class.
+5. Select it, and click "Next".
+6. The previously installed drivers will now be installed for the
+   "Virtual to physical address translator" device.
+
+When loaded successfully, the driver is shown in *Device Manager* as the
+*Virtual to physical address translator* device under the *Kernel bypass*
+category. The installed driver persists across reboots.
+
+If DPDK is unable to communicate with the driver, a warning is printed
+on initialization (debug-level logs provide more details):
+
+.. code-block:: text
+
+    EAL: Cannot open virt2phys driver interface
+
 
 
 Run the ``helloworld`` Example
diff --git a/lib/librte_eal/common/meson.build b/lib/librte_eal/common/meson.build
index 4e9208129..310844269 100644
--- a/lib/librte_eal/common/meson.build
+++ b/lib/librte_eal/common/meson.build
@@ -8,13 +8,24 @@ if is_windows
 		'eal_common_bus.c',
 		'eal_common_class.c',
 		'eal_common_devargs.c',
+		'eal_common_dynmem.c',
 		'eal_common_errno.c',
+		'eal_common_fbarray.c',
 		'eal_common_launch.c',
 		'eal_common_lcore.c',
 		'eal_common_log.c',
+		'eal_common_mcfg.c',
+		'eal_common_memalloc.c',
+		'eal_common_memory.c',
+		'eal_common_memzone.c',
 		'eal_common_options.c',
+		'eal_common_string_fns.c',
+		'eal_common_tailqs.c',
 		'eal_common_thread.c',
 		'eal_common_trace_points.c',
+		'malloc_elem.c',
+		'malloc_heap.c',
+		'rte_malloc.c',
 	)
 	subdir_done()
 endif
diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c
index f1b73168b..9d39e58c0 100644
--- a/lib/librte_eal/common/rte_malloc.c
+++ b/lib/librte_eal/common/rte_malloc.c
@@ -20,6 +20,7 @@
 #include <rte_lcore.h>
 #include <rte_common.h>
 #include <rte_spinlock.h>
+
 #include <rte_eal_trace.h>
 
 #include <rte_malloc.h>
diff --git a/lib/librte_eal/rte_eal_exports.def b/lib/librte_eal/rte_eal_exports.def
index 12a6c79d6..e2eb24f01 100644
--- a/lib/librte_eal/rte_eal_exports.def
+++ b/lib/librte_eal/rte_eal_exports.def
@@ -1,9 +1,128 @@
 EXPORTS
 	__rte_panic
+	rte_calloc
+	rte_calloc_socket
 	rte_eal_get_configuration
+	rte_eal_has_hugepages
 	rte_eal_init
+	rte_eal_iova_mode
 	rte_eal_mp_remote_launch
 	rte_eal_mp_wait_lcore
+	rte_eal_process_type
 	rte_eal_remote_launch
 	rte_log
+	rte_eal_tailq_lookup
+	rte_eal_tailq_register
+	rte_eal_using_phys_addrs
+	rte_free
+	rte_malloc
+	rte_malloc_dump_stats
+	rte_malloc_get_socket_stats
+	rte_malloc_set_limit
+	rte_malloc_socket
+	rte_malloc_validate
+	rte_malloc_virt2iova
+	rte_mcfg_mem_read_lock
+	rte_mcfg_mem_read_unlock
+	rte_mcfg_mem_write_lock
+	rte_mcfg_mem_write_unlock
+	rte_mcfg_mempool_read_lock
+	rte_mcfg_mempool_read_unlock
+	rte_mcfg_mempool_write_lock
+	rte_mcfg_mempool_write_unlock
+	rte_mcfg_tailq_read_lock
+	rte_mcfg_tailq_read_unlock
+	rte_mcfg_tailq_write_lock
+	rte_mcfg_tailq_write_unlock
+	rte_mem_lock_page
+	rte_mem_virt2iova
+	rte_mem_virt2phy
+	rte_memory_get_nchannel
+	rte_memory_get_nrank
+	rte_memzone_dump
+	rte_memzone_free
+	rte_memzone_lookup
+	rte_memzone_reserve
+	rte_memzone_reserve_aligned
+	rte_memzone_reserve_bounded
+	rte_memzone_walk
 	rte_vlog
+	rte_realloc
+	rte_zmalloc
+	rte_zmalloc_socket
+
+	rte_mp_action_register
+	rte_mp_action_unregister
+	rte_mp_reply
+	rte_mp_sendmsg
+
+	rte_fbarray_attach
+	rte_fbarray_destroy
+	rte_fbarray_detach
+	rte_fbarray_dump_metadata
+	rte_fbarray_find_contig_free
+	rte_fbarray_find_contig_used
+	rte_fbarray_find_idx
+	rte_fbarray_find_next_free
+	rte_fbarray_find_next_n_free
+	rte_fbarray_find_next_n_used
+	rte_fbarray_find_next_used
+	rte_fbarray_get
+	rte_fbarray_init
+	rte_fbarray_is_used
+	rte_fbarray_set_free
+	rte_fbarray_set_used
+	rte_malloc_dump_heaps
+	rte_mem_alloc_validator_register
+	rte_mem_alloc_validator_unregister
+	rte_mem_check_dma_mask
+	rte_mem_event_callback_register
+	rte_mem_event_callback_unregister
+	rte_mem_iova2virt
+	rte_mem_virt2memseg
+	rte_mem_virt2memseg_list
+	rte_memseg_contig_walk
+	rte_memseg_list_walk
+	rte_memseg_walk
+	rte_mp_request_async
+	rte_mp_request_sync
+
+	rte_fbarray_find_prev_free
+	rte_fbarray_find_prev_n_free
+	rte_fbarray_find_prev_n_used
+	rte_fbarray_find_prev_used
+	rte_fbarray_find_rev_contig_free
+	rte_fbarray_find_rev_contig_used
+	rte_memseg_contig_walk_thread_unsafe
+	rte_memseg_list_walk_thread_unsafe
+	rte_memseg_walk_thread_unsafe
+
+	rte_malloc_heap_create
+	rte_malloc_heap_destroy
+	rte_malloc_heap_get_socket
+	rte_malloc_heap_memory_add
+	rte_malloc_heap_memory_attach
+	rte_malloc_heap_memory_detach
+	rte_malloc_heap_memory_remove
+	rte_malloc_heap_socket_is_external
+	rte_mem_check_dma_mask_thread_unsafe
+	rte_mem_set_dma_mask
+	rte_memseg_get_fd
+	rte_memseg_get_fd_offset
+	rte_memseg_get_fd_offset_thread_unsafe
+	rte_memseg_get_fd_thread_unsafe
+
+	rte_extmem_attach
+	rte_extmem_detach
+	rte_extmem_register
+	rte_extmem_unregister
+
+	rte_fbarray_find_biggest_free
+	rte_fbarray_find_biggest_used
+	rte_fbarray_find_rev_biggest_free
+	rte_fbarray_find_rev_biggest_used
+
+	rte_mem_lock
+	rte_mem_map
+	rte_mem_page_size
+	rte_mem_unmap
diff --git a/lib/librte_eal/windows/eal.c b/lib/librte_eal/windows/eal.c
index 759bf4be5..666651dc7 100644
--- a/lib/librte_eal/windows/eal.c
+++ b/lib/librte_eal/windows/eal.c
@@ -94,6 +94,24 @@ eal_proc_type_detect(void)
 	return ptype;
 }
 
+enum rte_proc_type_t
+rte_eal_process_type(void)
+{
+	return rte_config.process_type;
+}
+
+int
+rte_eal_has_hugepages(void)
+{
+	return !internal_config.no_hugetlbfs;
+}
+
+enum rte_iova_mode
+rte_eal_iova_mode(void)
+{
+	return rte_config.iova_mode;
+}
+
 /* display usage */
 static void
 eal_usage(const char *prgname)
@@ -256,7 +274,7 @@ __rte_trace_point_register(rte_trace_point_t *trace, const char *name,
 	return -ENOTSUP;
 }
 
-/* Launch threads, called at application init(). */
+ /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
 {
@@ -282,6 +300,13 @@ rte_eal_init(int argc, char **argv)
 	if (fctret < 0)
 		exit(1);
 
+	/* Prevent creation of shared memory files. */
+	if (internal_config.in_memory == 0) {
+		RTE_LOG(WARNING, EAL, "Multi-process support is requested, "
+			"but not available.\n");
+		internal_config.in_memory = 1;
+	}
+
 	if (!internal_config.no_hugetlbfs && (eal_hugepage_info_init() < 0)) {
 		rte_eal_init_alert("Cannot get hugepage information");
 		rte_errno = EACCES;
@@ -293,6 +318,42 @@ rte_eal_init(int argc, char **argv)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
 	}
 
+	if (eal_mem_win32api_init() < 0) {
+		rte_eal_init_alert("Cannot access Win32 memory management");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (eal_mem_virt2iova_init() < 0) {
+		/* Non-fatal error if physical addresses are not required. */
+		RTE_LOG(WARNING, EAL, "Cannot access virt2phys driver, "
+			"PA will not be available\n");
+	}
+
+	if (rte_eal_memzone_init() < 0) {
+		rte_eal_init_alert("Cannot init memzone");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_memory_init() < 0) {
+		rte_eal_init_alert("Cannot init memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (rte_eal_malloc_heap_init() < 0) {
+		rte_eal_init_alert("Cannot init malloc heap");
+		rte_errno = ENODEV;
+		return -1;
+	}
+
+	if (rte_eal_tailqs_init() < 0) {
+		rte_eal_init_alert("Cannot init tail queues for objects");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
 	eal_thread_init_master(rte_config.master_lcore);
 
 	RTE_LCORE_FOREACH_SLAVE(i) {
diff --git a/lib/librte_eal/windows/eal_file.c b/lib/librte_eal/windows/eal_file.c
new file mode 100644
index 000000000..dfbe8d311
--- /dev/null
+++ b/lib/librte_eal/windows/eal_file.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Dmitry Kozlyuk
+ */
+
+#include <fcntl.h>
+#include <io.h>
+#include <share.h>
+#include <sys/stat.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_file_open(const char *path, int flags)
+{
+	static const int MODE_MASK = EAL_OPEN_READONLY | EAL_OPEN_READWRITE;
+
+	int fd, ret, sys_flags;
+
+	switch (flags & MODE_MASK) {
+	case EAL_OPEN_READONLY:
+		sys_flags = _O_RDONLY;
+		break;
+	case EAL_OPEN_READWRITE:
+		sys_flags = _O_RDWR;
+		break;
+	default:
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	if (flags & EAL_OPEN_CREATE)
+		sys_flags |= _O_CREAT;
+
+	ret = _sopen_s(&fd, path, sys_flags, _SH_DENYNO, _S_IWRITE);
+	if (ret < 0) {
+		rte_errno = errno;
+		return -1;
+	}
+
+	return fd;
+}
+
+int
+eal_file_truncate(int fd, ssize_t size)
+{
+	HANDLE handle;
+	DWORD ret;
+	LONG low = (LONG)((size_t)size);
+	LONG high = (LONG)((size_t)size >> 32);
+
+	handle = (HANDLE)_get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	ret = SetFilePointer(handle, low, &high, FILE_BEGIN);
+	if (ret == INVALID_SET_FILE_POINTER) {
+		RTE_LOG_WIN32_ERR("SetFilePointer()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+lock_file(HANDLE handle, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	DWORD sys_flags = 0;
+	OVERLAPPED overlapped;
+
+	if (op == EAL_FLOCK_EXCLUSIVE)
+		sys_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+	if (mode == EAL_FLOCK_RETURN)
+		sys_flags |= LOCKFILE_FAIL_IMMEDIATELY;
+
+	memset(&overlapped, 0, sizeof(overlapped));
+	if (!LockFileEx(handle, sys_flags, 0, 0, 0, &overlapped)) {
+		if ((sys_flags & LOCKFILE_FAIL_IMMEDIATELY) &&
+			(GetLastError() == ERROR_IO_PENDING)) {
+			rte_errno = EWOULDBLOCK;
+		} else {
+			RTE_LOG_WIN32_ERR("LockFileEx()");
+			rte_errno = EINVAL;
+		}
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+unlock_file(HANDLE handle)
+{
+	if (!UnlockFileEx(handle, 0, 0, 0, NULL)) {
+		RTE_LOG_WIN32_ERR("UnlockFileEx()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+int
+eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode)
+{
+	HANDLE handle = (HANDLE)_get_osfhandle(fd);
+
+	if (handle == INVALID_HANDLE_VALUE) {
+		rte_errno = EBADF;
+		return -1;
+	}
+
+	switch (op) {
+	case EAL_FLOCK_EXCLUSIVE:
+	case EAL_FLOCK_SHARED:
+		return lock_file(handle, op, mode);
+	case EAL_FLOCK_UNLOCK:
+		return unlock_file(handle);
+	default:
+		rte_errno = EINVAL;
+		return -1;
+	}
+}
diff --git a/lib/librte_eal/windows/eal_memalloc.c b/lib/librte_eal/windows/eal_memalloc.c
new file mode 100644
index 000000000..a7452b6e1
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memalloc.c
@@ -0,0 +1,441 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <rte_errno.h>
+#include <rte_os.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+int
+eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
+{
+	/* Hugepages have no associated files in Windows. */
+	RTE_SET_USED(list_idx);
+	RTE_SET_USED(seg_idx);
+	RTE_SET_USED(offset);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+static int
+alloc_seg(struct rte_memseg *ms, void *requested_addr, int socket_id,
+	struct hugepage_info *hi)
+{
+	HANDLE current_process;
+	unsigned int numa_node;
+	size_t alloc_sz;
+	void *addr;
+	rte_iova_t iova = RTE_BAD_IOVA;
+	PSAPI_WORKING_SET_EX_INFORMATION info;
+	PSAPI_WORKING_SET_EX_BLOCK *page;
+
+	if (ms->len > 0) {
+		/* If a segment is already allocated as needed, return it. */
+		if ((ms->addr == requested_addr) &&
+			(ms->socket_id == socket_id) &&
+			(ms->hugepage_sz == hi->hugepage_sz)) {
+			return 0;
+		}
+
+		/* Bugcheck, should not happen. */
+		RTE_LOG(DEBUG, EAL, "Attempted to reallocate segment %p "
+			"(size %zu) on socket %d\n", ms->addr,
+			ms->len, ms->socket_id);
+		return -1;
+	}
+
+	current_process = GetCurrentProcess();
+	numa_node = eal_socket_numa_node(socket_id);
+	alloc_sz = hi->hugepage_sz;
+
+	if (requested_addr == NULL) {
+		/* Request a new chunk of memory from OS. */
+		addr = eal_mem_alloc_socket(alloc_sz, socket_id);
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot allocate %zu bytes "
+				"on socket %d\n", alloc_sz, socket_id);
+			return -1;
+		}
+	} else {
+		/* Requested address is already reserved, commit memory. */
+		addr = eal_mem_commit(requested_addr, alloc_sz, socket_id);
+
+		/* During commitment, memory is temporarily freed and might
+		 * be allocated by a different non-EAL thread. This is a fatal
+		 * error, because it breaks MSL assumptions.
+		 */
+		if ((addr != NULL) && (addr != requested_addr)) {
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				"allocation - MSL is not VA-contiguous!\n",
+				requested_addr);
+			return -1;
+		}
+
+		if (addr == NULL) {
+			RTE_LOG(DEBUG, EAL, "Cannot commit reserved memory %p "
+				"(size %zu) on socket %d\n",
+				requested_addr, alloc_sz, socket_id);
+			return -1;
+		}
+	}
+
+	/* Force OS to allocate a physical page and select a NUMA node.
+	 * Hugepages are not pageable in Windows, so there's no race
+	 * for physical address.
+	 */
+	*(volatile int *)addr = *(volatile int *)addr;
+
+	/* Only try to obtain IOVA if it's available, so that applications
+	 * that do not need IOVA can use this allocator.
+	 */
+	if (rte_eal_using_phys_addrs()) {
+		iova = rte_mem_virt2iova(addr);
+		if (iova == RTE_BAD_IOVA) {
+			RTE_LOG(DEBUG, EAL,
+				"Cannot get IOVA of allocated segment\n");
+			goto error;
+		}
+	}
+
+	/* Only "Ex" function can handle hugepages. */
+	info.VirtualAddress = addr;
+	if (!QueryWorkingSetEx(current_process, &info, sizeof(info))) {
+		RTE_LOG_WIN32_ERR("QueryWorkingSetEx(%p)", addr);
+		goto error;
+	}
+
+	page = &info.VirtualAttributes;
+	if (!page->Valid || !page->LargePage) {
+		RTE_LOG(DEBUG, EAL, "Got regular page instead of a hugepage\n");
+		goto error;
+	}
+	if (page->Node != numa_node) {
+		RTE_LOG(DEBUG, EAL,
+			"NUMA node hint %u (socket %d) not respected, got %u\n",
+			numa_node, socket_id, page->Node);
+		goto error;
+	}
+
+	ms->addr = addr;
+	ms->hugepage_sz = hi->hugepage_sz;
+	ms->len = alloc_sz;
+	ms->nchannel = rte_memory_get_nchannel();
+	ms->nrank = rte_memory_get_nrank();
+	ms->iova = iova;
+	ms->socket_id = socket_id;
+
+	return 0;
+
+error:
+	/* Only jump here when `addr` and `alloc_sz` are valid. */
+	if (eal_mem_decommit(addr, alloc_sz) && (rte_errno == EADDRNOTAVAIL)) {
+		/* During decommitment, memory is temporarily returned
+		 * to the system and the address may become unavailable.
+		 */
+		RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+			"allocation - MSL is not VA-contiguous!\n", addr);
+	}
+	return -1;
+}
+
+static int
+free_seg(struct rte_memseg *ms)
+{
+	if (eal_mem_decommit(ms->addr, ms->len)) {
+		if (rte_errno == EADDRNOTAVAIL) {
+			/* See alloc_seg() for explanation. */
+			RTE_LOG(CRIT, EAL, "Address %p occupied by an alien "
+				"allocation - MSL is not VA-contiguous!\n",
+				ms->addr);
+		}
+		return -1;
+	}
+
+	/* Must clear the segment, because alloc_seg() inspects it. */
+	memset(ms, 0, sizeof(*ms));
+	return 0;
+}
+
+struct alloc_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg **ms;
+	size_t page_sz;
+	unsigned int segs_allocated;
+	unsigned int n_segs;
+	int socket;
+	bool exact;
+};
+
+static int
+alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct alloc_walk_param *wa = arg;
+	struct rte_memseg_list *cur_msl;
+	size_t page_sz;
+	int cur_idx, start_idx, j;
+	unsigned int msl_idx, need, i;
+
+	if (msl->page_sz != wa->page_sz)
+		return 0;
+	if (msl->socket_id != wa->socket)
+		return 0;
+
+	page_sz = (size_t)msl->page_sz;
+
+	msl_idx = msl - mcfg->memsegs;
+	cur_msl = &mcfg->memsegs[msl_idx];
+
+	need = wa->n_segs;
+
+	/* try finding space in memseg list */
+	if (wa->exact) {
+		/* if we require exact number of pages in a list, find them */
+		cur_idx = rte_fbarray_find_next_n_free(
+			&cur_msl->memseg_arr, 0, need);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+	} else {
+		int cur_len;
+
+		/* we don't require exact number of pages, so we're going to go
+		 * for best-effort allocation. that means finding the biggest
+		 * unused block, and going with that.
+		 */
+		cur_idx = rte_fbarray_find_biggest_free(
+			&cur_msl->memseg_arr, 0);
+		if (cur_idx < 0)
+			return 0;
+		start_idx = cur_idx;
+		/* adjust the size to possibly be smaller than original
+		 * request, but do not allow it to be bigger.
+		 */
+		cur_len = rte_fbarray_find_contig_free(
+			&cur_msl->memseg_arr, cur_idx);
+		need = RTE_MIN(need, (unsigned int)cur_len);
+	}
+
+	for (i = 0; i < need; i++, cur_idx++) {
+		struct rte_memseg *cur;
+		void *map_addr;
+
+		cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
+		map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);
+
+		if (alloc_seg(cur, map_addr, wa->socket, wa->hi)) {
+			RTE_LOG(DEBUG, EAL, "attempted to allocate %i segments, "
+				"but only %i were allocated\n", need, i);
+
+			/* if exact number wasn't requested, stop */
+			if (!wa->exact)
+				goto out;
+
+			/* clean up */
+			for (j = start_idx; j < cur_idx; j++) {
+				struct rte_memseg *tmp;
+				struct rte_fbarray *arr = &cur_msl->memseg_arr;
+
+				tmp = rte_fbarray_get(arr, j);
+				rte_fbarray_set_free(arr, j);
+
+				if (free_seg(tmp))
+					RTE_LOG(DEBUG, EAL, "Cannot free page\n");
+			}
+			/* clear the list */
+			if (wa->ms)
+				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			return -1;
+		}
+		if (wa->ms)
+			wa->ms[i] = cur;
+
+		rte_fbarray_set_used(&cur_msl->memseg_arr, cur_idx);
+	}
+
+out:
+	wa->segs_allocated = i;
+	if (i > 0)
+		cur_msl->version++;
+
+	/* if we didn't allocate any segments, move on to the next list */
+	return i > 0;
+}
+
+struct free_walk_param {
+	struct hugepage_info *hi;
+	struct rte_memseg *ms;
+};
+static int
+free_seg_walk(const struct rte_memseg_list *msl, void *arg)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *found_msl;
+	struct free_walk_param *wa = arg;
+	uintptr_t start_addr, end_addr;
+	int msl_idx, seg_idx, ret;
+
+	start_addr = (uintptr_t) msl->base_va;
+	end_addr = start_addr + msl->len;
+
+	if ((uintptr_t)wa->ms->addr < start_addr ||
+		(uintptr_t)wa->ms->addr >= end_addr)
+		return 0;
+
+	msl_idx = msl - mcfg->memsegs;
+	seg_idx = RTE_PTR_DIFF(wa->ms->addr, start_addr) / msl->page_sz;
+
+	/* msl is const */
+	found_msl = &mcfg->memsegs[msl_idx];
+	found_msl->version++;
+
+	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
+
+	ret = free_seg(wa->ms);
+
+	return (ret < 0) ? (-1) : 1;
+}
+
+int
+eal_memalloc_alloc_seg_bulk(struct rte_memseg **ms, int n_segs,
+		size_t page_sz, int socket, bool exact)
+{
+	unsigned int i;
+	int ret = -1;
+	struct alloc_walk_param wa;
+	struct hugepage_info *hi = NULL;
+
+	if (internal_config.legacy_mem) {
+		RTE_LOG(ERR, EAL, "dynamic allocation not supported in legacy mode\n");
+		return -ENOTSUP;
+	}
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (page_sz == hpi->hugepage_sz) {
+			hi = hpi;
+			break;
+		}
+	}
+	if (!hi) {
+		RTE_LOG(ERR, EAL, "cannot find relevant hugepage_info entry\n");
+		return -1;
+	}
+
+	memset(&wa, 0, sizeof(wa));
+	wa.exact = exact;
+	wa.hi = hi;
+	wa.ms = ms;
+	wa.n_segs = n_segs;
+	wa.page_sz = page_sz;
+	wa.socket = socket;
+	wa.segs_allocated = 0;
+
+	/* memalloc is locked, so it's safe to use thread-unsafe version */
+	ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
+	if (ret == 0) {
+		RTE_LOG(ERR, EAL, "cannot find suitable memseg_list\n");
+		ret = -1;
+	} else if (ret > 0) {
+		ret = (int)wa.segs_allocated;
+	}
+
+	return ret;
+}
+
+struct rte_memseg *
+eal_memalloc_alloc_seg(size_t page_sz, int socket)
+{
+	struct rte_memseg *ms = NULL;
+	eal_memalloc_alloc_seg_bulk(&ms, 1, page_sz, socket, true);
+	return ms;
+}
+
+int
+eal_memalloc_free_seg_bulk(struct rte_memseg **ms, int n_segs)
+{
+	int seg, ret = 0;
+
+	/* dynamic free not supported in legacy mode */
+	if (internal_config.legacy_mem)
+		return -1;
+
+	for (seg = 0; seg < n_segs; seg++) {
+		struct rte_memseg *cur = ms[seg];
+		struct hugepage_info *hi = NULL;
+		struct free_walk_param wa;
+		size_t i;
+		int walk_res;
+
+		/* if this page is marked as unfreeable, fail */
+		if (cur->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
+			RTE_LOG(DEBUG, EAL, "Page is not allowed to be freed\n");
+			ret = -1;
+			continue;
+		}
+
+		memset(&wa, 0, sizeof(wa));
+
+		for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
+			hi = &internal_config.hugepage_info[i];
+			if (cur->hugepage_sz == hi->hugepage_sz)
+				break;
+		}
+		if (i == RTE_DIM(internal_config.hugepage_info)) {
+			RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n");
+			ret = -1;
+			continue;
+		}
+
+		wa.ms = cur;
+		wa.hi = hi;
+
+		/* memalloc is locked, so it's safe to use thread-unsafe version
+		 */
+		walk_res = rte_memseg_list_walk_thread_unsafe(free_seg_walk,
+				&wa);
+		if (walk_res == 1)
+			continue;
+		if (walk_res == 0)
+			RTE_LOG(ERR, EAL, "Couldn't find memseg list\n");
+		ret = -1;
+	}
+	return ret;
+}
+
+int
+eal_memalloc_free_seg(struct rte_memseg *ms)
+{
+	return eal_memalloc_free_seg_bulk(&ms, 1);
+}
+
+int
+eal_memalloc_sync_with_primary(void)
+{
+	/* No multi-process support. */
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+eal_memalloc_init(void)
+{
+	/* No action required. */
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_memory.c b/lib/librte_eal/windows/eal_memory.c
new file mode 100644
index 000000000..73be1cf72
--- /dev/null
+++ b/lib/librte_eal/windows/eal_memory.c
@@ -0,0 +1,710 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+#include <inttypes.h>
+#include <io.h>
+
+#include <rte_eal_paging.h>
+#include <rte_errno.h>
+
+#include "eal_internal_cfg.h"
+#include "eal_memalloc.h"
+#include "eal_memcfg.h"
+#include "eal_options.h"
+#include "eal_private.h"
+#include "eal_windows.h"
+
+#include <rte_virt2phys.h>
+
+/* MinGW-w64 headers lack VirtualAlloc2() in some distributions.
+ * Provide a copy of definitions and code to load it dynamically.
+ * Note: definitions are copied verbatim from Microsoft documentation
+ * and don't follow DPDK code style.
+ *
+ * MEM_PRESERVE_PLACEHOLDER being defined means VirtualAlloc2() is present too.
+ */
+#ifndef MEM_PRESERVE_PLACEHOLDER
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ne-winnt-mem_extended_parameter_type */
+typedef enum MEM_EXTENDED_PARAMETER_TYPE {
+	MemExtendedParameterInvalidType,
+	MemExtendedParameterAddressRequirements,
+	MemExtendedParameterNumaNode,
+	MemExtendedParameterPartitionHandle,
+	MemExtendedParameterUserPhysicalHandle,
+	MemExtendedParameterAttributeFlags,
+	MemExtendedParameterMax
+} *PMEM_EXTENDED_PARAMETER_TYPE;
+
+#define MEM_EXTENDED_PARAMETER_TYPE_BITS 4
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-mem_extended_parameter */
+typedef struct MEM_EXTENDED_PARAMETER {
+	struct {
+		DWORD64 Type : MEM_EXTENDED_PARAMETER_TYPE_BITS;
+		DWORD64 Reserved : 64 - MEM_EXTENDED_PARAMETER_TYPE_BITS;
+	} DUMMYSTRUCTNAME;
+	union {
+		DWORD64 ULong64;
+		PVOID   Pointer;
+		SIZE_T  Size;
+		HANDLE  Handle;
+		DWORD   ULong;
+	} DUMMYUNIONNAME;
+} MEM_EXTENDED_PARAMETER, *PMEM_EXTENDED_PARAMETER;
+
+/* https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc2 */
+typedef PVOID (*VirtualAlloc2_type)(
+	HANDLE                 Process,
+	PVOID                  BaseAddress,
+	SIZE_T                 Size,
+	ULONG                  AllocationType,
+	ULONG                  PageProtection,
+	MEM_EXTENDED_PARAMETER *ExtendedParameters,
+	ULONG                  ParameterCount
+);
+
+/* VirtualAlloc2() flags. */
+#define MEM_COALESCE_PLACEHOLDERS 0x00000001
+#define MEM_PRESERVE_PLACEHOLDER  0x00000002
+#define MEM_REPLACE_PLACEHOLDER   0x00004000
+#define MEM_RESERVE_PLACEHOLDER   0x00040000
+
+/* Named exactly like the function, so that calling code does not depend
+ * on whether it is resolved at compile time or dynamically.
+ */
+static VirtualAlloc2_type VirtualAlloc2;
+
+int
+eal_mem_win32api_init(void)
+{
+	/* Contrary to the docs, VirtualAlloc2() is not in kernel32.dll,
+	 * see https://github.com/MicrosoftDocs/feedback/issues/1129.
+	 */
+	static const char library_name[] = "kernelbase.dll";
+	static const char function[] = "VirtualAlloc2";
+
+	HMODULE library = NULL;
+	int ret = 0;
+
+	/* Already done. */
+	if (VirtualAlloc2 != NULL)
+		return 0;
+
+	library = LoadLibraryA(library_name);
+	if (library == NULL) {
+		RTE_LOG_WIN32_ERR("LoadLibraryA(\"%s\")", library_name);
+		return -1;
+	}
+
+	VirtualAlloc2 = (VirtualAlloc2_type)(
+		(void *)GetProcAddress(library, function));
+	if (VirtualAlloc2 == NULL) {
+		RTE_LOG_WIN32_ERR("GetProcAddress(\"%s\", \"%s\")\n",
+			library_name, function);
+
+		/* Contrary to the docs, Server 2016 is not supported. */
+		RTE_LOG(ERR, EAL, "Windows 10 or Windows Server 2019 "
+			"is required for memory management\n");
+		ret = -1;
+	}
+
+	FreeLibrary(library);
+
+	return ret;
+}
+
+#else
+
+/* Stub in case VirtualAlloc2() is provided by the compiler. */
+int
+eal_mem_win32api_init(void)
+{
+	return 0;
+}
+
+#endif /* defined(MEM_PRESERVE_PLACEHOLDER) */
+
+static HANDLE virt2phys_device = INVALID_HANDLE_VALUE;
+
+int
+eal_mem_virt2iova_init(void)
+{
+	HDEVINFO list = INVALID_HANDLE_VALUE;
+	SP_DEVICE_INTERFACE_DATA ifdata;
+	SP_DEVICE_INTERFACE_DETAIL_DATA *detail = NULL;
+	DWORD detail_size;
+	int ret = -1;
+
+	list = SetupDiGetClassDevs(
+		&GUID_DEVINTERFACE_VIRT2PHYS, NULL, NULL,
+		DIGCF_DEVICEINTERFACE | DIGCF_PRESENT);
+	if (list == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("SetupDiGetClassDevs()");
+		goto exit;
+	}
+
+	ifdata.cbSize = sizeof(ifdata);
+	if (!SetupDiEnumDeviceInterfaces(
+		list, NULL, &GUID_DEVINTERFACE_VIRT2PHYS, 0, &ifdata)) {
+		RTE_LOG_WIN32_ERR("SetupDiEnumDeviceInterfaces()");
+		goto exit;
+	}
+
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, NULL, 0, &detail_size, NULL)) {
+		if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+			RTE_LOG_WIN32_ERR(
+				"SetupDiGetDeviceInterfaceDetail(probe)");
+			goto exit;
+		}
+	}
+
+	detail = malloc(detail_size);
+	if (detail == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate virt2phys "
+			"device interface detail data\n");
+		goto exit;
+	}
+
+	detail->cbSize = sizeof(*detail);
+	if (!SetupDiGetDeviceInterfaceDetail(
+		list, &ifdata, detail, detail_size, NULL, NULL)) {
+		RTE_LOG_WIN32_ERR("SetupDiGetDeviceInterfaceDetail(read)");
+		goto exit;
+	}
+
+	RTE_LOG(DEBUG, EAL, "Found virt2phys device: %s\n", detail->DevicePath);
+
+	virt2phys_device = CreateFile(
+		detail->DevicePath, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
+	if (virt2phys_device == INVALID_HANDLE_VALUE) {
+		RTE_LOG_WIN32_ERR("CreateFile()");
+		goto exit;
+	}
+
+	/* Indicate success. */
+	ret = 0;
+
+exit:
+	if (detail != NULL)
+		free(detail);
+	if (list != INVALID_HANDLE_VALUE)
+		SetupDiDestroyDeviceInfoList(list);
+
+	return ret;
+}
+
+phys_addr_t
+rte_mem_virt2phy(const void *virt)
+{
+	LARGE_INTEGER phys;
+	DWORD bytes_returned;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_PHYS_ADDR;
+
+	if (!DeviceIoControl(
+			virt2phys_device, IOCTL_VIRT2PHYS_TRANSLATE,
+			&virt, sizeof(virt), &phys, sizeof(phys),
+			&bytes_returned, NULL)) {
+		RTE_LOG_WIN32_ERR("DeviceIoControl(IOCTL_VIRT2PHYS_TRANSLATE)");
+		return RTE_BAD_PHYS_ADDR;
+	}
+
+	return phys.QuadPart;
+}
+
+/* Windows currently only supports IOVA as PA. */
+rte_iova_t
+rte_mem_virt2iova(const void *virt)
+{
+	phys_addr_t phys;
+
+	if (virt2phys_device == INVALID_HANDLE_VALUE)
+		return RTE_BAD_IOVA;
+
+	phys = rte_mem_virt2phy(virt);
+	if (phys == RTE_BAD_PHYS_ADDR)
+		return RTE_BAD_IOVA;
+
+	return (rte_iova_t)phys;
+}
+
+/* Always using physical addresses under Windows if they can be obtained. */
+int
+rte_eal_using_phys_addrs(void)
+{
+	return virt2phys_device != INVALID_HANDLE_VALUE;
+}
+
+/* Approximate error mapping from VirtualAlloc2() to POSIX mmap(3). */
+static void
+set_errno_from_win32_alloc_error(DWORD code)
+{
+	switch (code) {
+	case ERROR_SUCCESS:
+		rte_errno = 0;
+		break;
+
+	case ERROR_INVALID_ADDRESS:
+		/* A valid requested address is not available. */
+	case ERROR_COMMITMENT_LIMIT:
+		/* May occur when committing regular memory. */
+	case ERROR_NO_SYSTEM_RESOURCES:
+		/* Occurs when the system runs out of hugepages. */
+		rte_errno = ENOMEM;
+		break;
+
+	case ERROR_INVALID_PARAMETER:
+	default:
+		rte_errno = EINVAL;
+		break;
+	}
+}
+
+void *
+eal_mem_reserve(void *requested_addr, size_t size, int flags)
+{
+	HANDLE process;
+	void *virt;
+
+	/* Windows requires hugepages to be committed. */
+	if (flags & EAL_RESERVE_HUGEPAGES) {
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+
+	process = GetCurrentProcess();
+
+	virt = VirtualAlloc2(process, requested_addr, size,
+		MEM_RESERVE | MEM_RESERVE_PLACEHOLDER, PAGE_NOACCESS,
+		NULL, 0);
+	if (virt == NULL) {
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2()");
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((flags & EAL_RESERVE_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!VirtualFreeEx(process, virt, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx()");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	return virt;
+}
+
+void *
+eal_mem_alloc_socket(size_t size, int socket_id)
+{
+	DWORD flags;
+	void *addr;
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAllocExNuma(GetCurrentProcess(), NULL, size, flags,
+		PAGE_READWRITE, eal_socket_numa_node(socket_id));
+	if (addr == NULL)
+		rte_errno = ENOMEM;
+	return addr;
+}
+
+void *
+eal_mem_commit(void *requested_addr, size_t size, int socket_id)
+{
+	HANDLE process;
+	MEM_EXTENDED_PARAMETER param;
+	DWORD param_count = 0;
+	DWORD flags;
+	void *addr;
+
+	process = GetCurrentProcess();
+
+	if (requested_addr != NULL) {
+		MEMORY_BASIC_INFORMATION info;
+
+		if (VirtualQueryEx(process, requested_addr, &info,
+				sizeof(info)) != sizeof(info)) {
+			RTE_LOG_WIN32_ERR("VirtualQuery(%p)", requested_addr);
+			return NULL;
+		}
+
+		/* Split reserved region if only a part is committed. */
+		flags = MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER;
+		if ((info.RegionSize > size) && !VirtualFreeEx(
+				process, requested_addr, size, flags)) {
+			RTE_LOG_WIN32_ERR(
+				"VirtualFreeEx(%p, %zu, preserve placeholder)",
+				requested_addr, size);
+			return NULL;
+		}
+
+		/* Temporarily release the region to be committed.
+		 *
+		 * There is an inherent race for this memory range
+		 * if another thread allocates memory via OS API.
+		 * However, VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)
+		 * doesn't work with MEM_LARGE_PAGES on Windows Server.
+		 */
+		if (!VirtualFreeEx(process, requested_addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				requested_addr);
+			return NULL;
+		}
+	}
+
+	if (socket_id != SOCKET_ID_ANY) {
+		param_count = 1;
+		memset(&param, 0, sizeof(param));
+		param.Type = MemExtendedParameterNumaNode;
+		param.ULong = eal_socket_numa_node(socket_id);
+	}
+
+	flags = MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES;
+	addr = VirtualAlloc2(process, requested_addr, size,
+		flags, PAGE_READWRITE, &param, param_count);
+	if (addr == NULL) {
+		/* Logging may overwrite GetLastError() result. */
+		DWORD err = GetLastError();
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, commit large pages)",
+			requested_addr, size);
+		set_errno_from_win32_alloc_error(err);
+		return NULL;
+	}
+
+	if ((requested_addr != NULL) && (addr != requested_addr)) {
+		/* We lost the race for the requested_addr. */
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE))
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, release)", addr);
+
+		rte_errno = EADDRNOTAVAIL;
+		return NULL;
+	}
+
+	return addr;
+}
+
+int
+eal_mem_decommit(void *addr, size_t size)
+{
+	HANDLE process;
+	void *stub;
+	DWORD flags;
+
+	process = GetCurrentProcess();
+
+	/* Hugepages cannot be decommitted on Windows,
+	 * so free them and replace the block with a placeholder.
+	 * There is a race for the VA in this block until the VirtualAlloc2 call.
+	 */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	flags = MEM_RESERVE | MEM_RESERVE_PLACEHOLDER;
+	stub = VirtualAlloc2(
+		process, addr, size, flags, PAGE_NOACCESS, NULL, 0);
+	if (stub == NULL) {
+		/* We lost the race for the VA: the region was already
+		 * released above, so there is nothing left to free here.
+		 */
+		RTE_LOG_WIN32_ERR("VirtualAlloc2(%p, %zu, placeholder)",
+			addr, size);
+		rte_errno = EADDRNOTAVAIL;
+		return -1;
+	}
+
+	/* No need to join reserved regions adjacent to the freed one:
+	 * eal_mem_commit() will just pick up the page-size placeholder
+	 * created here.
+	 */
+	return 0;
+}
+
+/**
+ * Free a reserved memory region in full or in part.
+ *
+ * @param addr
+ *  Starting address of the area to free.
+ * @param size
+ *  Number of bytes to free. Must be a multiple of page size.
+ * @param reserved
+ *  Fail if the region is not in reserved state.
+ * @return
+ *  * 0 on successful deallocation;
+ *  * 1 if region must be in reserved state but it is not;
+ *  * (-1) on system API failures.
+ */
+static int
+mem_free(void *addr, size_t size, bool reserved)
+{
+	MEMORY_BASIC_INFORMATION info;
+	HANDLE process;
+
+	process = GetCurrentProcess();
+
+	if (VirtualQueryEx(
+			process, addr, &info, sizeof(info)) != sizeof(info)) {
+		RTE_LOG_WIN32_ERR("VirtualQueryEx(%p)", addr);
+		return -1;
+	}
+
+	if (reserved && (info.State != MEM_RESERVE))
+		return 1;
+
+	/* Free complete region. */
+	if ((addr == info.AllocationBase) && (size == info.RegionSize)) {
+		if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+			RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)",
+				addr);
+		}
+		return 0;
+	}
+
+	/* Split the part to be freed and the remaining reservation. */
+	if (!VirtualFreeEx(process, addr, size,
+			MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER)) {
+		RTE_LOG_WIN32_ERR(
+			"VirtualFreeEx(%p, %zu, preserve placeholder)",
+			addr, size);
+		return -1;
+	}
+
+	/* Actually free reservation part. */
+	if (!VirtualFreeEx(process, addr, 0, MEM_RELEASE)) {
+		RTE_LOG_WIN32_ERR("VirtualFreeEx(%p, 0, release)", addr);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+eal_mem_free(void *virt, size_t size)
+{
+	mem_free(virt, size, false);
+}
+
+int
+eal_mem_set_dump(void *virt, size_t size, bool dump)
+{
+	RTE_SET_USED(virt);
+	RTE_SET_USED(size);
+	RTE_SET_USED(dump);
+
+	/* Windows does not dump reserved memory by default.
+	 *
+	 * There is <werapi.h> to include or exclude regions from the dump,
+	 * but this is not currently required by EAL.
+	 */
+
+	rte_errno = ENOTSUP;
+	return -1;
+}
+
+void *
+rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
+	int fd, size_t offset)
+{
+	HANDLE file_handle = INVALID_HANDLE_VALUE;
+	HANDLE mapping_handle = INVALID_HANDLE_VALUE;
+	DWORD sys_prot = 0;
+	DWORD sys_access = 0;
+	DWORD size_high = (DWORD)(size >> 32);
+	DWORD size_low = (DWORD)size;
+	DWORD offset_high = (DWORD)(offset >> 32);
+	DWORD offset_low = (DWORD)offset;
+	LPVOID virt = NULL;
+
+	if (prot & RTE_PROT_EXECUTE) {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_EXECUTE_READ;
+			sys_access = FILE_MAP_READ | FILE_MAP_EXECUTE;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_EXECUTE_READWRITE;
+			sys_access = FILE_MAP_WRITE | FILE_MAP_EXECUTE;
+		}
+	} else {
+		if (prot & RTE_PROT_READ) {
+			sys_prot = PAGE_READONLY;
+			sys_access = FILE_MAP_READ;
+		}
+		if (prot & RTE_PROT_WRITE) {
+			sys_prot = PAGE_READWRITE;
+			sys_access = FILE_MAP_WRITE;
+		}
+	}
+
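+	/* FILE_MAP_COPY gives the view copy-on-write semantics,
+	 * the closest Win32 equivalent of POSIX MAP_PRIVATE.
+	 */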
+	if (flags & RTE_MAP_PRIVATE)
+		sys_access |= FILE_MAP_COPY;
+
+	if ((flags & RTE_MAP_ANONYMOUS) == 0)
+		file_handle = (HANDLE)_get_osfhandle(fd);
+
+	/* Note: CreateFileMapping() returns NULL on failure,
+	 * not INVALID_HANDLE_VALUE.
+	 */
+	mapping_handle = CreateFileMapping(
+		file_handle, NULL, sys_prot, size_high, size_low, NULL);
+	if (mapping_handle == NULL) {
+		RTE_LOG_WIN32_ERR("CreateFileMapping()");
+		return NULL;
+	}
+
+	/* There is a race for the requested_addr between mem_free()
+	 * and MapViewOfFileEx(). MapViewOfFile3() can replace a reserved
+	 * region with a mapping in a single operation, but it does not
+	 * support private mappings.
+	 */
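+	/* (MapViewOfFile3() from <memoryapi.h> is the placeholder-aware
+	 * API supporting MEM_REPLACE_PLACEHOLDER; if it ever supports
+	 * private mappings, this free-then-map sequence can be dropped.)
+	 */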
+	if (requested_addr != NULL) {
+		int ret = mem_free(requested_addr, size, true);
+		if (ret) {
+			if (ret > 0) {
+				RTE_LOG(ERR, EAL, "Cannot map memory "
+					"to a region not reserved\n");
+				rte_errno = EADDRNOTAVAIL;
+			}
+			/* Do not leak the mapping handle on failure. */
+			if (!CloseHandle(mapping_handle))
+				RTE_LOG_WIN32_ERR("CloseHandle()");
+			return NULL;
+		}
+	}
+
+	virt = MapViewOfFileEx(mapping_handle, sys_access,
+		offset_high, offset_low, size, requested_addr);
+	if (virt == NULL) {
+		RTE_LOG_WIN32_ERR("MapViewOfFileEx()");
+		if (!CloseHandle(mapping_handle))
+			RTE_LOG_WIN32_ERR("CloseHandle()");
+		return NULL;
+	}
+
+	if ((flags & RTE_MAP_FORCE_ADDRESS) && (virt != requested_addr)) {
+		if (!UnmapViewOfFile(virt))
+			RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		virt = NULL;
+		rte_errno = EADDRNOTAVAIL;
+	}
+
+	if (!CloseHandle(mapping_handle))
+		RTE_LOG_WIN32_ERR("CloseHandle()");
+
+	return virt;
+}
+
+int
+rte_mem_unmap(void *virt, size_t size)
+{
+	RTE_SET_USED(size);
+
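+	/* UnmapViewOfFile() always unmaps the entire view, so *size*
+	 * is accepted only to keep the wrapper signature uniform.
+	 */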
+	if (!UnmapViewOfFile(virt)) {
+		RTE_LOG_WIN32_ERR("UnmapViewOfFile()");
+		rte_errno = EINVAL;
+		return -1;
+	}
+	return 0;
+}
+
+uint64_t
+eal_get_baseaddr(void)
+{
+	/* Windows strategy for memory allocation is undocumented.
+	 * Returning 0 here effectively disables address guessing
+	 * unless user provides an address hint.
+	 */
+	return 0;
+}
+
+size_t
+rte_mem_page_size(void)
+{
+	static SYSTEM_INFO info;
+
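+	/* GetSystemInfo() cannot fail and the page size never changes
+	 * at run time, so query the OS once and cache the result.
+	 */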
+	if (info.dwPageSize == 0)
+		GetSystemInfo(&info);
+
+	return info.dwPageSize;
+}
+
+int
+rte_mem_lock(const void *virt, size_t size)
+{
+	/* VirtualLock() takes a non-const void *; cast to avoid a warning. */
+	void *addr = (void *)((uintptr_t)virt);
+
+	if (!VirtualLock(addr, size)) {
+		RTE_LOG_WIN32_ERR("VirtualLock(%p %#zx)", virt, size);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_memseg_init(void)
+{
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		EAL_LOG_NOT_IMPLEMENTED();
+		return -1;
+	}
+
+	return eal_dynmem_memseg_lists_init();
+}
+
+static int
+eal_nohuge_init(void)
+{
+	struct rte_mem_config *mcfg;
+	struct rte_memseg_list *msl;
+	int n_segs;
+	uint64_t mem_sz, page_sz;
+	void *addr;
+
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* nohuge mode is legacy mode */
+	internal_config.legacy_mem = 1;
+
+	msl = &mcfg->memsegs[0];
+
+	mem_sz = internal_config.memory;
+	page_sz = RTE_PGSIZE_4K;
+	n_segs = mem_sz / page_sz;
+
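+	/* The whole no-huge budget is served from a single committed
+	 * block of regular 4K pages, split into n_segs memsegs below.
+	 */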
+	if (eal_memseg_list_init_named(
+			msl, "nohugemem", page_sz, n_segs, 0, true)) {
+		return -1;
+	}
+
+	addr = VirtualAlloc(
+		NULL, mem_sz, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (addr == NULL) {
+		RTE_LOG_WIN32_ERR("VirtualAlloc(size=%#zx)", mem_sz);
+		RTE_LOG(ERR, EAL, "Cannot allocate memory\n");
+		return -1;
+	}
+
+	msl->base_va = addr;
+	msl->len = mem_sz;
+
+	eal_memseg_list_populate(msl, addr, n_segs);
+
+	if (mcfg->dma_maskbits &&
+		rte_mem_check_dma_mask_thread_unsafe(mcfg->dma_maskbits)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): couldn't allocate memory due to IOVA "
+			"exceeding limits of current DMA mask.\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_eal_hugepage_init(void)
+{
+	return internal_config.no_hugetlbfs ?
+		eal_nohuge_init() : eal_dynmem_hugepage_init();
+}
+
+int
+rte_eal_hugepage_attach(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
diff --git a/lib/librte_eal/windows/eal_mp.c b/lib/librte_eal/windows/eal_mp.c
new file mode 100644
index 000000000..16a5e8ba0
--- /dev/null
+++ b/lib/librte_eal/windows/eal_mp.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file Multiprocess support stubs
+ *
+ * Stubs must log an error until implemented. If success is required
+ * for non-multiprocess operation, the stub must log a warning instead,
+ * and a comment must document what depends on its success.
+ */
+
+#include <rte_eal.h>
+#include <rte_errno.h>
+
+#include "eal_private.h"
+#include "eal_windows.h"
+#include "malloc_mp.h"
+
+void
+rte_mp_channel_cleanup(void)
+{
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_action_register(const char *name, rte_mp_t action)
+{
+	RTE_SET_USED(name);
+	RTE_SET_USED(action);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+void
+rte_mp_action_unregister(const char *name)
+{
+	RTE_SET_USED(name);
+	EAL_LOG_NOT_IMPLEMENTED();
+}
+
+int
+rte_mp_sendmsg(struct rte_mp_msg *msg)
+{
+	RTE_SET_USED(msg);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_sync(struct rte_mp_msg *req, struct rte_mp_reply *reply,
+	const struct timespec *ts)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(reply);
+	RTE_SET_USED(ts);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
+{
+	RTE_SET_USED(req);
+	RTE_SET_USED(ts);
+	RTE_SET_USED(clb);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
+	RTE_SET_USED(msg);
+	RTE_SET_USED(peer);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+register_mp_requests(void)
+{
+	/* Non-stub function succeeds if multi-process is not supported. */
+	EAL_LOG_STUB();
+	return 0;
+}
+
+int
+request_to_primary(struct malloc_mp_req *req)
+{
+	RTE_SET_USED(req);
+	EAL_LOG_NOT_IMPLEMENTED();
+	return -1;
+}
+
+int
+request_sync(void)
+{
+	/* Common memory allocator depends on this function success. */
+	EAL_LOG_STUB();
+	return 0;
+}
diff --git a/lib/librte_eal/windows/eal_windows.h b/lib/librte_eal/windows/eal_windows.h
index f3ed8c37f..d48ee0a12 100644
--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>
 
+/**
+ * Log current function as not implemented and set rte_errno.
+ */
+#define EAL_LOG_NOT_IMPLEMENTED() \
+	do { \
+		RTE_LOG(DEBUG, EAL, "%s() is not implemented\n", __func__); \
+		rte_errno = ENOTSUP; \
+	} while (0)
+
+/**
+ * Log current function as a stub.
+ */
+#define EAL_LOG_STUB() \
+	RTE_LOG(DEBUG, EAL, "Windows: %s() is a stub\n", __func__)
+
 /**
  * Create a map of processors and cores on the system.
  *
@@ -39,4 +55,63 @@ int eal_thread_create(pthread_t *thread);
  */
 unsigned int eal_socket_numa_node(unsigned int socket_id);
 
+/**
+ * Open virt2phys driver interface device.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_virt2iova_init(void);
+
+/**
+ * Locate Win32 memory management routines in system libraries.
+ *
+ * @return 0 on success, (-1) on failure.
+ */
+int eal_mem_win32api_init(void);
+
+/**
+ * Allocate new memory in hugepages on the specified NUMA node.
+ *
+ * @param size
+ *  Number of bytes to allocate. Must be a multiple of huge page size.
+ * @param socket_id
+ *  Socket ID.
+ * @return
+ *  Address of the memory allocated on success or NULL on failure.
+ */
+void *eal_mem_alloc_socket(size_t size, int socket_id);
+
+/**
+ * Commit memory previously reserved with eal_mem_reserve()
+ * or decommitted from hugepages by eal_mem_decommit().
+ *
+ * @param requested_addr
+ *  Address within a reserved region. Must not be NULL.
+ * @param size
+ *  Number of bytes to commit. Must be a multiple of page size.
+ * @param socket_id
+ *  Socket ID to allocate on. Can be SOCKET_ID_ANY.
+ * @return
+ *  On success, address of the committed memory, that is, requested_addr.
+ *  On failure, NULL and rte_errno is set.
+ */
+void *eal_mem_commit(void *requested_addr, size_t size, int socket_id);
+
+/**
+ * Put allocated or committed memory back into reserved state.
+ *
+ * @param addr
+ *  Address of the region to decommit.
+ * @param size
+ *  Number of bytes to decommit, must be the size of a page
+ *  (hugepage or regular one).
+ *
+ * The *addr* and *size* must match location and size
+ * of a previously allocated or committed region.
+ *
+ * @return
+ *  0 on success, (-1) on failure.
+ */
+int eal_mem_decommit(void *addr, size_t size);
+
 #endif /* _EAL_WINDOWS_H_ */
diff --git a/lib/librte_eal/windows/include/meson.build b/lib/librte_eal/windows/include/meson.build
index 5fb1962ac..b3534b025 100644
--- a/lib/librte_eal/windows/include/meson.build
+++ b/lib/librte_eal/windows/include/meson.build
@@ -5,5 +5,6 @@ includes += include_directories('.')
 
 headers += files(
         'rte_os.h',
+        'rte_virt2phys.h',
         'rte_windows.h',
 )
diff --git a/lib/librte_eal/windows/include/rte_os.h b/lib/librte_eal/windows/include/rte_os.h
index 510e39e03..cb10d6494 100644
--- a/lib/librte_eal/windows/include/rte_os.h
+++ b/lib/librte_eal/windows/include/rte_os.h
@@ -14,6 +14,7 @@
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -36,6 +37,9 @@ extern "C" {
 
 #define strncasecmp(s1, s2, count)        _strnicmp(s1, s2, count)
 
+#define close _close
+#define unlink _unlink
+
 /* cpu_set macros implementation */
 #define RTE_CPU_AND(dst, src1, src2) CPU_AND(dst, src1, src2)
 #define RTE_CPU_OR(dst, src1, src2) CPU_OR(dst, src1, src2)
@@ -46,6 +50,7 @@ extern "C" {
 typedef long long ssize_t;
 
 #ifndef RTE_TOOLCHAIN_GCC
+
 static inline int
 asprintf(char **buffer, const char *format, ...)
 {
@@ -72,6 +77,18 @@ asprintf(char **buffer, const char *format, ...)
 	}
 	return ret;
 }
+
+static inline const char *
+eal_strerror(int code)
+{
+	static char buffer[128];
+
+	strerror_s(buffer, sizeof(buffer), code);
+	return buffer;
+}
+
+#define strerror eal_strerror
+
 #endif /* RTE_TOOLCHAIN_GCC */
 
 #ifdef __cplusplus
diff --git a/lib/librte_eal/windows/include/rte_virt2phys.h b/lib/librte_eal/windows/include/rte_virt2phys.h
new file mode 100644
index 000000000..4bb2b4aaf
--- /dev/null
+++ b/lib/librte_eal/windows/include/rte_virt2phys.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2020 Dmitry Kozlyuk
+ */
+
+/**
+ * @file virt2phys driver interface
+ */
+
+/**
+ * Driver device interface GUID {539c2135-793a-4926-afec-d3a1b61bbc8a}.
+ */
+DEFINE_GUID(GUID_DEVINTERFACE_VIRT2PHYS,
+	0x539c2135, 0x793a, 0x4926,
+	0xaf, 0xec, 0xd3, 0xa1, 0xb6, 0x1b, 0xbc, 0x8a);
+
+/**
+ * Driver device type for IO control codes.
+ */
+#define VIRT2PHYS_DEVTYPE 0x8000
+
+/**
+ * Translate a valid non-paged virtual address to a physical address.
+ *
+ * Note: A physical address zero (0) is reported if input address
+ * is paged out or not mapped. However, if input is a valid mapping
+ * of I/O port 0x0000, output is also zero. There is no way
+ * to distinguish between these cases by return value only.
+ *
+ * Input: a non-paged virtual address (PVOID).
+ *
+ * Output: the corresponding physical address (LARGE_INTEGER).
+ */
+#define IOCTL_VIRT2PHYS_TRANSLATE CTL_CODE( \
+	VIRT2PHYS_DEVTYPE, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
diff --git a/lib/librte_eal/windows/include/rte_windows.h b/lib/librte_eal/windows/include/rte_windows.h
index ed6e4c148..899ed7d87 100644
--- a/lib/librte_eal/windows/include/rte_windows.h
+++ b/lib/librte_eal/windows/include/rte_windows.h
@@ -23,6 +23,8 @@
 
 #include <basetsd.h>
 #include <psapi.h>
+#include <setupapi.h>
+#include <winioctl.h>
 
 /* Have GUIDs defined. */
 #ifndef INITGUID
diff --git a/lib/librte_eal/windows/include/unistd.h b/lib/librte_eal/windows/include/unistd.h
index 757b7f3c5..6b33005b2 100644
--- a/lib/librte_eal/windows/include/unistd.h
+++ b/lib/librte_eal/windows/include/unistd.h
@@ -9,4 +9,7 @@
  * as Microsoft libc does not contain unistd.h. This may be removed
  * in future releases.
  */
+
+#include <io.h>
+
 #endif /* _UNISTD_H_ */
diff --git a/lib/librte_eal/windows/meson.build b/lib/librte_eal/windows/meson.build
index 52978e9d7..ded5a2b80 100644
--- a/lib/librte_eal/windows/meson.build
+++ b/lib/librte_eal/windows/meson.build
@@ -6,10 +6,16 @@ subdir('include')
 sources += files(
 	'eal.c',
 	'eal_debug.c',
+	'eal_file.c',
 	'eal_hugepages.c',
 	'eal_lcore.c',
 	'eal_log.c',
+	'eal_memalloc.c',
+	'eal_memory.c',
+	'eal_mp.c',
 	'eal_thread.c',
 	'fnmatch.c',
 	'getopt.c',
 )
+
+dpdk_conf.set10('RTE_EAL_NUMA_AWARE_HUGEPAGES', true)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers Dmitry Kozlyuk
@ 2020-06-15  6:03                 ` Kinsella, Ray
  2020-06-15  7:41                   ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Kinsella, Ray @ 2020-06-15  6:03 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev
  Cc: Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Neil Horman


On 15/06/2020 01:43, Dmitry Kozlyuk wrote:
> Introduce OS-independent wrappers for memory management operations used
> across DPDK and specifically in common code of EAL:
>
> * rte_mem_map()
> * rte_mem_unmap()
> * rte_mem_page_size()
> * rte_mem_lock()
>
> Windows uses different APIs for memory mapping and reservation, while
> Unices reserve memory by mapping it. Introduce EAL private functions to
> support memory reservation in common code:
>
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
>
> Wrappers follow POSIX semantics limited to DPDK tasks, but their
> signatures deliberately differ from POSIX ones to be safer and more
> expressive. New symbols are internal. Being thin wrappers, they require
> no special maintenance.
>
> Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> ---
>
> Not adding rte_eal_paging.h to the Doxygen index because, to my
> understanding, the index only covers public API, and it was decided to
> keep rte_eal_paging.h functions private.
>
>  lib/librte_eal/common/eal_common_fbarray.c |  40 +++---
>  lib/librte_eal/common/eal_common_memory.c  |  61 ++++-----
>  lib/librte_eal/common/eal_private.h        |  78 ++++++++++-
>  lib/librte_eal/freebsd/Makefile            |   1 +
>  lib/librte_eal/include/rte_eal_paging.h    |  98 +++++++++++++
>  lib/librte_eal/linux/Makefile              |   1 +
>  lib/librte_eal/linux/eal_memalloc.c        |   5 +-
>  lib/librte_eal/rte_eal_version.map         |   9 ++
>  lib/librte_eal/unix/eal_unix_memory.c      | 152 +++++++++++++++++++++
>  lib/librte_eal/unix/meson.build            |   1 +
>  10 files changed, 381 insertions(+), 65 deletions(-)
>  create mode 100644 lib/librte_eal/include/rte_eal_paging.h
>  create mode 100644 lib/librte_eal/unix/eal_unix_memory.c
>
> diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
> index c52ddb967..fd0292a64 100644
> --- a/lib/librte_eal/common/eal_common_fbarray.c
> +++ b/lib/librte_eal/common/eal_common_fbarray.c
> @@ -5,15 +5,16 @@
>  #include <fcntl.h>
>  #include <inttypes.h>
>  #include <limits.h>
> -#include <sys/mman.h>
>  #include <stdint.h>
>  #include <errno.h>
>  #include <string.h>
>  #include <unistd.h>
>  
>  #include <rte_common.h>
> -#include <rte_log.h>
> +#include <rte_eal_paging.h>
>  #include <rte_errno.h>
> +#include <rte_log.h>
> +#include <rte_memory.h>
>  #include <rte_spinlock.h>
>  #include <rte_tailq.h>
>  
> @@ -90,12 +91,9 @@ resize_and_map(int fd, void *addr, size_t len)
>  		return -1;
>  	}
>  
> -	map_addr = mmap(addr, len, PROT_READ | PROT_WRITE,
> -			MAP_SHARED | MAP_FIXED, fd, 0);
> +	map_addr = rte_mem_map(addr, len, RTE_PROT_READ | RTE_PROT_WRITE,
> +			RTE_MAP_SHARED | RTE_MAP_FORCE_ADDRESS, fd, 0);
>  	if (map_addr != addr) {
> -		RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno));
> -		/* pass errno up the chain */
> -		rte_errno = errno;
>  		return -1;
>  	}
>  	return 0;
> @@ -733,7 +731,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  		return -1;
>  	}
>  
> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_mem_page_size();
>  	if (page_sz == (size_t)-1) {
>  		free(ma);
>  		return -1;
> @@ -754,11 +752,13 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  
>  	if (internal_config.no_shconf) {
>  		/* remap virtual area as writable */
> -		void *new_data = mmap(data, mmap_len, PROT_READ | PROT_WRITE,
> -				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
> -		if (new_data == MAP_FAILED) {
> +		static const int flags = RTE_MAP_FORCE_ADDRESS |
> +			RTE_MAP_PRIVATE | RTE_MAP_ANONYMOUS;
> +		void *new_data = rte_mem_map(data, mmap_len,
> +			RTE_PROT_READ | RTE_PROT_WRITE, flags, fd, 0);
> +		if (new_data == NULL) {
>  			RTE_LOG(DEBUG, EAL, "%s(): couldn't remap anonymous memory: %s\n",
> -					__func__, strerror(errno));
> +					__func__, rte_strerror(rte_errno));
>  			goto fail;
>  		}
>  	} else {
> @@ -820,7 +820,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
>  	return 0;
>  fail:
>  	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>  	if (fd >= 0)
>  		close(fd);
>  	free(ma);
> @@ -858,7 +858,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  		return -1;
>  	}
>  
> -	page_sz = sysconf(_SC_PAGESIZE);
> +	page_sz = rte_mem_page_size();
>  	if (page_sz == (size_t)-1) {
>  		free(ma);
>  		return -1;
> @@ -909,7 +909,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
>  	return 0;
>  fail:
>  	if (data)
> -		munmap(data, mmap_len);
> +		rte_mem_unmap(data, mmap_len);
>  	if (fd >= 0)
>  		close(fd);
>  	free(ma);
> @@ -937,8 +937,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
>  	 * really do anything about it, things will blow up either way.
>  	 */
>  
> -	size_t page_sz = sysconf(_SC_PAGESIZE);
> -
> +	size_t page_sz = rte_mem_page_size();
>  	if (page_sz == (size_t)-1)
>  		return -1;
>  
> @@ -957,7 +956,7 @@ rte_fbarray_detach(struct rte_fbarray *arr)
>  		goto out;
>  	}
>  
> -	munmap(arr->data, mmap_len);
> +	rte_mem_unmap(arr->data, mmap_len);
>  
>  	/* area is unmapped, close fd and remove the tailq entry */
>  	if (tmp->fd >= 0)
> @@ -992,8 +991,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  	 * really do anything about it, things will blow up either way.
>  	 */
>  
> -	size_t page_sz = sysconf(_SC_PAGESIZE);
> -
> +	size_t page_sz = rte_mem_page_size();
>  	if (page_sz == (size_t)-1)
>  		return -1;
>  
> @@ -1042,7 +1040,7 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
>  		}
>  		close(fd);
>  	}
> -	munmap(arr->data, mmap_len);
> +	rte_mem_unmap(arr->data, mmap_len);
>  
>  	/* area is unmapped, remove the tailq entry */
>  	TAILQ_REMOVE(&mem_area_tailq, tmp, next);
> diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
> index 4c897a13f..aa377990f 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -11,13 +11,13 @@
>  #include <string.h>
>  #include <unistd.h>
>  #include <inttypes.h>
> -#include <sys/mman.h>
>  #include <sys/queue.h>
>  
>  #include <rte_fbarray.h>
>  #include <rte_memory.h>
>  #include <rte_eal.h>
>  #include <rte_eal_memconfig.h>
> +#include <rte_eal_paging.h>
>  #include <rte_errno.h>
>  #include <rte_log.h>
>  
> @@ -40,18 +40,10 @@
>  static void *next_baseaddr;
>  static uint64_t system_page_sz;
>  
> -#ifdef RTE_EXEC_ENV_LINUX
> -#define RTE_DONTDUMP MADV_DONTDUMP
> -#elif defined RTE_EXEC_ENV_FREEBSD
> -#define RTE_DONTDUMP MADV_NOCORE
> -#else
> -#error "madvise doesn't support this OS"
> -#endif
> -
>  #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
>  void *
>  eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags)
> +	size_t page_sz, int flags, int reserve_flags)
>  {
>  	bool addr_is_hint, allow_shrink, unmap, no_align;
>  	uint64_t map_sz;
> @@ -59,9 +51,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  	uint8_t try = 0;
>  
>  	if (system_page_sz == 0)
> -		system_page_sz = sysconf(_SC_PAGESIZE);
> -
> -	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;
> +		system_page_sz = rte_mem_page_size();
>  
>  	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);
>  
> @@ -105,24 +95,24 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  			return NULL;
>  		}
>  
> -		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_NONE,
> -				mmap_flags, -1, 0);
> -		if (mapped_addr == MAP_FAILED && allow_shrink)
> +		mapped_addr = eal_mem_reserve(
> +			requested_addr, (size_t)map_sz, reserve_flags);
> +		if ((mapped_addr == NULL) && allow_shrink)
>  			*size -= page_sz;
>  
> -		if (mapped_addr != MAP_FAILED && addr_is_hint &&
> -		    mapped_addr != requested_addr) {
> +		if ((mapped_addr != NULL) && addr_is_hint &&
> +				(mapped_addr != requested_addr)) {
>  			try++;
>  			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
>  			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
>  				/* hint was not used. Try with another offset */
> -				munmap(mapped_addr, map_sz);
> -				mapped_addr = MAP_FAILED;
> +				eal_mem_free(mapped_addr, map_sz);
> +				mapped_addr = NULL;
>  				requested_addr = next_baseaddr;
>  			}
>  		}
>  	} while ((allow_shrink || addr_is_hint) &&
> -		 mapped_addr == MAP_FAILED && *size > 0);
> +		(mapped_addr == NULL) && (*size > 0));
>  
>  	/* align resulting address - if map failed, we will ignore the value
>  	 * anyway, so no need to add additional checks.
> @@ -132,20 +122,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  
>  	if (*size == 0) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
> -			strerror(errno));
> -		rte_errno = errno;
> +			rte_strerror(rte_errno));
>  		return NULL;
> -	} else if (mapped_addr == MAP_FAILED) {
> +	} else if (mapped_addr == NULL) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
> -			strerror(errno));
> -		/* pass errno up the call chain */
> -		rte_errno = errno;
> +			rte_strerror(rte_errno));
>  		return NULL;
>  	} else if (requested_addr != NULL && !addr_is_hint &&
>  			aligned_addr != requested_addr) {
>  		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
>  			requested_addr, aligned_addr);
> -		munmap(mapped_addr, map_sz);
> +		eal_mem_free(mapped_addr, map_sz);
>  		rte_errno = EADDRNOTAVAIL;
>  		return NULL;
>  	} else if (requested_addr != NULL && addr_is_hint &&
> @@ -161,7 +148,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  		aligned_addr, *size);
>  
>  	if (unmap) {
> -		munmap(mapped_addr, map_sz);
> +		eal_mem_free(mapped_addr, map_sz);
>  	} else if (!no_align) {
>  		void *map_end, *aligned_end;
>  		size_t before_len, after_len;
> @@ -179,19 +166,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>  		/* unmap space before aligned mmap address */
>  		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
>  		if (before_len > 0)
> -			munmap(mapped_addr, before_len);
> +			eal_mem_free(mapped_addr, before_len);
>  
>  		/* unmap space after aligned end mmap address */
>  		after_len = RTE_PTR_DIFF(map_end, aligned_end);
>  		if (after_len > 0)
> -			munmap(aligned_end, after_len);
> +			eal_mem_free(aligned_end, after_len);
>  	}
>  
>  	if (!unmap) {
>  		/* Exclude these pages from a core dump. */
> -		if (madvise(aligned_addr, *size, RTE_DONTDUMP) != 0)
> -			RTE_LOG(DEBUG, EAL, "madvise failed: %s\n",
> -				strerror(errno));
> +		eal_mem_set_dump(aligned_addr, *size, false);
>  	}
>  
>  	return aligned_addr;
> @@ -547,10 +532,10 @@ rte_eal_memdevice_init(void)
>  int
>  rte_mem_lock_page(const void *virt)
>  {
> -	unsigned long virtual = (unsigned long)virt;
> -	int page_size = getpagesize();
> -	unsigned long aligned = (virtual & ~(page_size - 1));
> -	return mlock((void *)aligned, page_size);
> +	uintptr_t virtual = (uintptr_t)virt;
> +	size_t page_size = rte_mem_page_size();
> +	uintptr_t aligned = RTE_PTR_ALIGN_FLOOR(virtual, page_size);
> +	return rte_mem_lock((void *)aligned, page_size);
>  }
>  
>  int
> diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
> index 6733a2321..1696345c2 100644
> --- a/lib/librte_eal/common/eal_private.h
> +++ b/lib/librte_eal/common/eal_private.h
> @@ -11,6 +11,7 @@
>  
>  #include <rte_dev.h>
>  #include <rte_lcore.h>
> +#include <rte_memory.h>
>  
>  /**
>   * Structure storing internal configuration (per-lcore)
> @@ -202,6 +203,24 @@ int rte_eal_alarm_init(void);
>   */
>  int rte_eal_check_module(const char *module_name);
>  
> +/**
> + * Memory reservation flags.
> + */
> +enum eal_mem_reserve_flags {
> +	/**
> +	 * Reserve hugepages. May be unsupported by some platforms.
> +	 */
> +	EAL_RESERVE_HUGEPAGES = 1 << 0,
> +	/**
> +	 * Force reserving memory at the requested address.
> +	 * This can be a destructive action depending on the implementation.
> +	 *
> +	 * @see RTE_MAP_FORCE_ADDRESS for description of possible consequences
> +	 *      (although implementations are not required to use it).
> +	 */
> +	EAL_RESERVE_FORCE_ADDRESS = 1 << 1
> +};
> +
>  /**
>   * Get virtual area of specified size from the OS.
>   *
> @@ -215,8 +234,8 @@ int rte_eal_check_module(const char *module_name);
>   *   Page size on which to align requested virtual area.
>   * @param flags
>   *   EAL_VIRTUAL_AREA_* flags.
> - * @param mmap_flags
> - *   Extra flags passed directly to mmap().
> + * @param reserve_flags
> + *   Extra flags passed directly to eal_mem_reserve().
>   *
>   * @return
>   *   Virtual area address if successful.
> @@ -233,7 +252,7 @@ int rte_eal_check_module(const char *module_name);
>  /**< immediately unmap reserved virtual area. */
>  void *
>  eal_get_virtual_area(void *requested_addr, size_t *size,
> -		size_t page_sz, int flags, int mmap_flags);
> +		size_t page_sz, int flags, int reserve_flags);
>  
>  /**
>   * Get cpu core_id.
> @@ -493,4 +512,57 @@ eal_file_lock(int fd, enum eal_flock_op op, enum eal_flock_mode mode);
>  int
>  eal_file_truncate(int fd, ssize_t size);
>  
> +/**
> + * Reserve a region of virtual memory.
> + *
> + * Use eal_mem_free() to free reserved memory.
> + *
> + * @param requested_addr
> + *  A desired reservation address which must be page-aligned.
> + *  The system might not respect it.
> + *  NULL means the address will be chosen by the system.
> + * @param size
> + *  Reservation size. Must be a multiple of system page size.
> + * @param flags
> + *  Reservation options, a combination of eal_mem_reserve_flags.
> + * @returns
> + *  Starting address of the reserved area on success, NULL on failure.
> + *  Callers must not access this memory until remapping it.
> + */
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags);
> +
> +/**
> + * Free memory obtained by eal_mem_reserve() or eal_mem_alloc().
> + *
> + * If *virt* and *size* describe a part of the reserved region,
> + * only this part of the region is freed (accurately up to the system
> + * page size). If *virt* points to allocated memory, *size* must match
> + * the one specified on allocation. The behavior is undefined
> + * if the memory pointed by *virt* is obtained from another source
> + * than listed above.
> + *
> + * @param virt
> + *  A virtual address in a region previously reserved.
> + * @param size
> + *  Number of bytes to unreserve.
> + */
> +void
> +eal_mem_free(void *virt, size_t size);
> +
> +/**
> + * Configure memory region inclusion into dumps.
> + *
> + * @param virt
> + *  Starting address of the region.
> + * @param size
> + *  Size of the region.
> + * @param dump
> + *  True to include memory into dumps, false to exclude.
> + * @return
> + *  0 on success, (-1) on failure and rte_errno is set.
> + */
> +int
> +eal_mem_set_dump(void *virt, size_t size, bool dump);
> +
>  #endif /* _EAL_PRIVATE_H_ */
> diff --git a/lib/librte_eal/freebsd/Makefile b/lib/librte_eal/freebsd/Makefile
> index 0f8741d96..2374ba0b7 100644
> --- a/lib/librte_eal/freebsd/Makefile
> +++ b/lib/librte_eal/freebsd/Makefile
> @@ -77,6 +77,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_reciprocal.c
>  
>  # from unix dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_file.c
> +SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += eal_unix_memory.c
>  
>  # from arch dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_FREEBSD) += rte_cpuflags.c
> diff --git a/lib/librte_eal/include/rte_eal_paging.h b/lib/librte_eal/include/rte_eal_paging.h
> new file mode 100644
> index 000000000..ed98e70e9
> --- /dev/null
> +++ b/lib/librte_eal/include/rte_eal_paging.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +
> +/**
> + * @file
> + * @internal
> + *
> + * Wrappers for OS facilities related to memory paging, used across DPDK.
> + */
> +
> +/** Memory protection flags. */
> +enum rte_mem_prot {
> +	RTE_PROT_READ = 1 << 0,   /**< Read access. */
> +	RTE_PROT_WRITE = 1 << 1,  /**< Write access. */
> +	RTE_PROT_EXECUTE = 1 << 2 /**< Code execution. */
> +};
> +
> +/** Additional flags for memory mapping. */
> +enum rte_map_flags {
> +	/** Changes to the mapped memory are visible to other processes. */
> +	RTE_MAP_SHARED = 1 << 0,
> +	/** Mapping is not backed by a regular file. */
> +	RTE_MAP_ANONYMOUS = 1 << 1,
> +	/** Copy-on-write mapping, changes are invisible to other processes. */
> +	RTE_MAP_PRIVATE = 1 << 2,
> +	/**
> +	 * Force mapping to the requested address. This flag should be used
> +	 * with caution, because to fulfill the request implementation
> +	 * may remove all other mappings in the requested region. However,
> +	 * it is not required to do so, thus mapping with this flag may fail.
> +	 */
> +	RTE_MAP_FORCE_ADDRESS = 1 << 3
> +};
> +
> +/**
> + * Map a portion of an opened file or the page file into memory.
> + *
> + * This function is similar to POSIX mmap(3) with common MAP_ANONYMOUS
> + * extension, except for the return value.
> + *
> + * @param requested_addr
> + *  Desired virtual address for mapping. Can be NULL to let OS choose.
> + * @param size
> + *  Size of the mapping in bytes.
> + * @param prot
> + *  Protection flags, a combination of rte_mem_prot values.
> + * @param flags
> + *  Additional mapping flags, a combination of rte_map_flags.
> + * @param fd
> + *  Mapped file descriptor. Can be negative for anonymous mapping.
> + * @param offset
> + *  Offset of the mapped region in fd. Must be 0 for anonymous mappings.
> + * @return
> + *  Mapped address or NULL on failure and rte_errno is set to OS error.
> + */
> +__rte_internal
> +void *
> +rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset);
> +
> +/**
> + * OS-independent implementation of POSIX munmap(3).
> + */
> +__rte_internal
> +int
> +rte_mem_unmap(void *virt, size_t size);
> +
> +/**
> + * Get system page size. This function never fails.
> + *
> + * @return
> + *   Page size in bytes.
> + */
> +__rte_internal
> +size_t
> +rte_mem_page_size(void);
> +
> +/**
> + * Lock in physical memory all pages crossed by the address region.
> + *
> + * @param virt
> + *   Base virtual address of the region.
> + * @param size
> + *   Size of the region.
> + * @return
> + *   0 on success, negative on error.
> + *
> + * @see rte_mem_page_size() to retrieve the page size.
> + * @see rte_mem_lock_page() to lock an entire single page.
> + */
> +__rte_internal
> +int
> +rte_mem_lock(const void *virt, size_t size);
> diff --git a/lib/librte_eal/linux/Makefile b/lib/librte_eal/linux/Makefile
> index 331489f99..8febf2212 100644
> --- a/lib/librte_eal/linux/Makefile
> +++ b/lib/librte_eal/linux/Makefile
> @@ -84,6 +84,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_reciprocal.c
>  
>  # from unix dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_file.c
> +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += eal_unix_memory.c
>  
>  # from arch dir
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUX) += rte_cpuflags.c
> diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
> index 2c717f8bd..bf29b83c6 100644
> --- a/lib/librte_eal/linux/eal_memalloc.c
> +++ b/lib/librte_eal/linux/eal_memalloc.c
> @@ -630,7 +630,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
>  mapped:
>  	munmap(addr, alloc_sz);
>  unmapped:
> -	flags = MAP_FIXED;
> +	flags = EAL_RESERVE_FORCE_ADDRESS;
>  	new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0, flags);
>  	if (new_addr != addr) {
>  		if (new_addr != NULL)
> @@ -687,8 +687,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
>  		return -1;
>  	}
>  
> -	if (madvise(ms->addr, ms->len, MADV_DONTDUMP) != 0)
> -		RTE_LOG(DEBUG, EAL, "madvise failed: %s\n", strerror(errno));
> +	eal_mem_set_dump(ms->addr, ms->len, false);
>  
>  	exit_early = false;
>  
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index d8038749a..196eef5af 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -387,3 +387,12 @@ EXPERIMENTAL {
>  	rte_trace_regexp;
>  	rte_trace_save;
>  };
> +
> +INTERNAL {
> +	global:
> +
> +	rte_mem_lock;
> +	rte_mem_map;
> +	rte_mem_page_size;
> +	rte_mem_unmap;
> +};

Don't

* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()

Belong in the map file also?

> diff --git a/lib/librte_eal/unix/eal_unix_memory.c b/lib/librte_eal/unix/eal_unix_memory.c
> new file mode 100644
> index 000000000..ec7156df9
> --- /dev/null
> +++ b/lib/librte_eal/unix/eal_unix_memory.c
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Dmitry Kozlyuk
> + */
> +
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_errno.h>
> +#include <rte_log.h>
> +
> +#include "eal_private.h"
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +#define EAL_DONTDUMP MADV_DONTDUMP
> +#define EAL_DODUMP   MADV_DODUMP
> +#elif defined RTE_EXEC_ENV_FREEBSD
> +#define EAL_DONTDUMP MADV_NOCORE
> +#define EAL_DODUMP   MADV_CORE
> +#else
> +#error "madvise doesn't support this OS"
> +#endif
> +
> +static void *
> +mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	void *virt = mmap(requested_addr, size, prot, flags, fd, offset);
> +	if (virt == MAP_FAILED) {
> +		RTE_LOG(DEBUG, EAL,
> +			"Cannot mmap(%p, 0x%zx, 0x%x, 0x%x, %d, 0x%zx): %s\n",
> +			requested_addr, size, prot, flags, fd, offset,
> +			strerror(errno));
> +		rte_errno = errno;
> +		return NULL;
> +	}
> +	return virt;
> +}
> +
> +static int
> +mem_unmap(void *virt, size_t size)
> +{
> +	int ret = munmap(virt, size);
> +	if (ret < 0) {
> +		RTE_LOG(DEBUG, EAL, "Cannot munmap(%p, 0x%zx): %s\n",
> +			virt, size, strerror(errno));
> +		rte_errno = errno;
> +	}
> +	return ret;
> +}
> +
> +void *
> +eal_mem_reserve(void *requested_addr, size_t size, int flags)
> +{
> +	int sys_flags = MAP_PRIVATE | MAP_ANONYMOUS;
> +
> +	if (flags & EAL_RESERVE_HUGEPAGES) {
> +#ifdef MAP_HUGETLB
> +		sys_flags |= MAP_HUGETLB;
> +#else
> +		rte_errno = ENOTSUP;
> +		return NULL;
> +#endif
> +	}
> +
> +	if (flags & EAL_RESERVE_FORCE_ADDRESS)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, PROT_NONE, sys_flags, -1, 0);
> +}
> +
> +void
> +eal_mem_free(void *virt, size_t size)
> +{
> +	mem_unmap(virt, size);
> +}
> +
> +int
> +eal_mem_set_dump(void *virt, size_t size, bool dump)
> +{
> +	int flags = dump ? EAL_DODUMP : EAL_DONTDUMP;
> +	int ret = madvise(virt, size, flags);
> +	if (ret) {
> +		RTE_LOG(DEBUG, EAL, "madvise(%p, %#zx, %d) failed: %s\n",
> +				virt, size, flags, strerror(rte_errno));
> +		rte_errno = errno;
> +	}
> +	return ret;
> +}
> +
> +static int
> +mem_rte_to_sys_prot(int prot)
> +{
> +	int sys_prot = PROT_NONE;
> +
> +	if (prot & RTE_PROT_READ)
> +		sys_prot |= PROT_READ;
> +	if (prot & RTE_PROT_WRITE)
> +		sys_prot |= PROT_WRITE;
> +	if (prot & RTE_PROT_EXECUTE)
> +		sys_prot |= PROT_EXEC;
> +
> +	return sys_prot;
> +}
> +
> +void *
> +rte_mem_map(void *requested_addr, size_t size, int prot, int flags,
> +	int fd, size_t offset)
> +{
> +	int sys_flags = 0;
> +	int sys_prot;
> +
> +	sys_prot = mem_rte_to_sys_prot(prot);
> +
> +	if (flags & RTE_MAP_SHARED)
> +		sys_flags |= MAP_SHARED;
> +	if (flags & RTE_MAP_ANONYMOUS)
> +		sys_flags |= MAP_ANONYMOUS;
> +	if (flags & RTE_MAP_PRIVATE)
> +		sys_flags |= MAP_PRIVATE;
> +	if (flags & RTE_MAP_FORCE_ADDRESS)
> +		sys_flags |= MAP_FIXED;
> +
> +	return mem_map(requested_addr, size, sys_prot, sys_flags, fd, offset);
> +}
> +
> +int
> +rte_mem_unmap(void *virt, size_t size)
> +{
> +	return mem_unmap(virt, size);
> +}
> +
> +size_t
> +rte_mem_page_size(void)
> +{
> +	static size_t page_size;
> +
> +	if (!page_size)
> +		page_size = sysconf(_SC_PAGESIZE);
> +
> +	return page_size;
> +}
> +
> +int
> +rte_mem_lock(const void *virt, size_t size)
> +{
> +	int ret = mlock(virt, size);
> +	if (ret)
> +		rte_errno = errno;
> +	return ret;
> +}
> diff --git a/lib/librte_eal/unix/meson.build b/lib/librte_eal/unix/meson.build
> index 21029ba1a..e733910a1 100644
> --- a/lib/librte_eal/unix/meson.build
> +++ b/lib/librte_eal/unix/meson.build
> @@ -3,4 +3,5 @@
>  
>  sources += files(
>  	'eal_file.c',
> +	'eal_unix_memory.c',
>  )

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers
  2020-06-15  6:03                 ` Kinsella, Ray
@ 2020-06-15  7:41                   ` Dmitry Kozlyuk
  2020-06-15  7:41                     ` Kinsella, Ray
  2020-06-15 10:53                     ` Neil Horman
  0 siblings, 2 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15  7:41 UTC (permalink / raw)
  To: Kinsella, Ray
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Neil Horman

On Mon, 15 Jun 2020 07:03:23 +0100
"Kinsella, Ray" <mdr@ashroe.eu> wrote:

[snip]
> > +
> > +INTERNAL {
> > +	global:
> > +
> > +	rte_mem_lock;
> > +	rte_mem_map;
> > +	rte_mem_page_size;
> > +	rte_mem_unmap;
> > +};  
> 
> Don't
> 
> * eal_mem_reserve()
> * eal_mem_free()
> * eal_mem_set_dump()
> 
> Belong in the map file also?

No need to export these funtions, they're only used by librte_eal.

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers
  2020-06-15  7:41                   ` Dmitry Kozlyuk
@ 2020-06-15  7:41                     ` Kinsella, Ray
  2020-06-15 10:53                     ` Neil Horman
  1 sibling, 0 replies; 218+ messages in thread
From: Kinsella, Ray @ 2020-06-15  7:41 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson, Neil Horman

perfect.

On 15/06/2020 08:41, Dmitry Kozlyuk wrote:
> On Mon, 15 Jun 2020 07:03:23 +0100
> "Kinsella, Ray" <mdr@ashroe.eu> wrote:
>
> [snip]
>>> +
>>> +INTERNAL {
>>> +	global:
>>> +
>>> +	rte_mem_lock;
>>> +	rte_mem_map;
>>> +	rte_mem_page_size;
>>> +	rte_mem_unmap;
>>> +};  
>> Don't
>>
>> * eal_mem_reserve()
>> * eal_mem_free()
>> * eal_mem_set_dump()
>>
>> Belong in the map file also?
> No need to export these functions, they're only used by librte_eal.
>

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers
  2020-06-15  7:41                   ` Dmitry Kozlyuk
  2020-06-15  7:41                     ` Kinsella, Ray
@ 2020-06-15 10:53                     ` Neil Horman
  2020-06-15 11:10                       ` Dmitry Kozlyuk
  1 sibling, 1 reply; 218+ messages in thread
From: Neil Horman @ 2020-06-15 10:53 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: Kinsella, Ray, dev, Dmitry Malloy, Narcisa Ana Maria Vasile,
	Fady Bader, Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

On Mon, Jun 15, 2020 at 10:41:20AM +0300, Dmitry Kozlyuk wrote:
> On Mon, 15 Jun 2020 07:03:23 +0100
> "Kinsella, Ray" <mdr@ashroe.eu> wrote:
> 
> [snip]
> > > +
> > > +INTERNAL {
> > > +	global:
> > > +
> > > +	rte_mem_lock;
> > > +	rte_mem_map;
> > > +	rte_mem_page_size;
> > > +	rte_mem_unmap;
> > > +};  
> > 
> > Don't
> > 
> > * eal_mem_reserve()
> > * eal_mem_free()
> > * eal_mem_set_dump()
> > 
> > Belong in the map file also?
> 
> No need to export these functions, they're only used by librte_eal.
> 
But there are lots of locations in dpdk that could be using these functions.  I
count 57 call sites in dpdk for sysconf(_SC_PAGESIZE), spread throughout the
library collection, as well as some others for sysconf(_SC_IOV_MAX) and
sysconf(_SC_NPROCESSORS_CONF).  If the goal is to abstract away the use of
sysconf in dpdk, you probably at least want to export rte_mem_page_size.

the same is likely true for mmap/munmap
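
For example (a hypothetical call site, just to show the substitution):

	/* before: Unix-only */
	size_t page_sz = sysconf(_SC_PAGESIZE);

	/* after: OS-agnostic through the exported wrapper */
	size_t page_sz = rte_mem_page_size();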

Neil

> -- 
> Dmitry Kozlyuk
> 

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers
  2020-06-15 10:53                     ` Neil Horman
@ 2020-06-15 11:10                       ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15 11:10 UTC (permalink / raw)
  To: Neil Horman
  Cc: Kinsella, Ray, dev, Dmitry Malloy, Narcisa Ana Maria Vasile,
	Fady Bader, Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

On Mon, 15 Jun 2020 06:53:45 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Jun 15, 2020 at 10:41:20AM +0300, Dmitry Kozlyuk wrote:
> > On Mon, 15 Jun 2020 07:03:23 +0100
> > "Kinsella, Ray" <mdr@ashroe.eu> wrote:
> > 
> > [snip]  
> > > > +
> > > > +INTERNAL {
> > > > +	global:
> > > > +
> > > > +	rte_mem_lock;
> > > > +	rte_mem_map;
> > > > +	rte_mem_page_size;
> > > > +	rte_mem_unmap;
> > > > +};    
> > > 
> > > Don't
> > > 
> > > * eal_mem_reserve()
> > > * eal_mem_free()
> > > * eal_mem_set_dump()
> > > 
> > > Belong in the map file also?  
> > 
> > No need to export these functions, they're only used by librte_eal.
> >   
> But there are lots of locations in dpdk that could be using these functions.  I
> count 57 call sites in dpdk for sysconf(_SC_PAGESIZE), spread throughout the
> library collection, as well as some others for sysconf(_SC_IOV_MAX) and
> sysconf(_SC_NPROCESSORS_CONF).  If the goal is to abstract away the use of
> sysconf in dpdk, you probably at least want to export rte_mem_page_size.
> 
> the same is likely true for mmap/munmap

My comment (and Ray's, I believe) was about the eal_mem_*() functions. The
ones you're talking about, rte_mem_*(), are exported from EAL, but only
visible to DPDK. Everything above is true: sysconf(), etc., can be replaced
as the need arises to make the calling code OS-agnostic.
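
To illustrate the split (hypothetical call sites): any DPDK library or
PMD can use the INTERNAL wrapper directly,

	size_t sz = rte_mem_page_size(); /* exported via the map file */

while eal_mem_reserve() may only ever be called inside librte_eal
itself, so it needs no map file entry at all.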

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 04/12] eal/mem: extract common code for memseg list initialization
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 04/12] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
@ 2020-06-15 13:13                 ` Thomas Monjalon
  0 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-15 13:13 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Anatoly Burakov, Bruce Richardson

15/06/2020 02:43, Dmitry Kozlyuk:
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> +	RTE_LOG(DEBUG, EAL,
> +		"Memseg list allocated at socket %i, page size 0x%"PRIx64"kB\n",
> +		socket_id, (size_t)page_sz >> 10);

The cast to size_t must be removed to match PRIx64 expectation.
I am fixing while merging.
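
(For reference, a sketch of the fixed line, assuming page_sz is
uint64_t:

	RTE_LOG(DEBUG, EAL,
		"Memseg list allocated at socket %i, page size 0x%"PRIx64"kB\n",
		socket_id, page_sz >> 10);

i.e. only the cast goes away.)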



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
@ 2020-06-15 15:21                 ` Thomas Monjalon
  2020-06-15 15:39                   ` Dmitry Kozlyuk
  0 siblings, 1 reply; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-15 15:21 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Dmitry Kozlyuk, Harini Ramakrishnan,
	Omar Cardona, Pallavi Kadam, Ranjit Menon

15/06/2020 02:43, Dmitry Kozlyuk:
> +	infos_size = 0;
> +	if (!GetLogicalProcessorInformationEx(
> +			RelationNumaNode, NULL, &infos_size)) {
> +		DWORD error = GetLastError();
> +		if (error != ERROR_INSUFFICIENT_BUFFER) {
> +			log_early("Cannot get NUMA node info size, error %lu\n",
> +				GetLastError());
> +			rte_errno = ENOMEM;
> +			return -1;
> +		}
> +	}
> +
> +	infos = malloc(infos_size);
> +	if (infos == NULL) {
> +		log_early("Cannot allocate memory for NUMA node information\n");
> +		rte_errno = ENOMEM;
> +		return -1;
> +	}
> +
> +	if (!GetLogicalProcessorInformationEx(
> +			RelationNumaNode, infos, &infos_size)) {
> +		log_early("Cannot get NUMA node information, error %lu\n",
> +			GetLastError());
> +		rte_errno = EINVAL;
> +		return -1;
> +	}

rte_errno is unknown

It seems to be fixed in patch 12:

--- a/lib/librte_eal/windows/eal_windows.h
+++ b/lib/librte_eal/windows/eal_windows.h
@@ -9,8 +9,24 @@
  * @file Facilities private to Windows EAL
  */
 
+#include <rte_errno.h>
 #include <rte_windows.h>


I'll merge it in patch 9




^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection
  2020-06-15 15:21                 ` Thomas Monjalon
@ 2020-06-15 15:39                   ` Dmitry Kozlyuk
  0 siblings, 0 replies; 218+ messages in thread
From: Dmitry Kozlyuk @ 2020-06-15 15:39 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, Harini Ramakrishnan, Omar Cardona,
	Pallavi Kadam, Ranjit Menon

[snip]
> > +		rte_errno = EINVAL;
> > +		return -1;
> > +	}  
> 
> rte_errno is unknown
> 
> It seems to be fixed in patch 12:
> 
> --- a/lib/librte_eal/windows/eal_windows.h
> +++ b/lib/librte_eal/windows/eal_windows.h
> @@ -9,8 +9,24 @@
>   * @file Facilities private to Windows EAL
>   */
>  
> +#include <rte_errno.h>
>  #include <rte_windows.h>
> 
> 
> I'll merge it in patch 9

OK. Thanks for both fixes while merging and sorry for the mess. I'll try to
automate local per-commit testing next time (test-meson-builds, etc).

-- 
Dmitry Kozlyuk

^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 00/12] Windows basic memory management
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (11 preceding siblings ...)
  2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 12/12] eal/windows: implement basic memory management Dmitry Kozlyuk
@ 2020-06-15 17:34               ` Thomas Monjalon
  2020-06-16  1:52               ` Ranjit Menon
  13 siblings, 0 replies; 218+ messages in thread
From: Thomas Monjalon @ 2020-06-15 17:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk
  Cc: dev, Dmitry Malloy, Narcisa Ana Maria Vasile, Fady Bader,
	Tal Shnaiderman, anatoly.burakov

15/06/2020 02:43, Dmitry Kozlyuk:
> Note for v9:
> rte_eal_memory.h renamed, dependent patchsets have to be updated.
> 
> This patchset implements basic MM with the following features:
> 
> * Hugepages are dynamically allocated in user-mode.
> * Only 2MB hugepages are supported.
> * IOVA is always PA, obtained through kernel-mode driver.
> * No 32-bit support (presumably not demanded).
> * No multi-process support (it is forcefully disabled).
> * No-huge mode for testing without IOVA is available.
> 
> Testing revealed Windows Server 2019 does not allow allocating hugepage
> memory at a reserved address, despite the advertised API. So the
> allocator has to temporarily free the region to be allocated. This
> creates an inherent race condition. This issue is being discussed with
> Microsoft privately.
> 
> New EAL public functions for memory mapping are introduced to mitigate
> OS differences in DPDK libraries and applications: rte_mem_map,
> rte_mem_unmap, rte_mem_lock, rte_mem_page_size.
> 
> To support common MM routines, internal wrappers for low-level memory
> reservation and file management are introduced. These changes affect
> Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
> (suggested by Thomas).
> 
> To avoid code duplication between Linux and Windows EAL, common code
> for EALs supporting dynamic memory allocation is extracted
> (discussed with Anatoly Burakov in v4 thread). This is a separate
> patch to ease the review, but it can be merged with the previous one.
> 
> EAL tracepoints save size_t values as long, which is invalid on Windows.
> New size_t emitter for tracepoints is introduced (suggested by Jerin
> Jacob to Fady Bader, see [1]). Also, to avoid workaround in every file
> using the tracepoints, stubs are added to Windows EAL.
> 
> Entire <sys/queue.h> is imported from FreeBSD, replacing existing
> partial import. There is already a license exception for this file.
> The file is imported as-is, so it causes a bunch of checkpatch warnings.
> 
> [1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html
[...]
> Dmitry Kozlyuk (12):
>   eal: replace rte_page_sizes with a set of constants
>   eal: introduce internal wrappers for file operations
>   eal: introduce memory management wrappers
>   eal/mem: extract common code for memseg list initialization
>   eal/mem: extract common code for dynamic memory allocation
>   trace: add size_t field emitter
>   eal/windows: add tracing support stubs
>   eal/windows: replace sys/queue.h with a complete one from FreeBSD
>   eal/windows: improve CPU and NUMA node detection
>   doc/windows: split build and run instructions
>   eal/windows: initialize hugepage info
>   eal/windows: implement basic memory management

Applied, thanks for the huge work!



^ permalink raw reply	[flat|nested] 218+ messages in thread

* Re: [dpdk-dev] [PATCH v9 00/12] Windows basic memory management
  2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
                                 ` (12 preceding siblings ...)
  2020-06-15 17:34               ` [dpdk-dev] [PATCH v9 00/12] Windows " Thomas Monjalon
@ 2020-06-16  1:52               ` Ranjit Menon
  13 siblings, 0 replies; 218+ messages in thread
From: Ranjit Menon @ 2020-06-16  1:52 UTC (permalink / raw)
  To: dev


On 6/14/2020 5:43 PM, Dmitry Kozlyuk wrote:
> Note for v9:
> rte_eal_memory.h renamed, dependent patchsets have to be updated.
>
> This patchset implements basic MM with the following features:
>
> * Hugepages are dynamically allocated in user-mode.
> * Only 2MB hugepages are supported.
> * IOVA is always PA, obtained through kernel-mode driver.
> * No 32-bit support (presumably not demanded).
> * No multi-process support (it is forcefully disabled).
> * No-huge mode for testing without IOVA is available.
>
> Testing revealed Windows Server 2019 does not allow allocating hugepage
> memory at a reserved address, despite the advertised API. So the
> allocator has to temporarily free the region to be allocated. This
> creates an inherent race condition. This issue is being discussed with
> Microsoft privately.
>
> New EAL public functions for memory mapping are introduced to mitigate
> OS differences in DPDK libraries and applications: rte_mem_map,
> rte_mem_unmap, rte_mem_lock, rte_mem_page_size.
>
> To support common MM routines, internal wrappers for low-level memory
> reservation and file management are introduced. These changes affect
> Linux and FreeBSD EAL. Shared code is placed under the /unix/ subdirectory
> (suggested by Thomas).
>
> To avoid code duplication between Linux and Windows EAL, common code
> for EALs supporting dynamic memory allocation is extracted
> (discussed with Anatoly Burakov in v4 thread). This is a separate
> patch to ease the review, but it can be merged with the previous one.
>
> EAL tracepoints save size_t values as long, which is invalid on Windows.
> New size_t emitter for tracepoints is introduced (suggested by Jerin
> Jacob to Fady Bader, see [1]). Also, to avoid workaround in every file
> using the tracepoints, stubs are added to Windows EAL.
>
> Entire <sys/queue.h> is imported from FreeBSD, replacing existing
> partial import. There is already a license exception for this file.
> The file is imported as-is, so it causes a bunch of checkpatch warnings.
>
> [1]: http://mails.dpdk.org/archives/dev/2020-May/168076.html
>
> ---
>
> v9:
>      * Fix build on 32-bit and FreeBSD.
>      * Rename rte_eal_memory.h to rte_eal_paging.h.
>      * Do not use rte_panic() in library code.
>      * Fix typos, comments, string formatting.
>      * Split documentation commits.
>
Great work on this, Dmitry!

I know the patch has already been applied, but:

Acked-by: Ranjit Menon <ranjit.menon@intel.com>



^ permalink raw reply	[flat|nested] 218+ messages in thread

end of thread, other threads:[~2020-06-16  1:52 UTC | newest]

Thread overview: 218+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-30  4:10 [dpdk-dev] [RFC PATCH 0/9] Windows basic memory management Dmitry Kozlyuk
2020-03-30  4:10 ` [dpdk-dev] [PATCH 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
2020-03-30  6:58   ` Jerin Jacob
2020-03-30 13:41     ` Dmitry Kozlyuk
2020-04-10  1:45   ` Ranjit Menon
2020-04-10  2:50     ` Dmitry Kozlyuk
2020-04-10  2:59       ` Dmitry Kozlyuk
2020-04-10 19:39       ` Ranjit Menon
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 2/9] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 3/9] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 4/9] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 5/9] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-03-30  7:04   ` Jerin Jacob
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 6/9] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-03-30  7:31   ` Thomas Monjalon
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 7/9] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 8/9] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-03-30  4:10 ` [dpdk-dev] [RFC PATCH 9/9] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-04-10 16:43 ` [dpdk-dev] [PATCH v2 00/10] eal: Windows " Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
2020-04-13  5:32     ` Ranjit Menon
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 03/10] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 04/10] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-04-13  7:50     ` Tal Shnaiderman
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-04-10 16:43   ` [dpdk-dev] [PATCH v2 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-04-10 22:04     ` Narcisa Ana Maria Vasile
2020-04-11  1:16       ` Dmitry Kozlyuk
2020-04-14 19:44   ` [dpdk-dev] [PATCH v3 00/10] Windows " Dmitry Kozlyuk
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 1/1] virt2phys: virtual to physical address translator for Windows Dmitry Kozlyuk
2020-04-14 23:35       ` Ranjit Menon
2020-04-15 15:19         ` Thomas Monjalon
2020-04-21  6:23       ` Ophir Munk
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 02/10] eal/windows: do not expose private EAL facilities Dmitry Kozlyuk
2020-04-21 22:40       ` Thomas Monjalon
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 03/10] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 04/10] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 05/10] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-04-15 21:48       ` Thomas Monjalon
2020-04-17 12:24       ` Burakov, Anatoly
2020-04-28 23:50         ` Dmitry Kozlyuk
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 06/10] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-04-15 22:17       ` Thomas Monjalon
2020-04-15 23:32         ` Dmitry Kozlyuk
2020-04-17 12:43       ` Burakov, Anatoly
2020-04-20  5:59       ` Tal Shnaiderman
2020-04-21 23:36         ` Dmitry Kozlyuk
2020-04-22  0:55       ` Ranjit Menon
2020-04-22  2:07       ` Ranjit Menon
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 07/10] eal: extract common code for memseg list initialization Dmitry Kozlyuk
2020-04-15 22:19       ` Thomas Monjalon
2020-04-17 13:04       ` Burakov, Anatoly
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 08/10] eal/windows: fix rte_page_sizes with Clang on Windows Dmitry Kozlyuk
2020-04-15  9:34       ` Jerin Jacob
2020-04-15 10:32         ` Dmitry Kozlyuk
2020-04-15 10:57           ` Jerin Jacob
2020-04-15 11:09             ` Dmitry Kozlyuk
2020-04-15 11:17               ` Jerin Jacob
2020-05-06  5:41                 ` Ray Kinsella
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 09/10] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-04-14 19:44     ` [dpdk-dev] [PATCH v3 10/10] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-04-15  9:42       ` Jerin Jacob
2020-04-16 18:34       ` Ranjit Menon
2020-04-23  1:00         ` Dmitry Kozlyuk
2020-04-14 23:37     ` [dpdk-dev] [PATCH v3 00/10] Windows " Kadam, Pallavi
2020-04-28 23:50   ` [dpdk-dev] [PATCH v4 0/8] " Dmitry Kozlyuk
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 1/8] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 2/8] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-04-29 16:41       ` Burakov, Anatoly
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 3/8] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-04-29 17:13       ` Burakov, Anatoly
2020-04-30 13:59         ` Burakov, Anatoly
2020-05-01 19:00         ` Dmitry Kozlyuk
2020-05-05 14:43           ` Burakov, Anatoly
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 4/8] eal: extract common code for memseg list initialization Dmitry Kozlyuk
2020-05-05 16:08       ` Burakov, Anatoly
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 5/8] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 6/8] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 7/8] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-04-28 23:50     ` [dpdk-dev] [PATCH v4 8/8] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-04-29  1:18       ` Ranjit Menon
2020-05-01 19:19         ` Dmitry Kozlyuk
2020-05-05 16:24       ` Burakov, Anatoly
2020-05-05 23:20         ` Dmitry Kozlyuk
2020-05-06  9:46           ` Burakov, Anatoly
2020-05-06 21:53             ` Dmitry Kozlyuk
2020-05-07 11:57               ` Burakov, Anatoly
2020-05-13  8:24       ` Fady Bader
2020-05-13  8:42         ` Dmitry Kozlyuk
2020-05-13  9:09           ` Fady Bader
2020-05-13  9:22             ` Fady Bader
2020-05-13  9:38             ` Dmitry Kozlyuk
2020-05-13 12:25               ` Fady Bader
2020-05-18  0:17                 ` Dmitry Kozlyuk
2020-05-18 22:25                   ` Dmitry Kozlyuk
2020-05-25  0:37     ` [dpdk-dev] [PATCH v5 0/8] Windows " Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-05-28  7:59         ` Thomas Monjalon
2020-05-28 10:09           ` Dmitry Kozlyuk
2020-05-28 11:29             ` Thomas Monjalon
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-05-27  6:33         ` Ray Kinsella
2020-05-27 16:34           ` Dmitry Kozlyuk
2020-05-28 11:26         ` Burakov, Anatoly
2020-06-01 21:08           ` Thomas Monjalon
2020-05-28 11:52         ` Burakov, Anatoly
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
2020-05-28  7:31         ` Thomas Monjalon
2020-05-28 10:04           ` Dmitry Kozlyuk
2020-05-28 11:46         ` Burakov, Anatoly
2020-05-28 14:41           ` Dmitry Kozlyuk
2020-05-29  8:49             ` Burakov, Anatoly
2020-05-28 12:19         ` Burakov, Anatoly
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
2020-05-28  8:34         ` Thomas Monjalon
2020-05-28 12:21         ` Burakov, Anatoly
2020-05-28 13:24           ` Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 06/11] trace: add size_t field emitter Dmitry Kozlyuk
2020-05-25  5:53         ` Jerin Jacob
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-05-25  0:37       ` [dpdk-dev] [PATCH v5 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-06-02 23:03       ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
2020-06-03  1:59           ` Stephen Hemminger
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-06-03 12:07           ` Neil Horman
2020-06-03 12:34             ` Dmitry Kozlyuk
2020-06-04 21:07               ` Neil Horman
2020-06-05  0:16                 ` Dmitry Kozlyuk
2020-06-05 11:19                   ` Neil Horman
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
2020-06-09 13:36           ` Burakov, Anatoly
2020-06-09 14:17             ` Dmitry Kozlyuk
2020-06-10 10:26               ` Burakov, Anatoly
2020-06-10 14:31                 ` Dmitry Kozlyuk
2020-06-10 15:48                   ` Burakov, Anatoly
2020-06-10 16:39                     ` Dmitry Kozlyuk
2020-06-11  8:59                       ` Burakov, Anatoly
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 06/11] trace: add size_t field emitter Dmitry Kozlyuk
2020-06-03  3:29           ` Jerin Jacob
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-06-02 23:03         ` [dpdk-dev] [PATCH v6 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-06-08  7:41         ` [dpdk-dev] [PATCH v6 00/11] Windows " Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
2020-06-09 11:14             ` Tal Shnaiderman
2020-06-09 13:49               ` Burakov, Anatoly
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 06/11] trace: add size_t field emitter Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-06-08  7:41           ` [dpdk-dev] [PATCH v7 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-06-10 14:27           ` [dpdk-dev] [PATCH v8 00/11] Windows " Dmitry Kozlyuk
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 01/11] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 02/11] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-06-11 17:13               ` Thomas Monjalon
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 03/11] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-06-12 10:47               ` Thomas Monjalon
2020-06-12 13:44                 ` Dmitry Kozliuk
2020-06-12 13:54                   ` Thomas Monjalon
2020-06-12 20:24                     ` Dmitry Kozliuk
2020-06-12 21:37                       ` Thomas Monjalon
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 04/11] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
2020-06-12 15:39               ` Thomas Monjalon
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 05/11] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 06/11] trace: add size_t field emitter Dmitry Kozlyuk
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 07/11] eal/windows: add tracing support stubs Dmitry Kozlyuk
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 08/11] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 09/11] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-06-12 21:45               ` Thomas Monjalon
2020-06-12 22:09               ` Thomas Monjalon
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 10/11] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-06-12 21:55               ` Thomas Monjalon
2020-06-10 14:27             ` [dpdk-dev] [PATCH v8 11/11] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-06-12 22:12               ` Thomas Monjalon
2020-06-11 17:29             ` [dpdk-dev] [PATCH v8 00/11] Windows " Thomas Monjalon
2020-06-12 22:00               ` Thomas Monjalon
2020-06-15  0:43             ` [dpdk-dev] [PATCH v9 00/12] " Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 01/12] eal: replace rte_page_sizes with a set of constants Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 02/12] eal: introduce internal wrappers for file operations Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 03/12] eal: introduce memory management wrappers Dmitry Kozlyuk
2020-06-15  6:03                 ` Kinsella, Ray
2020-06-15  7:41                   ` Dmitry Kozlyuk
2020-06-15  7:41                     ` Kinsella, Ray
2020-06-15 10:53                     ` Neil Horman
2020-06-15 11:10                       ` Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 04/12] eal/mem: extract common code for memseg list initialization Dmitry Kozlyuk
2020-06-15 13:13                 ` Thomas Monjalon
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 05/12] eal/mem: extract common code for dynamic memory allocation Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 06/12] trace: add size_t field emitter Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 07/12] eal/windows: add tracing support stubs Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 08/12] eal/windows: replace sys/queue.h with a complete one from FreeBSD Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 09/12] eal/windows: improve CPU and NUMA node detection Dmitry Kozlyuk
2020-06-15 15:21                 ` Thomas Monjalon
2020-06-15 15:39                   ` Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 10/12] doc/windows: split build and run instructions Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 11/12] eal/windows: initialize hugepage info Dmitry Kozlyuk
2020-06-15  0:43               ` [dpdk-dev] [PATCH v9 12/12] eal/windows: implement basic memory management Dmitry Kozlyuk
2020-06-15 17:34               ` [dpdk-dev] [PATCH v9 00/12] Windows " Thomas Monjalon
2020-06-16  1:52               ` Ranjit Menon
