DPDK patches and discussions
* [dpdk-dev] [RFC 0/4] DPDK multiprocess rework
@ 2017-05-19 16:39 Anatoly Burakov
  2017-05-19 16:39 ` [dpdk-dev] [RFC 1/4] vfio: refactor sockets into separate files Anatoly Burakov
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Anatoly Burakov @ 2017-05-19 16:39 UTC
  To: dev

This is a proof-of-concept proposal to rework how DPDK secondary processes
work. While the code has some limitations, it works well enough to demonstrate
the concept, and it can successfully run all existing multiprocess applications.

Current problems with DPDK secondary processes:
* ASLR interferes with mappings
  * "Fixed" by disabling ASLR, but not really a solution
* Secondary process may map things into where we want to map shared memory
  * _Almost_ works with --base-virtaddr, but unreliable and tedious
* Function pointers don't work (so e.g. hash library is broken)

Proposed solution:

Instead of running secondary process and mapping resources from primary process,
the following is done:
0) compile all applications as position-independent executables, compile DPDK as
   a shared library
1) fork() from primary process
2) dlopen() secondary process binary
3) use dlsym() to find entry point
4) run the application code while having all resources already mapped
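
Steps 1-4 above can be sketched roughly as follows. This is a minimal
illustration, not the code from this patchset: run_in_forked_child() is a
hypothetical helper, and it assumes the target is a shared object that the
dynamic loader accepts and that exports an entry point with a main()-like
signature.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*entry_fn)(int, char **);

/*
 * Hypothetical helper: fork, dlopen() `path` in the child, look up `symbol`
 * with dlsym() and call it. Returns the child's exit status, or -1 if the
 * fork/wait itself failed. The child exits with 127 if dlopen() fails and
 * 126 if the symbol cannot be found.
 */
int run_in_forked_child(const char *path, const char *symbol,
                        int argc, char **argv)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* child: everything the parent had mapped is still mapped here */
        entry_fn entry;
        void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);

        if (handle == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            _exit(127);
        }
        /* POSIX-recommended way to convert dlsym()'s void * to a function */
        *(void **)&entry = dlsym(handle, symbol);
        if (entry == NULL) {
            fprintf(stderr, "dlsym failed: %s\n", dlerror());
            _exit(126);
        }
        _exit(entry(argc, argv) & 0xff);
    }

    /* parent: wait for the child and report its exit status */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On older glibc versions this needs to be linked with -ldl.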

Benefits:
* No more ASLR issues
* No need for --base-virtaddr
* Function pointers from primary process will work in secondaries
  * Hash library (and any other library that uses function pointers internally)
    will work correctly in multi-process scenario
  * ethdev data can be moved to shared memory
  * Primary process interrupt callbacks can be run by secondary process
* More secure as all applications are compiled as position-independent binaries
  (default on Fedora)

Potential drawbacks (that we could think of):
* Kind of a hack
* Puts some code restrictions on secondary processes
  * Anything happening before EAL init will be run twice
* Some use cases are no longer possible (attaching to a dead primary)
* May impact binaries compiled to use a lot (kilobytes) of thread-local storage[1]
* Likely wouldn't work for static linking

There are also a number of issues that need to be resolved, but those are
implementation details and are out of scope for this RFC.

What is explicitly out of scope:
* Fixing interrupts in secondary processes
* Fixing hotplug in secondary processes

These currently do not work in secondary processes, and this proposal does
nothing to change that. They are better addressed by a dedicated EAL-internal
IPC proposal.


Technical nitty-gritty

Things quickly get confusing, so terminology:
- Original Primary is the normal DPDK primary process
- Forked Primary is a "clean slate" primary process, from which all secondary
  processes will be forked (threads and fork don't mix well, so the fork is
  done after all the hugepage and PCI data is mapped, but before any threads
  are spun up)
- Original Secondary is a process that connects to Forked Primary, sends some
  data and triggers a fork
- Forked Secondary is the _actual_ secondary process (forked from Forked
  Primary)

Timeline:
- Original Primary starts
- Forked Primary is forked from Original Primary
- Original Secondary starts and connects to Forked Primary
- Forked Primary forks into Forked Secondary
- Original Secondary waits until Forked Secondary dies

During EAL init, Original Primary does a fork() to form a Forked Primary - a
"clean slate" starting point for secondary processes. Forked Primary opens a
local socket (a la VFIO) and starts listening for incoming connections.

Original Secondary process connects to Forked Primary, sends its stdout/log
fds, command-line parameters, etc. over the local socket, waits for Forked
Secondary to die, and then exits (Original Secondary does _not_ map anything
or do any EAL init; it rte_exit()'s from inside rte_eal_init()). Forked
Secondary then executes the secondary's main() with the received command-line
arguments, and execution of the secondary process continues from there.
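
For reference, passing a file descriptor (such as stdout or a log fd) over a
local socket can be sketched as below. This is an illustration only and differs
from the helpers in patch 1/4: send_fd() and recv_fd() are hypothetical names,
and the sketch uses the portable CMSG_*() macros with a CMSG_SPACE()-sized
buffer.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor as SCM_RIGHTS ancillary data; returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receive one descriptor; returns the new fd in this process, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    /* make sure we really got exactly one SCM_RIGHTS descriptor */
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}
```

The received descriptor is a new fd number that refers to the same open file
as the sender's, which is what lets the forked process write to the original
process's stdout.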

Why pre-fork and not pthread like VFIO?

Pthreads and fork() don't mix well, because fork() duplicates only the calling
thread: in the child, all other threads disappear, leaving behind thread
stacks, held locks and possibly inconsistent state of both app data and system
libraries. Forking from a single-threaded context, on the other hand, is safe.
The current implementation doesn't _exactly_ fork from a single-threaded
context, but this can be fixed later by rearranging EAL init.

[1]: https://www.redhat.com/archives/phil-list/2003-February/msg00077.html

Anatoly Burakov (4):
  vfio: refactor sockets into separate files
  eal: enable experimental dlopen()-based secondary process support
  apps: enable new secondary process support in multiprocess apps
  mk: default to compiling shared libraries

 config/common_base                                 |   2 +-
 .../client_server_mp/mp_client/Makefile            |   2 +-
 examples/multi_process/simple_mp/Makefile          |   2 +-
 examples/multi_process/symmetric_mp/Makefile       |   2 +-
 lib/librte_eal/linuxapp/eal/Makefile               |   3 +
 lib/librte_eal/linuxapp/eal/eal.c                  | 105 ++++-
 lib/librte_eal/linuxapp/eal/eal_mp.h               |  54 +++
 lib/librte_eal/linuxapp/eal/eal_mp_primary.c       | 477 +++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/eal_mp_secondary.c     | 301 +++++++++++++
 lib/librte_eal/linuxapp/eal/eal_mp_socket.c        | 301 +++++++++++++
 lib/librte_eal/linuxapp/eal/eal_mp_socket.h        |  54 +++
 lib/librte_eal/linuxapp/eal/eal_vfio.c             |  20 +-
 lib/librte_eal/linuxapp/eal/eal_vfio.h             |  24 +-
 lib/librte_eal/linuxapp/eal/eal_vfio_mp_sync.c     | 243 ++---------
 14 files changed, 1347 insertions(+), 243 deletions(-)
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp.h
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_primary.c
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_secondary.c
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_socket.c
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_socket.h

-- 
2.7.4


* [dpdk-dev] [RFC 1/4] vfio: refactor sockets into separate files
  2017-05-19 16:39 [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Anatoly Burakov
@ 2017-05-19 16:39 ` Anatoly Burakov
  2017-05-19 16:39 ` [dpdk-dev] [RFC 2/4] eal: enable experimental dlopen()-based secondary process support Anatoly Burakov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Anatoly Burakov @ 2017-05-19 16:39 UTC
  To: dev

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linuxapp/eal/Makefile           |   1 +
 lib/librte_eal/linuxapp/eal/eal_mp_socket.c    | 301 +++++++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/eal_mp_socket.h    |  54 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c         |  20 +-
 lib/librte_eal/linuxapp/eal/eal_vfio.h         |  24 +-
 lib/librte_eal/linuxapp/eal/eal_vfio_mp_sync.c | 243 +++-----------------
 6 files changed, 410 insertions(+), 233 deletions(-)
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_socket.c
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_socket.h

diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 640afd0..24aab8d 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -60,6 +60,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_xen_memory.c
 endif
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_thread.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_log.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_mp_socket.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio_mp_sync.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_pci.c
diff --git a/lib/librte_eal/linuxapp/eal/eal_mp_socket.c b/lib/librte_eal/linuxapp/eal/eal_mp_socket.c
new file mode 100755
index 0000000..18c5a72
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_mp_socket.c
@@ -0,0 +1,301 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+/* sys/un.h with __USE_MISC uses strlen, which is unsafe */
+#ifdef __USE_MISC
+#define REMOVED_USE_MISC
+#undef __USE_MISC
+#endif
+#include <sys/un.h>
+/* make sure we redefine __USE_MISC only if it was previously undefined */
+#ifdef REMOVED_USE_MISC
+#define __USE_MISC
+#undef REMOVED_USE_MISC
+#endif
+
+#include <rte_log.h>
+
+#include "eal_mp_socket.h"
+
+/**
+ * @file
+ * Sockets for communication between primary and secondary processes.
+ */
+
+#define CMSGLEN (CMSG_LEN(sizeof(int)))
+#define FD_TO_CMSGHDR(fd, chdr) \
+	    do {\
+	        (chdr).cmsg_len = CMSGLEN;\
+	        (chdr).cmsg_level = SOL_SOCKET;\
+	        (chdr).cmsg_type = SCM_RIGHTS;\
+	        memcpy((chdr).__cmsg_data, &(fd), sizeof(fd));\
+	    } while (0)
+#define CMSGHDR_TO_FD(chdr, fd) \
+	        memcpy(&(fd), (chdr).__cmsg_data, sizeof(fd))
+
+/* send a request, return -1 on error */
+int
+eal_mp_sync_send_request(int socket, int req)
+{
+	struct msghdr hdr;
+	struct iovec iov;
+	int buf;
+	int ret;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	buf = req;
+
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+	iov.iov_base = (char *) &buf;
+	iov.iov_len = sizeof(buf);
+
+	ret = sendmsg(socket, &hdr, 0);
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+
+/* receive a request and return it */
+int
+eal_mp_sync_receive_request(int socket)
+{
+	int buf;
+	struct msghdr hdr;
+	struct iovec iov;
+	int ret, req;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	buf = SOCKET_ERR;
+
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+	iov.iov_base = (char *) &buf;
+	iov.iov_len = sizeof(buf);
+
+	ret = recvmsg(socket, &hdr, 0);
+	if (ret < 0)
+		return -1;
+
+	req = buf;
+
+	return req;
+}
+
+/* send OK in message, fd in control message */
+int
+eal_mp_sync_send_fd(int socket, int fd)
+{
+	int buf;
+	struct msghdr hdr;
+	struct cmsghdr *chdr;
+	char chdr_buf[CMSGLEN];
+	struct iovec iov;
+	int ret;
+
+	chdr = (struct cmsghdr *) chdr_buf;
+	memset(chdr, 0, sizeof(chdr_buf));
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+	iov.iov_base = (char *) &buf;
+	iov.iov_len = sizeof(buf);
+	hdr.msg_control = chdr;
+	hdr.msg_controllen = CMSGLEN;
+
+	buf = SOCKET_FD;
+	FD_TO_CMSGHDR(fd, *chdr);
+
+	ret = sendmsg(socket, &hdr, 0);
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+
+/* receive OK in message, fd in control message */
+int
+eal_mp_sync_receive_fd(int socket)
+{
+	int buf;
+	struct msghdr hdr;
+	struct cmsghdr *chdr;
+	char chdr_buf[CMSGLEN];
+	struct iovec iov;
+	int ret, req, fd = -1;
+
+	buf = SOCKET_ERR;
+
+	chdr = (struct cmsghdr *) chdr_buf;
+	memset(chdr, 0, sizeof(chdr_buf));
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+	iov.iov_base = (char *) &buf;
+	iov.iov_len = sizeof(buf);
+	hdr.msg_control = chdr;
+	hdr.msg_controllen = CMSGLEN;
+
+	ret = recvmsg(socket, &hdr, 0);
+	if (ret < 0)
+		return -1;
+
+	req = buf;
+
+	if (req != SOCKET_FD)
+		return -1;
+
+	CMSGHDR_TO_FD(*chdr, fd);
+
+	return fd;
+}
+
+/* send a data buffer, return -1 on error */
+int eal_mp_sync_send_data(int socket, void *data, int len)
+{
+	struct msghdr hdr;
+	struct iovec iov;
+	int ret;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+	iov.iov_base = data;
+	iov.iov_len = (size_t) len;
+
+	ret = sendmsg(socket, &hdr, 0);
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+
+/* receive data into buffer of size sz, return -1 on error or truncation */
+int eal_mp_sync_receive_data(int socket, void *data, int sz)
+{
+	struct msghdr hdr;
+	struct iovec iov;
+	int ret;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	/* receive path */
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+	iov.iov_base = data;
+	iov.iov_len = (size_t) sz;
+
+	ret = recvmsg(socket, &hdr, 0);
+	if (ret < 0 || (hdr.msg_flags & MSG_TRUNC))
+		return -1;
+
+	/* path received */
+
+	return 0;
+}
+
+
+/* connect socket_fd in secondary process to the primary process's socket */
+int
+eal_mp_sync_connect_to_primary(const char *path)
+{
+	struct sockaddr_un addr;
+	socklen_t sockaddr_len;
+	int socket_fd;
+
+	/* set up a socket */
+	socket_fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+	if (socket_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to create socket!\n");
+		return -1;
+	}
+
+	snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);
+	addr.sun_family = AF_UNIX;
+
+	sockaddr_len = sizeof(struct sockaddr_un);
+
+	if (connect(socket_fd, (struct sockaddr *) &addr, sockaddr_len) == 0)
+		return socket_fd;
+
+	/* if connect failed */
+	close(socket_fd);
+	return -1;
+}
+
+int
+eal_mp_sync_socket_setup(const char *path)
+{
+	int ret, socket_fd;
+	struct sockaddr_un addr;
+	socklen_t sockaddr_len;
+
+	/* set up a socket */
+	socket_fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+	if (socket_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to create socket!\n");
+		return -1;
+	}
+
+	snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);
+	addr.sun_family = AF_UNIX;
+
+	sockaddr_len = sizeof(struct sockaddr_un);
+
+	unlink(addr.sun_path);
+
+	ret = bind(socket_fd, (struct sockaddr *) &addr, sockaddr_len);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "Failed to bind socket: %s!\n", strerror(errno));
+		close(socket_fd);
+		return -1;
+	}
+
+	ret = listen(socket_fd, 50);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "Failed to listen: %s!\n", strerror(errno));
+		close(socket_fd);
+		return -1;
+	}
+
+	return socket_fd;
+}
diff --git a/lib/librte_eal/linuxapp/eal/eal_mp_socket.h b/lib/librte_eal/linuxapp/eal/eal_mp_socket.h
new file mode 100755
index 0000000..2c46969
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_mp_socket.h
@@ -0,0 +1,54 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef EAL_MP_SOCKET_H
+#define EAL_MP_SOCKET_H
+
+/*
+ * Function prototypes for multiprocess sync functions
+ */
+int eal_mp_sync_send_request(int socket, int req);
+int eal_mp_sync_receive_request(int socket);
+int eal_mp_sync_send_fd(int socket, int fd);
+int eal_mp_sync_receive_fd(int socket);
+int eal_mp_sync_send_data(int socket, void *data, int len);
+int eal_mp_sync_receive_data(int socket, void *data, int sz);
+int eal_mp_sync_connect_to_primary(const char *path);
+int eal_mp_sync_socket_setup(const char *path);
+
+#define SOCKET_REQ_USER 0x100
+#define SOCKET_OK 0
+#define SOCKET_FD 1
+#define SOCKET_ERR -1
+
+#endif /* EAL_MP_SOCKET_H */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 53ac725..485fbbe 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -140,23 +140,23 @@ vfio_get_group_fd(int iommu_group_no)
 			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
 			return -1;
 		}
-		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
+		if (eal_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
 			RTE_LOG(ERR, EAL, "  cannot request container fd!\n");
 			close(socket_fd);
 			return -1;
 		}
-		if (vfio_mp_sync_send_request(socket_fd, iommu_group_no) < 0) {
+		if (eal_mp_sync_send_request(socket_fd, iommu_group_no) < 0) {
 			RTE_LOG(ERR, EAL, "  cannot send group number!\n");
 			close(socket_fd);
 			return -1;
 		}
-		ret = vfio_mp_sync_receive_request(socket_fd);
+		ret = eal_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
 			close(socket_fd);
 			return 0;
 		case SOCKET_OK:
-			vfio_group_fd = vfio_mp_sync_receive_fd(socket_fd);
+			vfio_group_fd = eal_mp_sync_receive_fd(socket_fd);
 			/* if we got the fd, return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
@@ -247,19 +247,19 @@ clear_group(int vfio_group_fd)
 		return -1;
 	}
 
-	if (vfio_mp_sync_send_request(socket_fd, SOCKET_CLR_GROUP) < 0) {
+	if (eal_mp_sync_send_request(socket_fd, SOCKET_CLR_GROUP) < 0) {
 		RTE_LOG(ERR, EAL, "  cannot request container fd!\n");
 		close(socket_fd);
 		return -1;
 	}
 
-	if (vfio_mp_sync_send_request(socket_fd, vfio_group_fd) < 0) {
+	if (eal_mp_sync_send_request(socket_fd, vfio_group_fd) < 0) {
 		RTE_LOG(ERR, EAL, "  cannot send group fd!\n");
 		close(socket_fd);
 		return -1;
 	}
 
-	ret = vfio_mp_sync_receive_request(socket_fd);
+	ret = eal_mp_sync_receive_request(socket_fd);
 	switch (ret) {
 	case SOCKET_NO_FD:
 		RTE_LOG(ERR, EAL, "  BAD VFIO group fd!\n");
@@ -628,12 +628,12 @@ vfio_get_container_fd(void)
 			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
 			return -1;
 		}
-		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_CONTAINER) < 0) {
+		if (eal_mp_sync_send_request(socket_fd, SOCKET_REQ_CONTAINER) < 0) {
 			RTE_LOG(ERR, EAL, "  cannot request container fd!\n");
 			close(socket_fd);
 			return -1;
 		}
-		vfio_container_fd = vfio_mp_sync_receive_fd(socket_fd);
+		vfio_container_fd = eal_mp_sync_receive_fd(socket_fd);
 		if (vfio_container_fd < 0) {
 			RTE_LOG(ERR, EAL, "  cannot get container fd!\n");
 			close(socket_fd);
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 5ff63e5..66e7139 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -42,6 +42,8 @@
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 6, 0)
 #include <linux/vfio.h>
 
+#include "eal_mp_socket.h"
+
 #if LINUX_VERSION_CODE < KERNEL_VERSION(3, 10, 0)
 #define RTE_PCI_MSIX_TABLE_BIR    0x7
 #define RTE_PCI_MSIX_TABLE_OFFSET 0xfffffff8
@@ -119,15 +121,6 @@ struct vfio_iommu_spapr_tce_info {
 #define VFIO_MAX_GROUPS 64
 
 /*
- * Function prototypes for VFIO multiprocess sync functions
- */
-int vfio_mp_sync_send_request(int socket, int req);
-int vfio_mp_sync_receive_request(int socket);
-int vfio_mp_sync_send_fd(int socket, int fd);
-int vfio_mp_sync_receive_fd(int socket);
-int vfio_mp_sync_connect_to_primary(void);
-
-/*
  * we don't need to store device fd's anywhere since they can be obtained from
  * the group fd via an ioctl() call.
  */
@@ -209,13 +202,12 @@ int pci_vfio_enable(void);
 int pci_vfio_is_enabled(void);
 
 int vfio_mp_sync_setup(void);
+int vfio_mp_sync_connect_to_primary(void);
 
-#define SOCKET_REQ_CONTAINER 0x100
-#define SOCKET_REQ_GROUP 0x200
-#define SOCKET_CLR_GROUP 0x300
-#define SOCKET_OK 0x0
-#define SOCKET_NO_FD 0x1
-#define SOCKET_ERR 0xFF
+#define SOCKET_REQ_CONTAINER (SOCKET_REQ_USER + 0)
+#define SOCKET_REQ_GROUP (SOCKET_REQ_USER + 1)
+#define SOCKET_CLR_GROUP (SOCKET_REQ_USER + 2)
+#define SOCKET_NO_FD (SOCKET_REQ_USER + 3)
 
 #define VFIO_PRESENT
 #endif /* kernel version */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio_mp_sync.c b/lib/librte_eal/linuxapp/eal/eal_vfio_mp_sync.c
index 7e8095c..17b0539 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio_mp_sync.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio_mp_sync.c
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -66,21 +66,10 @@
 
 #ifdef VFIO_PRESENT
 
-#define SOCKET_PATH_FMT "%s/.%s_mp_socket"
-#define CMSGLEN (CMSG_LEN(sizeof(int)))
-#define FD_TO_CMSGHDR(fd, chdr) \
-		do {\
-			(chdr).cmsg_len = CMSGLEN;\
-			(chdr).cmsg_level = SOL_SOCKET;\
-			(chdr).cmsg_type = SCM_RIGHTS;\
-			memcpy((chdr).__cmsg_data, &(fd), sizeof(fd));\
-		} while (0)
-#define CMSGHDR_TO_FD(chdr, fd) \
-			memcpy(&(fd), (chdr).__cmsg_data, sizeof(fd))
-
-static pthread_t socket_thread;
-static int mp_socket_fd;
+#define SOCKET_PATH_FMT "%s/.%s_mp_vfio_socket"
 
+static pthread_t vfio_socket_thread;
+static int mp_vfio_socket_fd;
 
 /* get socket path (/var/run if root, $HOME otherwise) */
 static void
@@ -111,156 +100,6 @@ get_socket_path(char *buffer, int bufsz)
  * in case of any error, socket is closed.
  */
 
-/* send a request, return -1 on error */
-int
-vfio_mp_sync_send_request(int socket, int req)
-{
-	struct msghdr hdr;
-	struct iovec iov;
-	int buf;
-	int ret;
-
-	memset(&hdr, 0, sizeof(hdr));
-
-	buf = req;
-
-	hdr.msg_iov = &iov;
-	hdr.msg_iovlen = 1;
-	iov.iov_base = (char *) &buf;
-	iov.iov_len = sizeof(buf);
-
-	ret = sendmsg(socket, &hdr, 0);
-	if (ret < 0)
-		return -1;
-	return 0;
-}
-
-/* receive a request and return it */
-int
-vfio_mp_sync_receive_request(int socket)
-{
-	int buf;
-	struct msghdr hdr;
-	struct iovec iov;
-	int ret, req;
-
-	memset(&hdr, 0, sizeof(hdr));
-
-	buf = SOCKET_ERR;
-
-	hdr.msg_iov = &iov;
-	hdr.msg_iovlen = 1;
-	iov.iov_base = (char *) &buf;
-	iov.iov_len = sizeof(buf);
-
-	ret = recvmsg(socket, &hdr, 0);
-	if (ret < 0)
-		return -1;
-
-	req = buf;
-
-	return req;
-}
-
-/* send OK in message, fd in control message */
-int
-vfio_mp_sync_send_fd(int socket, int fd)
-{
-	int buf;
-	struct msghdr hdr;
-	struct cmsghdr *chdr;
-	char chdr_buf[CMSGLEN];
-	struct iovec iov;
-	int ret;
-
-	chdr = (struct cmsghdr *) chdr_buf;
-	memset(chdr, 0, sizeof(chdr_buf));
-	memset(&hdr, 0, sizeof(hdr));
-
-	hdr.msg_iov = &iov;
-	hdr.msg_iovlen = 1;
-	iov.iov_base = (char *) &buf;
-	iov.iov_len = sizeof(buf);
-	hdr.msg_control = chdr;
-	hdr.msg_controllen = CMSGLEN;
-
-	buf = SOCKET_OK;
-	FD_TO_CMSGHDR(fd, *chdr);
-
-	ret = sendmsg(socket, &hdr, 0);
-	if (ret < 0)
-		return -1;
-	return 0;
-}
-
-/* receive OK in message, fd in control message */
-int
-vfio_mp_sync_receive_fd(int socket)
-{
-	int buf;
-	struct msghdr hdr;
-	struct cmsghdr *chdr;
-	char chdr_buf[CMSGLEN];
-	struct iovec iov;
-	int ret, req, fd;
-
-	buf = SOCKET_ERR;
-
-	chdr = (struct cmsghdr *) chdr_buf;
-	memset(chdr, 0, sizeof(chdr_buf));
-	memset(&hdr, 0, sizeof(hdr));
-
-	hdr.msg_iov = &iov;
-	hdr.msg_iovlen = 1;
-	iov.iov_base = (char *) &buf;
-	iov.iov_len = sizeof(buf);
-	hdr.msg_control = chdr;
-	hdr.msg_controllen = CMSGLEN;
-
-	ret = recvmsg(socket, &hdr, 0);
-	if (ret < 0)
-		return -1;
-
-	req = buf;
-
-	if (req != SOCKET_OK)
-		return -1;
-
-	CMSGHDR_TO_FD(*chdr, fd);
-
-	return fd;
-}
-
-/* connect socket_fd in secondary process to the primary process's socket */
-int
-vfio_mp_sync_connect_to_primary(void)
-{
-	struct sockaddr_un addr;
-	socklen_t sockaddr_len;
-	int socket_fd;
-
-	/* set up a socket */
-	socket_fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
-	if (socket_fd < 0) {
-		RTE_LOG(ERR, EAL, "Failed to create socket!\n");
-		return -1;
-	}
-
-	get_socket_path(addr.sun_path, sizeof(addr.sun_path));
-	addr.sun_family = AF_UNIX;
-
-	sockaddr_len = sizeof(struct sockaddr_un);
-
-	if (connect(socket_fd, (struct sockaddr *) &addr, sockaddr_len) == 0)
-		return socket_fd;
-
-	/* if connect failed */
-	close(socket_fd);
-	return -1;
-}
-
-
-
 /*
  * socket listening thread for primary process
  */
@@ -276,7 +115,7 @@ vfio_mp_sync_thread(void __rte_unused * arg)
 		socklen_t sockaddr_len = sizeof(addr);
 
 		/* this is a blocking call */
-		conn_sock = accept(mp_socket_fd, (struct sockaddr *) &addr,
+		conn_sock = accept(mp_vfio_socket_fd, (struct sockaddr *) &addr,
 				&sockaddr_len);
 
 		/* just restart on error */
@@ -292,20 +131,20 @@ vfio_mp_sync_thread(void __rte_unused * arg)
 			RTE_LOG(WARNING, EAL, "Cannot set SO_LINGER option "
 					"on listen socket (%s)\n", strerror(errno));
 
-		ret = vfio_mp_sync_receive_request(conn_sock);
+		ret = eal_mp_sync_receive_request(conn_sock);
 
 		switch (ret) {
 		case SOCKET_REQ_CONTAINER:
 			fd = vfio_get_container_fd();
 			if (fd < 0)
-				vfio_mp_sync_send_request(conn_sock, SOCKET_ERR);
+				eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
 			else
-				vfio_mp_sync_send_fd(conn_sock, fd);
+				eal_mp_sync_send_fd(conn_sock, fd);
 			close(fd);
 			break;
 		case SOCKET_REQ_GROUP:
 			/* wait for group number */
-			vfio_data = vfio_mp_sync_receive_request(conn_sock);
+			vfio_data = eal_mp_sync_receive_request(conn_sock);
 			if (vfio_data < 0) {
 				close(conn_sock);
 				continue;
@@ -314,19 +153,19 @@ vfio_mp_sync_thread(void __rte_unused * arg)
 			fd = vfio_get_group_fd(vfio_data);
 
 			if (fd < 0)
-				vfio_mp_sync_send_request(conn_sock, SOCKET_ERR);
+				eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
 			/* if VFIO group exists but isn't bound to VFIO driver */
 			else if (fd == 0)
-				vfio_mp_sync_send_request(conn_sock, SOCKET_NO_FD);
+				eal_mp_sync_send_request(conn_sock, SOCKET_NO_FD);
 			/* if group exists and is bound to VFIO driver */
 			else {
-				vfio_mp_sync_send_request(conn_sock, SOCKET_OK);
-				vfio_mp_sync_send_fd(conn_sock, fd);
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				eal_mp_sync_send_fd(conn_sock, fd);
 			}
 			break;
 		case SOCKET_CLR_GROUP:
 			/* wait for group fd */
-			vfio_data = vfio_mp_sync_receive_request(conn_sock);
+			vfio_data = eal_mp_sync_receive_request(conn_sock);
 			if (vfio_data < 0) {
 				close(conn_sock);
 				continue;
@@ -335,12 +174,12 @@ vfio_mp_sync_thread(void __rte_unused * arg)
 			ret = clear_group(vfio_data);
 
 			if (ret < 0)
-				vfio_mp_sync_send_request(conn_sock, SOCKET_NO_FD);
+				eal_mp_sync_send_request(conn_sock, SOCKET_NO_FD);
 			else
-				vfio_mp_sync_send_request(conn_sock, SOCKET_OK);
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
 			break;
 		default:
-			vfio_mp_sync_send_request(conn_sock, SOCKET_ERR);
+			eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
 			break;
 		}
 		close(conn_sock);
@@ -350,42 +189,32 @@ vfio_mp_sync_thread(void __rte_unused * arg)
 static int
 vfio_mp_sync_socket_setup(void)
 {
-	int ret, socket_fd;
-	struct sockaddr_un addr;
-	socklen_t sockaddr_len;
+	int socket_fd;
+	char path[PATH_MAX];
+
+	get_socket_path(path, sizeof(path));
 
-	/* set up a socket */
-	socket_fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+	socket_fd = eal_mp_sync_socket_setup(path);
 	if (socket_fd < 0) {
 		RTE_LOG(ERR, EAL, "Failed to create socket!\n");
 		return -1;
 	}
 
-	get_socket_path(addr.sun_path, sizeof(addr.sun_path));
-	addr.sun_family = AF_UNIX;
-
-	sockaddr_len = sizeof(struct sockaddr_un);
-
-	unlink(addr.sun_path);
+	/* save the socket in local configuration */
+	mp_vfio_socket_fd = socket_fd;
 
-	ret = bind(socket_fd, (struct sockaddr *) &addr, sockaddr_len);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "Failed to bind socket: %s!\n", strerror(errno));
-		close(socket_fd);
-		return -1;
-	}
+	return 0;
+}
 
-	ret = listen(socket_fd, 50);
-	if (ret) {
-		RTE_LOG(ERR, EAL, "Failed to listen: %s!\n", strerror(errno));
-		close(socket_fd);
-		return -1;
-	}
+/* connect socket_fd in secondary process to the primary process's socket */
+int
+vfio_mp_sync_connect_to_primary(void)
+{
+	char path[PATH_MAX];
 
-	/* save the socket in local configuration */
-	mp_socket_fd = socket_fd;
+	get_socket_path(path, sizeof(path));
 
-	return 0;
+	return eal_mp_sync_connect_to_primary(path);
 }
 
 /*
@@ -402,18 +231,18 @@ vfio_mp_sync_setup(void)
 		return -1;
 	}
 
-	ret = pthread_create(&socket_thread, NULL,
+	ret = pthread_create(&vfio_socket_thread, NULL,
 			vfio_mp_sync_thread, NULL);
 	if (ret) {
 		RTE_LOG(ERR, EAL,
 			"Failed to create thread for communication with secondary processes!\n");
-		close(mp_socket_fd);
+		close(mp_vfio_socket_fd);
 		return -1;
 	}
 
 	/* Set thread_name for aid in debugging. */
 	snprintf(thread_name, RTE_MAX_THREAD_NAME_LEN, "vfio-sync");
-	ret = rte_thread_setname(socket_thread, thread_name);
+	ret = rte_thread_setname(vfio_socket_thread, thread_name);
 	if (ret)
 		RTE_LOG(DEBUG, EAL,
 			"Failed to set thread name for secondary processes!\n");
-- 
2.7.4


* [dpdk-dev] [RFC 2/4] eal: enable experimental dlopen()-based secondary process support
  2017-05-19 16:39 [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Anatoly Burakov
  2017-05-19 16:39 ` [dpdk-dev] [RFC 1/4] vfio: refactor sockets into separate files Anatoly Burakov
@ 2017-05-19 16:39 ` Anatoly Burakov
  2017-05-19 17:39   ` Stephen Hemminger
  2017-05-19 16:39 ` [dpdk-dev] [RFC 3/4] apps: enable new secondary process support in multiprocess apps Anatoly Burakov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 7+ messages in thread
From: Anatoly Burakov @ 2017-05-19 16:39 UTC (permalink / raw)
  To: dev

The primary process forks itself into a new process that serves as the
basis for forking secondary processes. A secondary process then connects
to this forked process over a socket and triggers a fork.
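
Besides the fork request, the patch below also hands the secondary's
stdin, stdout, stderr and log file descriptors to the forked process over
the same socket. The eal_mp_sync_send_fd()/eal_mp_sync_receive_fd()
helpers are not part of this patch, so the following is only a minimal
sketch of the usual Linux mechanism (SCM_RIGHTS ancillary data); the
names send_fd(), recv_fd() and fd_passing_demo() are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: send one fd over a Unix socket. 0 on success. */
static int
send_fd(int sock, int fd)
{
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	union fd_ctrl ctrl;
	char byte = 0;

	memset(&msg, 0, sizeof(msg));
	memset(&ctrl, 0, sizeof(ctrl));
	iov.iov_base = &byte;	/* at least one byte of real data is sent */
	iov.iov_len = sizeof(byte);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Hypothetical helper: receive one fd. Returns the new fd or -1. */
static int
recv_fd(int sock)
{
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	union fd_ctrl ctrl;
	char byte;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&ctrl, 0, sizeof(ctrl));
	iov.iov_base = &byte;
	iov.iov_len = sizeof(byte);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
			cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass the write end of a pipe across a socketpair and write through it. */
static int
fd_passing_demo(void)
{
	int sv[2], p[2], fd;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0 || pipe(p) < 0)
		return -1;
	if (send_fd(sv[0], p[1]) < 0)
		return -1;
	fd = recv_fd(sv[1]);
	if (fd < 0)
		return -1;
	/* the received fd is a duplicate of p[1]; data shows up on p[0] */
	if (write(fd, "x", 1) != 1 || read(p[0], &c, 1) != 1)
		return -1;
	close(fd); close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
	return c == 'x' ? 0 : -1;
}
```

The received descriptor refers to the same open file as the one sent,
which is what lets the forked process dup2() it over its own stdio.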

The newly forked secondary dlopen()'s the original secondary process
binary and runs its main() again. In the meantime, the original secondary
process waits until the forked secondary dies, and then exits.
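
The dlopen()/dlsym() step can be sketched on its own as follows. This is
an illustration rather than the patch code: run_entry_point() is a
hypothetical name, and the binary at `path` is assumed to be built as a
position-independent executable that exports main (e.g. with -rdynamic).
Newer glibc releases are reported to refuse dlopen() on
position-independent executables, so this may only work with toolchains
contemporary with this RFC. Older glibc versions also need -ldl at link
time.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Signature of the entry point we expect the binary to export. */
typedef int (*entry_fn)(int, char **);

/*
 * Hypothetical helper: load the binary at `path`, look up "main" and
 * call it. Returns main()'s return value, or -1 if loading or symbol
 * lookup fails (so a main() that itself returns -1 is ambiguous).
 */
static int
run_entry_point(const char *path, int argc, char **argv)
{
	entry_fn entry;
	const char *err;
	void *handle;
	int ret;

	handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
	if (handle == NULL) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return -1;
	}

	dlerror();	/* clear any stale error before dlsym() */
	*(void **)(&entry) = dlsym(handle, "main");
	err = dlerror();
	if (err != NULL) {
		fprintf(stderr, "dlsym failed: %s\n", err);
		dlclose(handle);
		return -1;
	}

	ret = entry(argc, argv);
	dlclose(handle);
	return ret;
}
```

Because the caller is a fork of the primary, everything the primary had
mapped is already present at the same addresses when main() runs.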

"Waiting until the secondary dies" is achieved through a blocking flock()
call: once it succeeds, the secondary is dead, since all locks are
released at process exit.
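
The flock()-based wait can be tried in isolation with the sketch below.
It is a simplified stand-in for the patch code: wait_for_lock_holder()
and flock_wait_demo() are hypothetical names, the lock file lives in /tmp
rather than the run directory, and a pipe is used so that the waiter only
starts once the lock is known to be held.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until whoever holds an exclusive flock()
 * on `path` releases it, e.g. by exiting. Returns 0 on success.
 */
static int
wait_for_lock_holder(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	/* blocks while another open file description holds LOCK_EX */
	if (flock(fd, LOCK_EX) < 0) {
		close(fd);
		return -1;
	}
	close(fd);	/* closing the fd also drops our own lock */
	return 0;
}

/* Child takes the lock and exits after a second; parent waits on it. */
static int
flock_wait_demo(void)
{
	char path[] = "/tmp/flock_demo_XXXXXX";
	int sync_pipe[2];
	char ready;
	pid_t pid;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	close(fd);	/* each side opens its own file description */

	if (pipe(sync_pipe) < 0) {
		unlink(path);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(sync_pipe[0]);
		close(sync_pipe[1]);
		unlink(path);
		return -1;
	}
	if (pid == 0) {
		close(sync_pipe[0]);
		fd = open(path, O_RDWR);
		if (fd < 0 || flock(fd, LOCK_EX) < 0)
			_exit(1);
		/* tell the parent that the lock is now held */
		if (write(sync_pipe[1], "r", 1) != 1)
			_exit(1);
		sleep(1);
		_exit(0);	/* process exit releases the lock */
	}

	close(sync_pipe[1]);
	if (read(sync_pipe[0], &ready, 1) == 1)
		ret = wait_for_lock_holder(path);
	close(sync_pipe[0]);
	waitpid(pid, NULL, 0);
	unlink(path);
	return ret;
}
```

Note that flock() locks belong to the open file description, so the
waiter has to open() the file itself rather than reuse a descriptor
inherited from the lock holder.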

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/linuxapp/eal/Makefile           |   2 +
 lib/librte_eal/linuxapp/eal/eal.c              | 105 +++++-
 lib/librte_eal/linuxapp/eal/eal_mp.h           |  54 +++
 lib/librte_eal/linuxapp/eal/eal_mp_primary.c   | 477 +++++++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/eal_mp_secondary.c | 301 ++++++++++++++++
 5 files changed, 933 insertions(+), 6 deletions(-)
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp.h
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_primary.c
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_mp_secondary.c

diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 24aab8d..f0ec382 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -63,6 +63,8 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_log.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_mp_socket.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio_mp_sync.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_mp_secondary.c
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_mp_primary.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_pci.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_pci_uio.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_pci_vfio.c
diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 7c78f2d..3d646b9 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -80,6 +80,7 @@
 #include <malloc_heap.h>
 
 #include "eal_private.h"
+#include "eal_mp.h"
 #include "eal_thread.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
@@ -480,7 +481,7 @@ eal_parse_vfio_intr(const char *mode)
 
 /* Parse the arguments for --log-level only */
 static void
-eal_log_level_parse(int argc, char **argv)
+eal_early_parse(int argc, char **argv)
 {
 	int opt;
 	char **argvopt;
@@ -504,6 +505,9 @@ eal_log_level_parse(int argc, char **argv)
 		ret = (opt == OPT_LOG_LEVEL_NUM) ?
 			eal_parse_common_option(opt, optarg, &internal_config) : 0;
 
+		if (opt == OPT_PROC_TYPE_NUM)
+			ret = eal_parse_common_option(opt, optarg, &internal_config);
+
 		/* common parser is not happy */
 		if (ret < 0)
 			break;
@@ -745,6 +749,68 @@ static void rte_eal_init_alert(const char *msg)
 	RTE_LOG(ERR, EAL, "%s\n", msg);
 }
 
+/* secondary needs to pass parameters to the app */
+static int
+secondary_first_run(int argc, char **argv) {
+	rte_srand(rte_rdtsc());
+
+	if (eal_secondary_init(argc, argv) < 0)
+		rte_panic("Cannot init secondary\n");
+
+	RTE_LOG(INFO, EAL, "Secondary preliminary init\n");
+
+	return 0;
+}
+
+static int
+secondary_second_run(void) {
+	unsigned i;
+	int ret;
+	char thread_name[RTE_MAX_THREAD_NAME_LEN];
+
+	eal_thread_init_master(rte_config.master_lcore);
+
+	RTE_LCORE_FOREACH_SLAVE(i) {
+
+		/*
+		 * create communication pipes between master thread
+		 * and children
+		 */
+		if (pipe(lcore_config[i].pipe_master2slave) < 0)
+			rte_panic("Cannot create pipe\n");
+		if (pipe(lcore_config[i].pipe_slave2master) < 0)
+			rte_panic("Cannot create pipe\n");
+
+		lcore_config[i].state = WAIT;
+
+		/* create a thread for each lcore */
+		ret = pthread_create(&lcore_config[i].thread_id, NULL,
+		             eal_thread_loop, NULL);
+		if (ret != 0)
+			rte_panic("Cannot create thread\n");
+
+		/* Set thread_name for aid in debugging. */
+		snprintf(thread_name, RTE_MAX_THREAD_NAME_LEN,
+		    "lcore-slave-%d", i);
+		ret = rte_thread_setname(lcore_config[i].thread_id,
+		                thread_name);
+		if (ret != 0)
+			RTE_LOG(DEBUG, EAL,
+			    "Cannot set name for lcore thread\n");
+	}
+
+	/*
+	 * Launch a dummy function on all slave lcores, so that master lcore
+	 * knows they are all ready when this function returns.
+	 */
+	rte_eal_mp_remote_launch(sync_func, NULL, SKIP_MASTER);
+	rte_eal_mp_wait_lcore();
+
+	RTE_LOG(INFO, EAL, "Secondary finished init\n");
+
+	return 0;
+}
+
 /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
@@ -752,6 +818,7 @@ rte_eal_init(int argc, char **argv)
 	int i, fctret, ret;
 	pthread_t thread_id;
 	static rte_atomic32_t run_once = RTE_ATOMIC32_INIT(0);
+	static rte_atomic32_t run_once_secondary = RTE_ATOMIC32_INIT(0);
 	const char *logid;
 	char cpuset[RTE_CPU_AFFINITY_STR_LEN];
 	char thread_name[RTE_MAX_THREAD_NAME_LEN];
@@ -763,10 +830,28 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	/* short-circuit running secondary processes */
 	if (!rte_atomic32_test_and_set(&run_once)) {
-		rte_eal_init_alert("already called initialization.");
-		rte_errno = EALREADY;
-		return -1;
+		if (internal_config.process_type == RTE_PROC_SECONDARY) {
+			if (!rte_atomic32_test_and_set(&run_once_secondary)) {
+				RTE_LOG(ERR, EAL, "Can't run secondary init twice!\n");
+				rte_errno = EALREADY;
+				return -1;
+			} else {
+
+				/* parse EAL arguments before running secondary process */
+				fctret = eal_parse_args(argc, argv);
+				if (fctret < 0)
+					exit(1);
+
+				secondary_second_run();
+				return fctret;
+			}
+		} else {
+			rte_eal_init_alert("already called initialization.");
+			rte_errno = EALREADY;
+			return -1;
+		}
 	}
 
 	logid = strrchr(argv[0], '/');
@@ -776,8 +861,8 @@ rte_eal_init(int argc, char **argv)
 
 	eal_reset_internal_config(&internal_config);
 
-	/* set log level as early as possible */
-	eal_log_level_parse(argc, argv);
+	/* set log level and process type as early as possible */
+	eal_early_parse(argc, argv);
 
 	if (rte_eal_cpu_init() < 0) {
 		rte_eal_init_alert("Cannot detect lcores.");
@@ -785,6 +870,11 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	if (internal_config.process_type == RTE_PROC_SECONDARY) {
+		secondary_first_run(argc, argv);
+		rte_exit(EXIT_SUCCESS, "Done");
+	}
+
 	fctret = eal_parse_args(argc, argv);
 	if (fctret < 0) {
 		rte_eal_init_alert("Invalid 'command line' arguments.");
@@ -939,6 +1029,9 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	if (eal_secondary_mp_sync_setup() < 0)
+		RTE_LOG(WARNING, EAL, "Couldn't start multiprocess socket!\n");
+
 	rte_eal_mcfg_complete();
 
 	return fctret;
diff --git a/lib/librte_eal/linuxapp/eal/eal_mp.h b/lib/librte_eal/linuxapp/eal/eal_mp.h
new file mode 100755
index 0000000..43ff9df
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_mp.h
@@ -0,0 +1,54 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef EAL_MP_H
+#define EAL_MP_H
+
+#include "eal_mp_socket.h"
+
+#define SOCKET_REQ_FORK (SOCKET_REQ_USER + 0)
+#define SOCKET_REQ_STDIN (SOCKET_REQ_USER + 1)
+#define SOCKET_REQ_STDOUT (SOCKET_REQ_USER + 2)
+#define SOCKET_REQ_STDERR (SOCKET_REQ_USER + 3)
+#define SOCKET_REQ_LOGFILE (SOCKET_REQ_USER + 4)
+#define SOCKET_REQ_PATH (SOCKET_REQ_USER + 5)
+#define SOCKET_REQ_ARGC (SOCKET_REQ_USER + 6)
+#define SOCKET_REQ_ARGV (SOCKET_REQ_USER + 7)
+
+int eal_secondary_mp_sync_setup(void);
+int eal_secondary_mp_sync_connect_to_primary(void);
+void eal_secondary_mp_sync_get_socket_path(char *buffer, int bufsz);
+
+int eal_secondary_init(int argc, char **argv);
+
+#endif // EAL_MP_H
diff --git a/lib/librte_eal/linuxapp/eal/eal_mp_primary.c b/lib/librte_eal/linuxapp/eal/eal_mp_primary.c
new file mode 100755
index 0000000..32fce69
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_mp_primary.c
@@ -0,0 +1,477 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <limits.h>
+#include <pthread.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <dlfcn.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <time.h>
+#include <sys/file.h>
+#include <linux/version.h>
+#include <signal.h>
+#include <sys/prctl.h>
+
+/* sys/un.h with __USE_MISC uses strlen, which is unsafe */
+#ifdef __USE_MISC
+#define REMOVED_USE_MISC
+#undef __USE_MISC
+#endif
+#include <sys/un.h>
+/* make sure we redefine __USE_MISC only if we undefined it above */
+#ifdef REMOVED_USE_MISC
+#define __USE_MISC
+#undef REMOVED_USE_MISC
+#endif
+
+#include <rte_log.h>
+#include <rte_pci.h>
+#include <rte_eal_memconfig.h>
+#include <rte_malloc.h>
+#include <rte_cycles.h>
+
+#include "eal_filesystem.h"
+#include "eal_pci_init.h"
+#include "eal_thread.h"
+#include "eal_mp.h"
+
+#define SOCKET_PATH_FMT "%s/.%s_mp_secondary_socket"
+#define LOCKFILE_PATH_FMT "%s/.%s_secondary_lock_%s"
+
+static
+const char *get_run_dir(void) {
+	const char *dir = "/var/run";
+	const char *home_dir = getenv("HOME");
+
+	if (getuid() != 0 && home_dir != NULL)
+		dir = home_dir;
+	return dir;
+}
+
+static
+void get_rand_str(char *str, int sz) {
+	char charset[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
+	for (int i = 0; i < sz - 1; i++) {
+		// this does not give us *true* randomness but it's good enough
+		int idx = rand() % (sizeof(charset) - 1);
+		str[i] = charset[idx];
+	}
+	str[sz - 1] = '\0';
+}
+
+/* we need to know its length */
+static
+int get_lock_file_path(char *str, int sz) {
+	char rand_str[16];
+
+	get_rand_str(rand_str, 16);
+
+	return snprintf(str, sz, LOCKFILE_PATH_FMT, get_run_dir(),
+	                internal_config.hugefile_prefix, rand_str);
+}
+
+static int secondary_mp_socket_fd;
+
+/* get socket path (/var/run if root, $HOME otherwise) */
+void
+eal_secondary_mp_sync_get_socket_path(char *buffer, int bufsz)
+{
+	/* use current prefix as file path */
+	snprintf(buffer, bufsz, SOCKET_PATH_FMT, get_run_dir(),
+	        internal_config.hugefile_prefix);
+}
+
+static void *
+secondary_wait_thread(void * arg)
+{
+	int status;
+	pid_t pid = *(pid_t*) arg;
+
+	RTE_LOG(INFO, EAL, "Secondary process %i started\n", pid);
+
+	waitpid(pid, &status, 0);
+
+	RTE_LOG(INFO, EAL, "Secondary process %i died\n", pid);
+
+	/* TODO: notify others this one has died? */
+
+	pthread_exit(0);
+	return 0;
+}
+
+/* handle parent exit */
+static void
+parent_exit(int __rte_unused sig)
+{
+	exit(0);
+}
+
+/*
+ * data flow for socket comm protocol:
+ *
+ * in case of any error, socket is closed.
+ */
+
+static int
+secondary_mp_sync_socket_setup(void)
+{
+	int socket_fd;
+	char path[PATH_MAX];
+
+	/* build socket path from the run directory and hugefile prefix */
+	eal_secondary_mp_sync_get_socket_path(path, sizeof(path));
+
+	socket_fd = eal_mp_sync_socket_setup(path);
+	if (socket_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to create socket!\n");
+		return -1;
+	}
+
+	/* save the socket in local configuration */
+	secondary_mp_socket_fd = socket_fd;
+
+	return 0;
+}
+
+/* connect socket_fd in secondary process to the primary process's socket */
+int
+eal_secondary_mp_sync_connect_to_primary(void)
+{
+	char path[PATH_MAX];
+
+	eal_secondary_mp_sync_get_socket_path(path, sizeof(path));
+
+	return eal_mp_sync_connect_to_primary(path);
+}
+
+/*
+ * listen for sockets
+ */
+static void
+secondary_mp_sync_listener(void)
+{
+	int cur_argv = 0, argc = 0;
+	char *argv[4096] = {0};
+	char *str;
+	int ret;
+
+	/* get seed from tsc */
+	srand((unsigned) rte_rdtsc());
+
+	if (secondary_mp_sync_socket_setup() < 0) {
+		RTE_LOG(ERR, EAL, "Failed to set up local socket!\n");
+		return;
+	}
+
+	/* wait for requests on the socket and the IPC */
+	for (;;) {
+		int conn_sock;
+		struct sockaddr_un addr;
+		socklen_t sockaddr_len = sizeof(addr);
+
+		/* this is a blocking call */
+		conn_sock = accept(secondary_mp_socket_fd, (struct sockaddr *) &addr,
+		        &sockaddr_len);
+
+		/* just restart on error */
+		if (conn_sock == -1)
+			continue;
+
+		/* set socket to linger after close */
+		struct linger l;
+		l.l_onoff = 1;
+		l.l_linger = 60;
+
+		if (setsockopt(conn_sock, SOL_SOCKET, SO_LINGER, &l, sizeof(l)) < 0)
+			RTE_LOG(WARNING, EAL, "Cannot set SO_LINGER option "
+			        "on listen socket (%s)\n", strerror(errno));
+
+		bool done = false;
+		bool is_fork = false;
+
+		/* forked process data */
+		char path[PATH_MAX] = "";
+		char lockfile[PATH_MAX] = "";
+		int sp_stdin = STDIN_FILENO;
+		int sp_stdout = STDOUT_FILENO;
+		int sp_stderr = STDERR_FILENO;
+		int sp_log = rte_logs.file == NULL ? sp_stderr : fileno(rte_logs.file);
+
+		while (!done) {
+			ret = eal_mp_sync_receive_request(conn_sock);
+
+			switch (ret) {
+			case SOCKET_REQ_STDIN:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				ret = eal_mp_sync_receive_fd(conn_sock);
+				if (ret < 0) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				sp_stdin = ret;
+				break;
+			case SOCKET_REQ_STDOUT:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				ret = eal_mp_sync_receive_fd(conn_sock);
+				if (ret < 0) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				sp_stdout = ret;
+				break;
+			case SOCKET_REQ_STDERR:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				ret = eal_mp_sync_receive_fd(conn_sock);
+				if (ret < 0) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				sp_stderr = ret;
+				break;
+			case SOCKET_REQ_LOGFILE:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				ret = eal_mp_sync_receive_fd(conn_sock);
+				if (ret < 0) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				sp_log = ret;
+				break;
+			case SOCKET_REQ_PATH:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				/* receive path */
+				if (eal_mp_sync_receive_data(conn_sock, path, PATH_MAX)) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				break;
+			case SOCKET_REQ_ARGC:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				/* receive argc */
+				if (eal_mp_sync_receive_data(conn_sock, &argc, sizeof(argc))) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				break;
+			case SOCKET_REQ_ARGV:
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				/* 4K should be enough for everyone */
+				str = (char*) calloc(1024, 4);
+
+				/* receive argv */
+				if (eal_mp_sync_receive_data(conn_sock, str, 4096)) {
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				argv[cur_argv++] = str;
+
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+				break;
+			case SOCKET_REQ_FORK:
+				/*
+				 * before we can fork, we need to make sure that argc matches
+				 * cur_argv
+				 */
+				if (argc != cur_argv) {
+					RTE_LOG(ERR, EAL, "Argument number mismatch\n");
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+				eal_mp_sync_send_request(conn_sock, SOCKET_OK);
+
+				// get_lock_file_path returns length, not total bytes
+				int len = get_lock_file_path(lockfile, sizeof(lockfile)) + 1;
+
+				pid_t id = fork();
+
+				if (id < 0) {
+					RTE_LOG(ERR, EAL, "Failed to fork\n");
+					eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+					done = true;
+					break;
+				}
+
+				/* we're going to be forked, so stop the loop */
+				done = true;
+				if (id == 0) {
+					/* pointer to exported function */
+					void (*exported)(int argc, char** argv);
+					is_fork = true;
+
+					/* touch the file */
+					int fd = open(lockfile, O_CREAT | O_EXCL | O_RDWR, 0600);
+					flock(fd, LOCK_EX);
+
+					/* set up file descriptors */
+					dup2(sp_stdin, STDIN_FILENO);
+					dup2(sp_stdout, STDOUT_FILENO);
+					dup2(sp_stderr, STDERR_FILENO);
+					rte_openlog_stream(fdopen(sp_log, "w+"));
+
+					/* send lockfile path */
+					eal_mp_sync_send_data(conn_sock, lockfile, len);
+
+					/* close the sockets */
+					close(secondary_mp_socket_fd);
+					close(conn_sock);
+
+					/* let the magic happen! */
+					void *h = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
+					if (!h) {
+						RTE_LOG(ERR, EAL, "Couldn't dlopen: %s\n", dlerror());
+						exit(1);
+					}
+					dlerror();
+					*(void **) (&exported) = dlsym(h, "main");
+
+					char *err = dlerror();
+					if (err) {
+						RTE_LOG(ERR, EAL, "Couldn't dlsym: %s\n", err);
+						exit(1);
+					}
+					/* prepare to run EAL second time */
+					internal_config.process_type = RTE_PROC_SECONDARY;
+					rte_eal_get_configuration()->process_type = RTE_PROC_SECONDARY;
+
+					(*exported)(argc, argv);
+
+					dlclose(h);
+				} else {
+					char thread_name[RTE_MAX_THREAD_NAME_LEN];
+
+					/* clean up after ourselves */
+					close(sp_stdin);
+					close(sp_stdout);
+					close(sp_stderr);
+					close(sp_log);
+					for (int i = 0; i < argc; i++) {
+						free(argv[i]);
+					}
+
+					pthread_t thread;
+					/* run a new thread waiting for child's termination */
+
+					/* TODO: store id somewhere, as this is dangerous */
+					ret = pthread_create(&thread, NULL,
+					        secondary_wait_thread, &id);
+					if (ret) {
+						RTE_LOG(ERR, EAL,
+						    "Failed to create thread for communication with secondary processes!\n");
+					}
+
+					/* Set thread_name for aid in debugging. */
+					snprintf(thread_name, RTE_MAX_THREAD_NAME_LEN, "secondary_%u", id);
+					RTE_LOG(DEBUG, EAL, "Attempting to create thread %s\n", thread_name);
+					ret = rte_thread_setname(thread, thread_name);
+					if (ret)
+						RTE_LOG(DEBUG, EAL,
+						    "Failed to set thread name for secondary processes!\n");
+
+				}
+				break;
+			default:
+				eal_mp_sync_send_request(conn_sock, SOCKET_ERR);
+				done = true;
+				break;
+			}
+		}
+		/* forked process probably closed this already but we don't care */
+		close(conn_sock);
+		if (is_fork) {
+			/* fork executable doesn't need to listen on socket */
+			return;
+		}
+	}
+}
+
+/*
+ * set up a local socket and tell it to listen for incoming connections
+ */
+int
+eal_secondary_mp_sync_setup(void)
+{
+	/* pre-fork instead of creating a listening thread */
+	pid_t id = fork();
+	if (id < 0) {
+		RTE_LOG(ERR, EAL, "Failed to fork!\n");
+		return -1;
+	} else if (id == 0) {
+		/* child process */
+
+		if (prctl(PR_SET_PDEATHSIG, SIGUSR1, 0, 0, 0, 0) != 0)
+			RTE_LOG(ERR, EAL, "Can't register parent exit handler\n");
+		else {
+			struct sigaction act;
+			memset(&act, 0 , sizeof(act));
+			act.sa_handler = parent_exit;
+			if (sigaction(SIGUSR1, &act, NULL) != 0)
+				RTE_LOG(ERR, EAL, "Can't register parent exit signal callback\n");
+		}
+
+		secondary_mp_sync_listener();
+		rte_exit(EXIT_SUCCESS, "Secondary process finished\n");
+	} else {
+		/* what if socket setup fails? do we care? */
+		RTE_LOG(INFO, EAL, "Fork successful\n");
+	}
+
+	return 0;
+}
diff --git a/lib/librte_eal/linuxapp/eal/eal_mp_secondary.c b/lib/librte_eal/linuxapp/eal/eal_mp_secondary.c
new file mode 100755
index 0000000..5ebfbc9
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_mp_secondary.c
@@ -0,0 +1,301 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <pthread.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <dlfcn.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <time.h>
+#include <sys/file.h>
+#include <linux/version.h>
+
+/* sys/un.h with __USE_MISC uses strlen, which is unsafe */
+#ifdef __USE_MISC
+#define REMOVED_USE_MISC
+#undef __USE_MISC
+#endif
+#include <sys/un.h>
+/* make sure we redefine __USE_MISC only if we undefined it above */
+#ifdef REMOVED_USE_MISC
+#define __USE_MISC
+#undef REMOVED_USE_MISC
+#endif
+
+#include <rte_log.h>
+#include <rte_pci.h>
+#include <rte_eal_memconfig.h>
+#include <rte_malloc.h>
+#include <rte_common.h>
+
+#include "eal_filesystem.h"
+#include "eal_pci_init.h"
+#include "eal_thread.h"
+#include "eal_mp.h"
+
+#define EXPORT __attribute__((visibility("default")))
+
+#define SELF_PATH "/proc/self/exe"
+
+enum fd_type {
+	STDIN,
+	STDOUT,
+	STDERR,
+	LOGFILE
+};
+
+/* connect socket_fd in secondary process to the primary process's socket */
+static
+int connect_to_primary(void)
+{
+	struct sockaddr_un addr;
+	socklen_t sockaddr_len;
+	int socket_fd;
+
+	/* set up a socket */
+	socket_fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+	if (socket_fd < 0) {
+		RTE_LOG(INFO, EAL, "Failed to create socket!\n");
+		return -1;
+	}
+
+	eal_secondary_mp_sync_get_socket_path(addr.sun_path, sizeof(addr.sun_path));
+	addr.sun_family = AF_UNIX;
+
+	sockaddr_len = sizeof(struct sockaddr_un);
+
+	if (connect(socket_fd, (struct sockaddr *) &addr, sockaddr_len) == 0)
+		return socket_fd;
+
+	/* if connect failed */
+	close(socket_fd);
+	return -1;
+}
+
+static int
+sendpath(int socket) {
+	char path[PATH_MAX] = "";
+	int len = readlink(SELF_PATH, path, PATH_MAX - 1) + 1;
+	if (len <= 0) {
+		RTE_LOG(INFO, EAL, "Failed to get current path\n");
+		return -1;
+	}
+	if (eal_mp_sync_send_request(socket, SOCKET_REQ_PATH)) {
+		RTE_LOG(INFO, EAL, "Couldn't send path request\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get path ack\n");
+		return -1;
+	}
+	if (eal_mp_sync_send_data(socket, path, len)) {
+		RTE_LOG(INFO, EAL, "Couldn't send path\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get path ack\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+sendargs(int socket, int argc, char **argv) {
+	if (argc == 0) {
+		/* no arguments to be sent */
+		return 0;
+	}
+	if (eal_mp_sync_send_request(socket, SOCKET_REQ_ARGC)) {
+		RTE_LOG(INFO, EAL, "Couldn't send argc request\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get argc ack\n");
+		return -1;
+	}
+	if (eal_mp_sync_send_data(socket, &argc, sizeof(argc))) {
+		RTE_LOG(INFO, EAL, "Couldn't send argc\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get argc ack\n");
+		return -1;
+	}
+
+	for (int i = 0; i < argc; i++) {
+		char *str = argv[i];
+		int len = strlen(str) + 1;
+
+		if (eal_mp_sync_send_request(socket, SOCKET_REQ_ARGV)) {
+			RTE_LOG(INFO, EAL, "Couldn't send argv request\n");
+			return -1;
+		}
+		if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+			RTE_LOG(INFO, EAL, "Didn't get argv request ack\n");
+			return -1;
+		}
+		if (eal_mp_sync_send_data(socket, str, len)) {
+			RTE_LOG(INFO, EAL, "Couldn't send argv\n");
+			return -1;
+		}
+		if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+			RTE_LOG(INFO, EAL, "Didn't get argv ack\n");
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int
+sendfd(int socket, enum fd_type t) {
+	int fd, req;
+	switch (t) {
+		case STDIN:
+			fd = STDIN_FILENO;
+			req = SOCKET_REQ_STDIN;
+			break;
+		case STDOUT:
+			fd = STDOUT_FILENO;
+			req = SOCKET_REQ_STDOUT;
+			break;
+		case STDERR:
+			fd = STDERR_FILENO;
+			req = SOCKET_REQ_STDERR;
+			break;
+		case LOGFILE:
+			fd = rte_logs.file == NULL ? STDERR_FILENO : fileno(rte_logs.file);
+			req = SOCKET_REQ_LOGFILE;
+			break;
+	}
+	if (eal_mp_sync_send_request(socket, req)) {
+		RTE_LOG(INFO, EAL, "Couldn't send fd request\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get fd request ack\n");
+		return -1;
+	}
+	if (eal_mp_sync_send_fd(socket, fd)) {
+		RTE_LOG(INFO, EAL, "Couldn't send fd\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get fd ack\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+reqfork(int socket, char *str, int sz) {
+	if (eal_mp_sync_send_request(socket, SOCKET_REQ_FORK)) {
+		RTE_LOG(INFO, EAL, "Couldn't send fork request\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_request(socket) != SOCKET_OK) {
+		RTE_LOG(INFO, EAL, "Didn't get fork request ack\n");
+		return -1;
+	}
+	if (eal_mp_sync_receive_data(socket, str, sz)) {
+		RTE_LOG(INFO, EAL, "Couldn't receive lockfile path\n");
+		return -1;
+	}
+	return 0;
+}
+
+int eal_secondary_init(int argc, char **argv) {
+	RTE_LOG(INFO, EAL, "Secondary process initializing\n");
+
+	char path[4096];
+
+	int sock = connect_to_primary();
+
+	if (sock < 0) {
+		RTE_LOG(INFO, EAL, "Couldn't connect to primary\n");
+		return -1;
+	}
+	if (sendpath(sock)) {
+		RTE_LOG(INFO, EAL, "Sending path failed\n");
+		return -1;
+	}
+	if (sendfd(sock, STDIN)) {
+		RTE_LOG(INFO, EAL, "Sending stdin failed\n");
+		return -1;
+	}
+	if (sendfd(sock, STDOUT)) {
+		RTE_LOG(INFO, EAL, "Sending stdout failed\n");
+		return -1;
+	}
+	if (sendfd(sock, STDERR)) {
+		RTE_LOG(INFO, EAL, "Sending stderr failed\n");
+		return -1;
+	}
+	if (sendfd(sock, LOGFILE)) {
+		RTE_LOG(INFO, EAL, "Sending logfile failed\n");
+		return -1;
+	}
+	if (sendargs(sock, argc, argv)) {
+		RTE_LOG(INFO, EAL, "Sending args failed\n");
+		return -1;
+	}
+	if (reqfork(sock, path, sizeof(path))) {
+		RTE_LOG(INFO, EAL, "Fork failed\n");
+		return -1;
+	}
+	close(sock);
+
+	/* at this point, the file is locked by the forked secondary */
+
+	int fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		RTE_LOG(INFO, EAL, "open failed for %s: %s\n", path, strerror(errno));
+		return -1;
+	}
+
+	// blocking call - if succeeded, that means secondary is dead
+	if (flock(fd, LOCK_EX) < 0) {
+		RTE_LOG(INFO, EAL, "Lock failed: %s\n", strerror(errno));
+		return -1;
+	} else {
+		RTE_LOG(INFO, EAL, "Secondary process exited\n");
+		close(fd);
+		unlink(path);
+	}
+
+	return 0;
+}
-- 
2.7.4


* [dpdk-dev] [RFC 3/4] apps: enable new secondary process support in multiprocess apps
  2017-05-19 16:39 [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Anatoly Burakov
  2017-05-19 16:39 ` [dpdk-dev] [RFC 1/4] vfio: refactor sockets into separate files Anatoly Burakov
  2017-05-19 16:39 ` [dpdk-dev] [RFC 2/4] eal: enable experimental dlopen()-based secondary process support Anatoly Burakov
@ 2017-05-19 16:39 ` Anatoly Burakov
  2017-05-19 16:39 ` [dpdk-dev] [RFC 4/4] mk: default to compiling shared libraries Anatoly Burakov
  2017-07-10 10:18 ` [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Burakov, Anatoly
  4 siblings, 0 replies; 7+ messages in thread
From: Anatoly Burakov @ 2017-05-19 16:39 UTC (permalink / raw)
  To: dev

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 examples/multi_process/client_server_mp/mp_client/Makefile | 2 +-
 examples/multi_process/simple_mp/Makefile                  | 2 +-
 examples/multi_process/symmetric_mp/Makefile               | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/examples/multi_process/client_server_mp/mp_client/Makefile b/examples/multi_process/client_server_mp/mp_client/Makefile
index 2688fed..081832e 100644
--- a/examples/multi_process/client_server_mp/mp_client/Makefile
+++ b/examples/multi_process/client_server_mp/mp_client/Makefile
@@ -42,7 +42,7 @@ APP = mp_client
 # all source are stored in SRCS-y
 SRCS-y := client.c
 
-CFLAGS += $(WERROR_FLAGS) -O3
+CFLAGS += $(WERROR_FLAGS) -O3 -rdynamic -fPIC -pie
 CFLAGS += -I$(SRCDIR)/../shared
 
 include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/multi_process/simple_mp/Makefile b/examples/multi_process/simple_mp/Makefile
index 31ec0c8..f74f456 100644
--- a/examples/multi_process/simple_mp/Makefile
+++ b/examples/multi_process/simple_mp/Makefile
@@ -44,7 +44,7 @@ APP = simple_mp
 # all source are stored in SRCS-y
 SRCS-y := main.c mp_commands.c
 
-CFLAGS += -O3
+CFLAGS += -O3 -rdynamic -fPIC -pie
 CFLAGS += $(WERROR_FLAGS)
 
 include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/multi_process/symmetric_mp/Makefile b/examples/multi_process/symmetric_mp/Makefile
index c789f3c..e9e9bf8 100644
--- a/examples/multi_process/symmetric_mp/Makefile
+++ b/examples/multi_process/symmetric_mp/Makefile
@@ -44,7 +44,7 @@ APP = symmetric_mp
 # all source are stored in SRCS-y
 SRCS-y := main.c
 
-CFLAGS += -O3
+CFLAGS += -O3 -rdynamic -fPIC -pie
 CFLAGS += $(WERROR_FLAGS)
 
 include $(RTE_SDK)/mk/rte.extapp.mk
-- 
2.7.4


* [dpdk-dev] [RFC 4/4] mk: default to compiling shared libraries
  2017-05-19 16:39 [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Anatoly Burakov
                   ` (2 preceding siblings ...)
  2017-05-19 16:39 ` [dpdk-dev] [RFC 3/4] apps: enable new secondary process support in multiprocess apps Anatoly Burakov
@ 2017-05-19 16:39 ` Anatoly Burakov
  2017-07-10 10:18 ` [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Burakov, Anatoly
  4 siblings, 0 replies; 7+ messages in thread
From: Anatoly Burakov @ 2017-05-19 16:39 UTC (permalink / raw)
  To: dev

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/config/common_base b/config/common_base
index 8907bea..1c088e9 100644
--- a/config/common_base
+++ b/config/common_base
@@ -67,7 +67,7 @@ CONFIG_RTE_ARCH_STRICT_ALIGN=n
 #
 # Compile to share library
 #
-CONFIG_RTE_BUILD_SHARED_LIB=n
+CONFIG_RTE_BUILD_SHARED_LIB=y
 
 #
 # Use newest code breaking previous ABI
-- 
2.7.4


* Re: [dpdk-dev] [RFC 2/4] eal: enable experimental dlopen()-based secondary process support
  2017-05-19 16:39 ` [dpdk-dev] [RFC 2/4] eal: enable experimental dlopen()-based secondary process support Anatoly Burakov
@ 2017-05-19 17:39   ` Stephen Hemminger
  0 siblings, 0 replies; 7+ messages in thread
From: Stephen Hemminger @ 2017-05-19 17:39 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev

On Fri, 19 May 2017 17:39:44 +0100
Anatoly Burakov <anatoly.burakov@intel.com> wrote:

> This new forked secondary dlopen()'s the original secondary process
> binary and runs main() again. In the meantime, the original secondary
> process waits until this new forked secondary dies, and exits.


You don't have to use a lock file. Just using a pipe for process
standard input and detecting close on process exit is often simpler.
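
A minimal sketch of the pipe approach, for illustration only. The names
run_and_wait_for_exit() and example_child() are hypothetical and not part of
the patch. It also assumes the waiter forks the process it watches; in the
RFC the Forked Secondary is forked from the Forked Primary, so the write end
of the pipe would presumably have to be handed over the local socket like the
other descriptors.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the secondary's real work. */
static int
example_child(void)
{
	return 0;
}

/*
 * Fork a child that runs child_fn(), then block until the child exits.
 * Exit is detected by EOF on a pipe whose write end only the child holds.
 * Returns 0 once EOF is seen, -1 on error.
 */
static int
run_and_wait_for_exit(int (*child_fn)(void))
{
	int fds[2];
	ssize_t n;
	pid_t pid;
	char c;

	if (pipe(fds) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* child: keep the write end open for its whole lifetime */
		close(fds[0]);
		_exit(child_fn());
	}

	/* parent: drop our copy of the write end, or EOF never arrives */
	close(fds[1]);

	/* read() returns 0 only after every write end has been closed */
	do {
		n = read(fds[0], &c, 1);
	} while (n > 0 || (n < 0 && errno == EINTR));

	close(fds[0]);
	waitpid(pid, NULL, 0);	/* reap the child, it is ours in this sketch */

	return n == 0 ? 0 : -1;
}
```

Calling run_and_wait_for_exit(example_child) blocks until the child is gone.
No file is left behind if the waiter itself is killed, which is one thing the
lock-file scheme has to clean up with unlink().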

> +static
> +const char *get_run_dir(void) {
> +	const char *dir = "/var/run";
> +	const char *home_dir = getenv("HOME");
> +
> +	if (getuid() != 0 && home_dir != NULL)
> +		dir = home_dir;
> +	return dir;
> +}
> +
> +static
> +void get_rand_str(char *str, int sz) {
> +	char charset[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
> +	for (int i = 0; i < sz - 1; i++) {
> +		// this does not give us *true* randomness but it's good enough
> +		int idx = rand() % sizeof(charset);
> +		str[i] = charset[idx];
> +	}
> +	str[sz - 1] = '\0';
> +}
> +
> +/* we need to know its length */
> +static
> +int get_lock_file_path(char *str, int sz) {
> +	char rand_str[16];
> +
> +	get_rand_str(rand_str, 16);
> +
> +	return snprintf(str, sz, LOCKFILE_PATH_FMT, get_run_dir(),
> +	                internal_config.hugefile_prefix, rand_str);
> +}
> +

Why reinvent all the stuff in mkstemp and friends?
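
Something along these lines would do, as a rough sketch. The helper name
create_lock_file() and the ".%s_lock_XXXXXX" name format are made up for
illustration; LOCKFILE_PATH_FMT from the patch is not reproduced here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical replacement for get_rand_str() + get_lock_file_path():
 * build "<dir>/.<prefix>_lock_XXXXXX" and let mkstemp() fill in the unique
 * suffix. mkstemp() creates the file with O_EXCL and mode 0600.
 * Returns an open fd, or -1 on failure; on success path holds the new name.
 */
static int
create_lock_file(char *path, size_t sz, const char *dir, const char *prefix)
{
	int len = snprintf(path, sz, "%s/.%s_lock_XXXXXX", dir, prefix);

	if (len < 0 || (size_t)len >= sz)
		return -1;	/* formatting failed or the path was truncated */

	return mkstemp(path);
}
```

The caller gets both the descriptor and the final path back, so the
hand-rolled charset and the rand() call are no longer needed.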

Also don't use C++ style comments in DPDK code.


* Re: [dpdk-dev] [RFC 0/4] DPDK multiprocess rework
  2017-05-19 16:39 [dpdk-dev] [RFC 0/4] DPDK multiprocess rework Anatoly Burakov
                   ` (3 preceding siblings ...)
  2017-05-19 16:39 ` [dpdk-dev] [RFC 4/4] mk: default to compiling shared libraries Anatoly Burakov
@ 2017-07-10 10:18 ` Burakov, Anatoly
  4 siblings, 0 replies; 7+ messages in thread
From: Burakov, Anatoly @ 2017-07-10 10:18 UTC (permalink / raw)
  To: Burakov, Anatoly, dev

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Anatoly Burakov
> Sent: Friday, May 19, 2017 5:40 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [RFC 0/4] DPDK multiprocess rework
> 
> This is a proof-of-concept proposal for rework of how DPDK secondary
> processes work. While the code has some limitations, it works well enough to
> demonstrate the concept, and it can successfully run all existing multiprocess
> applications.
> 
> Current problems with DPDK secondary processes:
> * ASLR interferes with mappings
>   * "Fixed" by disabling ASLR, but not really a solution
> * Secondary process may map things into where we want to map shared memory
>   * _Almost_ works with --base-virtaddr, but unreliable and tedious
> * Function pointers don't work (so e.g. hash library is broken)
> 
> Proposed solution:
> 
> Instead of running secondary process and mapping resources from primary
> process, the following is done:
> 0) compile all applications as position-independent executables, compile DPDK
>    as a shared library
> 1) fork() from primary process
> 2) dlopen() secondary process binary
> 3) use dlsym() to find entry point
> 4) run the application code while having all resources already mapped
> 
> Benefits:
> * No more ASLR issues
> * No need for --base-virtaddr
> * Function pointers from primary process will work in secondaries
>   * Hash library (and any other library that uses function pointers internally)
>     will work correctly in multi-process scenario
>   * ethdev data can be moved to shared memory
>   * Primary process interrupt callbacks can be run by secondary process
> * More secure as all applications are compiled as position-independent binaries
>   (default on Fedora)
> 
> Potential drawbacks (that we could think of):
> * Kind of a hack
> * Puts some code restrictions on secondary processes
>   * Anything happening before EAL init will be run twice
> * Some use cases are no longer possible (attaching to a dead primary)
> * May impact binaries compiled to use a lot (kilobytes) of thread-local
>   storage[1]
> * Likely wouldn't work for static linking
> 
> There are also a number of issues that need to be resolved, but those are
> implementation details and are out of scope for RFC.
> 
> What is explicitly out of scope:
> * Fixing interrupts in secondary processes
> * Fixing hotplug in secondary processes
> 
> These currently do not work in secondary processes, and this proposal does
> nothing to change that. They are better addressed by a dedicated
> EAL-internal IPC proposal.
> 
> 
> Technical nitty-gritty
> 
> Things quickly get confusing, so terminology:
> - Original Primary is normal DPDK primary process
> - Forked Primary is a "clean slate" primary process, from which all secondary
>   processes will be forked (threads and fork don't mix well, so fork is done
>   after all the hugepage and PCI data is mapped, but before all the threads
>   are spun up)
> - Original Secondary is a process that connects to Forked Primary, sends some
>   data and triggers a fork
> - Forked Secondary is the _actual_ secondary process (forked from Forked
>   Primary)
> 
> Timeline:
> - Original Primary starts
> - Forked Primary is forked from Original Primary
> - Original Secondary starts and connects to Forked Primary
> - Forked Primary forks into Forked Secondary
> - Original Secondary waits until Forked Secondary dies
> 
> During EAL init, Original Primary does a fork() to form a Forked Primary - a
> "clean slate" starting point for secondary processes. Forked Primary opens a
> local socket (a-la VFIO) and starts listening for incoming connections.
> 
> Original Secondary process connects to Forked Primary, sends stdout/log
> fd's, command line parameters, etc. over local socket, and sits around waiting
> for Forked Secondary to die, then exits (Original Secondary does _not_ map
> anything or do any EAL init, it rte_exit()'s from inside rte_eal_init()). Forked
> Secondary process then executes main(), passing all command-line
> arguments, and execution of secondary process resumes.
> 
> Why pre-fork and not pthread like VFIO?
> 
> Pthreads and fork() don't mix well, because fork() stops the world (all
> threads disappear, leaving behind thread stacks, locks and possibly
> inconsistent state of both app data and system libraries). On the other hand,
> forking from single-threaded context is safe. Current implementation
> doesn't _exactly_ fork from a single-threaded context, but this can be fixed
> later by rearranging EAL init.
> 
> [1]: https://www.redhat.com/archives/phil-list/2003-February/msg00077.html
> 

Ping

