* [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
@ 2021-03-17 20:25 Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
` (6 more replies)
0 siblings, 7 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-17 20:25 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova,
Ilya Maximets
TL;DR:
Managing socket files is too much fun. :) And here is how this
could be improved:
https://github.com/igsilya/one-socket
https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker.rst
In particular for the vhost-user case.
In modern virtualization setups there are tens or hundreds of different
socket files for different purposes: sockets to manage various
daemons, vhost-user sockets for various virtual devices, memif sockets
for memif network interfaces, and so on.
In order to make things work in containerized environments, software
systems have to share these sockets with containers. In most cases
this sharing is implemented as a shared directory mounted inside the
container, because socket files can be re-created at runtime or may not
even exist at container startup, for example if they are created by the
application inside the container.
Even more configuration tricks are required to share some sockets
between different containers, and not only with the host, e.g. to
create service chains.
And some housekeeping is usually required for applications in case the
socket server terminated abnormally and socket files were left on the
file system:
"failed to bind to vhu: Address already in use; remove it and try again"
Additionally, all applications (system and user's!) should follow
naming conventions and place socket files in a particular location on
the file system to make things work.
In particular, this applies to vhost-user sockets.
This patch-set aims to eliminate most of the inconveniences by
leveraging an infrastructure service provided by a SocketPair Broker.
*SocketPair Broker* is a daemon that mediates establishment of direct
socket-based connections between clients.
*One Socket* is a reference implementation of a SocketPair Broker
Daemon, SocketPair Broker Protocol and a helper library for client
applications (libspbroker):
https://github.com/igsilya/one-socket
It's fully functional, but not yet ready for production use. See the
'todo' section in README.rst in the one-socket repository.
Basically, it's a daemon that listens on a single unix socket
(broker socket) and accepts clients. A client connects and provides a
'key'. If two clients provide the same 'key', the One Socket daemon
creates a pair of connected sockets with socketpair() and sends one
side of this pair to each of the two clients. At this point the two
clients have a direct communication channel between them; they
disconnect from the broker and continue to operate and communicate normally.
A workflow overview with pictures is available here:
https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker.rst
Communication with the broker is based on the SocketPair Broker Protocol:
https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker-proto-spec.rst
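For illustration, the client side could look roughly like the sketch
below. It is built around the libspbroker helper as it is used later in
this series (sp_broker_get_pair() in patch 4/4); the exact signature and
blocking behaviour should be taken from the library itself.

  /* Sketch only: assumes sp_broker_get_pair() behaves as in patch 4/4. */
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <socketpair-broker/helper.h>

  int
  connect_via_broker(const char *broker_path, const char *key, bool server)
  {
      char *err = NULL;
      int fd;

      /* Ask the broker to pair us with the peer that presents the same key. */
      fd = sp_broker_get_pair(broker_path, key, server, &err);
      if (fd < 0) {
          fprintf(stderr, "broker pairing failed: %s\n", err);
          free(err);
          return -1;
      }
      /* 'fd' is one side of a socketpair() connected directly to the peer;
       * from here on it is used like any other unix socket connection. */
      return fd;
  }
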
This patch-set extends the vhost library, vhost PMD and virtio-user PMD
to support SocketPair Broker as one of the connection methods.
Usage example:
# Starting a One Socket daemon with socket './one.socket':
$ ONE_SOCKET_PATH=./one.socket ./one-socket
# Starting testpmd #1 with virtio-user device in server mode:
$ dpdk-testpmd --no-pci --in-memory --single-file-segments \
--vdev="net_virtio_user,path=./one.socket,broker-key=MY-KEY,server=1"
# Starting testpmd #2 with vhost pmd in client mode:
$ dpdk-testpmd --no-pci --in-memory --single-file-segments \
--vdev="eth_vhost0,iface=./one.socket,broker-key=MY-KEY,client=1"
Details on how to build and install One Socket are in README.rst in the
one-socket repository.
The DPDK side is the first step of the implementation. Once available
in DPDK, support can easily be added to Open vSwitch, VPP or any other
DPDK-based application. The same support could be added to QEMU (a
volunteer for this part has been found).
Since the SocketPair Broker is completely independent of the purpose the
connection will be used for, it has the potential to unify and replace
all one-to-one unix socket connections on a host. This one persistent
broker socket can be passed to any container and used by any
application, greatly simplifying system management.
Any feedback or suggestions on any component of this solution, including
this patch-set, the One Socket daemon, the SocketPair Broker Protocol or
the libspbroker library, are very welcome.
*Note* about the patch set:
The first patch in the series is a *bug* fix, so it should be considered
even outside of this series. It fixes unregistering of the listening
socket, which never happens in the current code.
The virtio-user part of the series heavily depends on this bug fix,
since the broker connection, unlike the listening socket, does not
persist and will generate lots of interrupts if not unregistered.
Ilya Maximets (4):
net/virtio: fix interrupt unregistering for listening socket
vhost: add support for SocketPair Broker
net/vhost: add support for SocketPair Broker
net/virtio: add support for SocketPair Broker
doc/guides/nics/vhost.rst | 5 +
doc/guides/nics/virtio.rst | 5 +
doc/guides/prog_guide/vhost_lib.rst | 10 +
drivers/net/vhost/rte_eth_vhost.c | 42 ++-
drivers/net/virtio/meson.build | 6 +
drivers/net/virtio/virtio_user/vhost_user.c | 122 ++++++++-
.../net/virtio/virtio_user/virtio_user_dev.c | 142 +++++++---
.../net/virtio/virtio_user/virtio_user_dev.h | 6 +-
drivers/net/virtio/virtio_user_ethdev.c | 30 ++-
lib/librte_vhost/meson.build | 7 +
lib/librte_vhost/rte_vhost.h | 1 +
lib/librte_vhost/socket.c | 245 ++++++++++++++++--
12 files changed, 550 insertions(+), 71 deletions(-)
--
2.26.2
* [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
@ 2021-03-17 20:25 ` Ilya Maximets
2021-03-25 8:32 ` Maxime Coquelin
2021-03-17 20:25 ` [dpdk-dev] [RFC 2/4] vhost: add support for SocketPair Broker Ilya Maximets
` (5 subsequent siblings)
6 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-17 20:25 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova,
Ilya Maximets, stable, Zhiyong Yang
virtio_user_dev_server_reconnect() is typically called from the
interrupt context while checking the link state:
vhost_user_update_link_state()
--> virtio_user_dev_server_reconnect()
Under these conditions, callback unregistering always fails. This means
that listenfd is never unregistered and continues to trigger interrupts.
For example, if a second client tries to connect to the same socket,
the server will receive interrupts endlessly, because it will not accept
the connection while the listen fd stays readable and keeps generating
epoll events.
Fix that by moving the reconfiguration of interrupts out of the
interrupt context to an alarm handler.
'virtio_user_dev_delayed_handler' is renamed to
'virtio_user_dev_delayed_disconnect_handler' to better reflect its
purpose.
Additionally, error logging around interrupt management is improved.
Fixes: bd8f50a45d0f ("net/virtio-user: support server mode")
Cc: stable@dpdk.org
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
CC: Zhiyong Yang <zhiyong.yang@intel.com>
drivers/net/virtio/virtio_user/vhost_user.c | 4 +-
.../net/virtio/virtio_user/virtio_user_dev.c | 70 ++++++++++++++-----
.../net/virtio/virtio_user/virtio_user_dev.h | 2 +-
3 files changed, 57 insertions(+), 19 deletions(-)
diff --git a/drivers/net/virtio/virtio_user/vhost_user.c b/drivers/net/virtio/virtio_user/vhost_user.c
index f8569f6e6f..25f74c625b 100644
--- a/drivers/net/virtio/virtio_user/vhost_user.c
+++ b/drivers/net/virtio/virtio_user/vhost_user.c
@@ -957,7 +957,9 @@ vhost_user_update_link_state(struct virtio_user_dev *dev)
* of interrupt handling, callback cannot be
* unregistered here, set an alarm to do it.
*/
- rte_eal_alarm_set(1, virtio_user_dev_delayed_handler, (void *)dev);
+ rte_eal_alarm_set(1,
+ virtio_user_dev_delayed_disconnect_handler,
+ (void *)dev);
} else {
dev->net_status |= VIRTIO_NET_S_LINK_UP;
}
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.c b/drivers/net/virtio/virtio_user/virtio_user_dev.c
index 1b54d55bd8..68cece42d3 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.c
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.c
@@ -13,6 +13,7 @@
#include <sys/types.h>
#include <sys/stat.h>
+#include <rte_alarm.h>
#include <rte_string_fns.h>
#include <rte_eal_memconfig.h>
@@ -885,7 +886,7 @@ virtio_user_dev_reset_queues_packed(struct rte_eth_dev *eth_dev)
}
void
-virtio_user_dev_delayed_handler(void *param)
+virtio_user_dev_delayed_disconnect_handler(void *param)
{
struct virtio_user_dev *dev = param;
struct rte_eth_dev *eth_dev = &rte_eth_devices[dev->port_id];
@@ -894,14 +895,27 @@ virtio_user_dev_delayed_handler(void *param)
PMD_DRV_LOG(ERR, "interrupt disable failed");
return;
}
- rte_intr_callback_unregister(eth_dev->intr_handle,
- virtio_interrupt_handler, eth_dev);
+ PMD_DRV_LOG(DEBUG, "Unregistering intr fd: %d",
+ eth_dev->intr_handle->fd);
+ if (rte_intr_callback_unregister(eth_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev) != 1)
+ PMD_DRV_LOG(ERR, "interrupt unregister failed");
+
if (dev->is_server) {
if (dev->ops->server_disconnect)
dev->ops->server_disconnect(dev);
+
eth_dev->intr_handle->fd = dev->ops->get_intr_fd(dev);
- rte_intr_callback_register(eth_dev->intr_handle,
- virtio_interrupt_handler, eth_dev);
+
+ PMD_DRV_LOG(DEBUG, "Registering intr fd: %d",
+ eth_dev->intr_handle->fd);
+
+ if (rte_intr_callback_register(eth_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev))
+ PMD_DRV_LOG(ERR, "interrupt register failed");
+
if (rte_intr_enable(eth_dev->intr_handle) < 0) {
PMD_DRV_LOG(ERR, "interrupt enable failed");
return;
@@ -909,6 +923,32 @@ virtio_user_dev_delayed_handler(void *param)
}
}
+static void
+virtio_user_dev_delayed_intr_reconfig_handler(void *param)
+{
+ struct virtio_user_dev *dev = param;
+ struct rte_eth_dev *eth_dev = &rte_eth_devices[dev->port_id];
+
+ PMD_DRV_LOG(DEBUG, "Unregistering intr fd: %d",
+ eth_dev->intr_handle->fd);
+
+ if (rte_intr_callback_unregister(eth_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev) != 1)
+ PMD_DRV_LOG(ERR, "interrupt unregister failed");
+
+ eth_dev->intr_handle->fd = dev->ops->get_intr_fd(dev);
+
+ PMD_DRV_LOG(DEBUG, "Registering intr fd: %d", eth_dev->intr_handle->fd);
+
+ if (rte_intr_callback_register(eth_dev->intr_handle,
+ virtio_interrupt_handler, eth_dev))
+ PMD_DRV_LOG(ERR, "interrupt register failed");
+
+ if (rte_intr_enable(eth_dev->intr_handle) < 0)
+ PMD_DRV_LOG(ERR, "interrupt enable failed");
+}
+
int
virtio_user_dev_server_reconnect(struct virtio_user_dev *dev)
{
@@ -974,18 +1014,14 @@ virtio_user_dev_server_reconnect(struct virtio_user_dev *dev)
PMD_DRV_LOG(ERR, "interrupt disable failed");
return -1;
}
- rte_intr_callback_unregister(eth_dev->intr_handle,
- virtio_interrupt_handler,
- eth_dev);
-
- eth_dev->intr_handle->fd = dev->ops->get_intr_fd(dev);
- rte_intr_callback_register(eth_dev->intr_handle,
- virtio_interrupt_handler, eth_dev);
-
- if (rte_intr_enable(eth_dev->intr_handle) < 0) {
- PMD_DRV_LOG(ERR, "interrupt enable failed");
- return -1;
- }
+ /*
+ * This function can be called from the interrupt handler, so
+ * we can't unregister interrupt handler here. Setting
+ * alarm to do that later.
+ */
+ rte_eal_alarm_set(1,
+ virtio_user_dev_delayed_intr_reconfig_handler,
+ (void *)dev);
}
PMD_INIT_LOG(NOTICE, "server mode virtio-user reconnection succeeds!");
return 0;
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.h b/drivers/net/virtio/virtio_user/virtio_user_dev.h
index 8a62f7ea79..15d177ccd2 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.h
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.h
@@ -78,7 +78,7 @@ uint8_t virtio_user_handle_mq(struct virtio_user_dev *dev, uint16_t q_pairs);
int virtio_user_dev_set_status(struct virtio_user_dev *dev, uint8_t status);
int virtio_user_dev_update_status(struct virtio_user_dev *dev);
int virtio_user_dev_update_link_state(struct virtio_user_dev *dev);
-void virtio_user_dev_delayed_handler(void *param);
+void virtio_user_dev_delayed_disconnect_handler(void *param);
int virtio_user_dev_server_reconnect(struct virtio_user_dev *dev);
extern const char * const virtio_user_backend_strings[];
#endif
--
2.26.2
* [dpdk-dev] [RFC 2/4] vhost: add support for SocketPair Broker
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
@ 2021-03-17 20:25 ` Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 3/4] net/vhost: " Ilya Maximets
` (4 subsequent siblings)
6 siblings, 0 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-17 20:25 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova,
Ilya Maximets
Add a new flag, RTE_VHOST_USER_SOCKETPAIR_BROKER, to indicate that the
provided socket path is a path to a SocketPair Broker socket in the
following format:
'/path/to/socketpair/broker.sock,broker-key=<key>'
This format is chosen to avoid lots of code changes and refactoring
inside the vhost library, mainly because the vhost library treats the
socket path as a unique device identifier.
'<key>' is a broker key that will be used by the broker to identify
the two clients that need to be paired together, i.e. the vhost device
will be connected with a client that provided the same key.
libspbroker is needed for the build.
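For illustration, registering a vhost device through the broker could
look like the sketch below (placeholder socket path and key; callback
registration and other setup omitted):

  /* Sketch only: placeholder path and key; normally the application
   * would also register vhost callbacks before starting the driver. */
  #include <rte_vhost.h>

  static int
  register_vhost_via_broker(void)
  {
      uint64_t flags = RTE_VHOST_USER_CLIENT |
                       RTE_VHOST_USER_SOCKETPAIR_BROKER;
      const char *path = "/var/run/broker.sock,broker-key=MY-KEY";

      if (rte_vhost_driver_register(path, flags) != 0)
          return -1;
      /* RTE_VHOST_USER_NO_RECONNECT is incompatible with the broker flag. */
      return rte_vhost_driver_start(path);
  }
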
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
doc/guides/prog_guide/vhost_lib.rst | 10 ++
lib/librte_vhost/meson.build | 7 +
lib/librte_vhost/rte_vhost.h | 1 +
lib/librte_vhost/socket.c | 245 +++++++++++++++++++++++++---
4 files changed, 237 insertions(+), 26 deletions(-)
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index dc29229167..f0f0d3fde7 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -118,6 +118,16 @@ The following is an overview of some key Vhost API functions:
It is disabled by default.
+ - ``RTE_VHOST_USER_SOCKETPAIR_BROKER``
+
+ Enabling of this flag makes vhost library to treat socket ``path`` as a
+ path to SocketPair Broker. In this case ``path`` should include
+ ``,broker-key=<key>`` after the actual broker's socket path. ``<key>``
+ will be used as a broker key, so it will be able to connect 2 processes
+ that provided the same key.
+
+ Incompatible with ``RTE_VHOST_USER_NO_RECONNECT``.
+
* ``rte_vhost_driver_set_features(path, features)``
This function sets the feature bits the vhost-user driver supports. The
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index 6185deab33..3292edcb52 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -15,6 +15,13 @@ elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
cflags += '-DVHOST_ICC_UNROLL_PRAGMA'
endif
+
+spbroker_dep = dependency('spbroker', required: false)
+if spbroker_dep.found()
+ dpdk_conf.set('RTE_LIBRTE_VHOST_SOCKETPAIR_BROKER', 1)
+ ext_deps += spbroker_dep
+endif
+
dpdk_conf.set('RTE_LIBRTE_VHOST_POSTCOPY',
cc.has_header('linux/userfaultfd.h'))
cflags += '-fno-strict-aliasing'
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 010f160869..87662c9f7f 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
/* support only linear buffers (no chained mbufs) */
#define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
#define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
+#define RTE_VHOST_USER_SOCKETPAIR_BROKER (1ULL << 8)
/* Features. */
#ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 0169d36481..f0a1c9044c 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -16,6 +16,10 @@
#include <fcntl.h>
#include <pthread.h>
+#ifdef RTE_LIBRTE_VHOST_SOCKETPAIR_BROKER
+#include <socketpair-broker/helper.h>
+#endif
+
#include <rte_log.h>
#include "fd_man.h"
@@ -33,9 +37,11 @@ struct vhost_user_socket {
struct vhost_user_connection_list conn_list;
pthread_mutex_t conn_mutex;
char *path;
+ char *broker_key;
int socket_fd;
struct sockaddr_un un;
bool is_server;
+ bool is_broker;
bool reconnect;
bool iommu_support;
bool use_builtin_virtio_net;
@@ -81,7 +87,8 @@ struct vhost_user {
static void vhost_user_server_new_connection(int fd, void *data, int *remove);
static void vhost_user_read_cb(int fd, void *dat, int *remove);
static int create_unix_socket(struct vhost_user_socket *vsocket);
-static int vhost_user_start_client(struct vhost_user_socket *vsocket);
+static int recreate_unix_socket(struct vhost_user_socket *vsocket);
+static int vhost_user_start(struct vhost_user_socket *vsocket);
static struct vhost_user vhost_user = {
.fdset = {
@@ -283,6 +290,81 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
close(fd);
}
+#ifdef RTE_LIBRTE_VHOST_SOCKETPAIR_BROKER
+static int
+vhost_user_broker_msg_handler(int fd, struct vhost_user_socket *vsocket)
+{
+ int peer_fd;
+ char *err;
+
+ peer_fd = sp_broker_receive_set_pair(fd, &err);
+ if (peer_fd < 0) {
+ VHOST_LOG_CONFIG(ERR,
+ "failed to receive SP_BROKER_SET_PAIR on fd %d: %s\n",
+ fd, err);
+ free(err);
+ return -1;
+ }
+
+ VHOST_LOG_CONFIG(INFO, "new vhost user connection is %d\n", peer_fd);
+ vhost_user_add_connection(peer_fd, vsocket);
+ return 0;
+}
+
+static void
+vhost_user_broker_msg_cb(int connfd, void *dat, int *remove)
+{
+ struct vhost_user_socket *vsocket = dat;
+ int ret;
+
+ ret = vhost_user_broker_msg_handler(connfd, vsocket);
+
+ /* Don't need a broker connection anymore. */
+ *remove = 1;
+
+ if (ret < 0) {
+ recreate_unix_socket(vsocket);
+ vhost_user_start(vsocket);
+ }
+}
+
+static int
+vhost_user_start_broker_connection(
+ int fd __rte_unused,
+ struct vhost_user_socket *vsocket __rte_unused)
+{
+ char *err;
+ int ret;
+
+ ret = sp_broker_send_get_pair(fd, vsocket->broker_key,
+ vsocket->is_server, &err);
+ if (ret) {
+ VHOST_LOG_CONFIG(ERR,
+ "failed to send SP_BROKER_GET_PAIR request on fd %d: %s\n",
+ fd, err);
+ free(err);
+ return -1;
+ }
+
+ ret = fdset_add(&vhost_user.fdset, fd, vhost_user_broker_msg_cb,
+ NULL, vsocket);
+ if (ret < 0) {
+ VHOST_LOG_CONFIG(ERR,
+ "failed to add broker fd %d to vhost fdset\n", fd);
+ return -1;
+ }
+ return 0;
+}
+#else
+static int
+vhost_user_start_broker_connection(
+ int fd __rte_unused,
+ struct vhost_user_socket *vsocket __rte_unused)
+{
+ return -1;
+}
+#endif
+
/* call back when there is new vhost-user connection from client */
static void
vhost_user_server_new_connection(int fd, void *dat, int *remove __rte_unused)
@@ -321,7 +403,7 @@ vhost_user_read_cb(int connfd, void *dat, int *remove)
if (vsocket->reconnect) {
create_unix_socket(vsocket);
- vhost_user_start_client(vsocket);
+ vhost_user_start(vsocket);
}
pthread_mutex_lock(&vsocket->conn_mutex);
@@ -337,14 +419,17 @@ create_unix_socket(struct vhost_user_socket *vsocket)
{
int fd;
struct sockaddr_un *un = &vsocket->un;
+ char *broker_key = strstr(vsocket->path, ",broker-key=");
fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0)
return -1;
- VHOST_LOG_CONFIG(INFO, "vhost-user %s: socket created, fd: %d\n",
- vsocket->is_server ? "server" : "client", fd);
+ VHOST_LOG_CONFIG(INFO, "vhost-user %s: %ssocket created, fd: %d\n",
+ vsocket->is_server ? "server" : "client",
+ vsocket->is_broker ? "broker " : "", fd);
- if (!vsocket->is_server && fcntl(fd, F_SETFL, O_NONBLOCK)) {
+ if ((!vsocket->is_server || vsocket->is_broker)
+ && fcntl(fd, F_SETFL, O_NONBLOCK)) {
VHOST_LOG_CONFIG(ERR,
"vhost-user: can't set nonblocking mode for socket, fd: "
"%d (%s)\n", fd, strerror(errno));
@@ -352,12 +437,21 @@ create_unix_socket(struct vhost_user_socket *vsocket)
return -1;
}
+ /* Temporarily limiting the path by the actual path. */
+ if (vsocket->is_broker && broker_key)
+ broker_key[0] = '\0';
+
memset(un, 0, sizeof(*un));
un->sun_family = AF_UNIX;
strncpy(un->sun_path, vsocket->path, sizeof(un->sun_path));
un->sun_path[sizeof(un->sun_path) - 1] = '\0';
vsocket->socket_fd = fd;
+
+ /* Restoring original path. */
+ if (vsocket->is_broker && broker_key)
+ broker_key[0] = ',';
+
return 0;
}
@@ -425,7 +519,8 @@ static struct vhost_user_reconnect_list reconn_list;
static pthread_t reconn_tid;
static int
-vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
+vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz,
+ bool disable_nonblock)
{
int ret, flags;
@@ -433,6 +528,9 @@ vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
if (ret < 0 && errno != EISCONN)
return -1;
+ if (!disable_nonblock)
+ return 0;
+
flags = fcntl(fd, F_GETFL, 0);
if (flags < 0) {
VHOST_LOG_CONFIG(ERR,
@@ -447,8 +545,22 @@ vhost_user_connect_nonblock(int fd, struct sockaddr *un, size_t sz)
return 0;
}
+static int
+recreate_unix_socket(struct vhost_user_socket *vsocket)
+{
+ close(vsocket->socket_fd);
+ if (create_unix_socket(vsocket) < 0) {
+ VHOST_LOG_CONFIG(ERR,
+ "Failed to re-create socket for %s\n",
+ vsocket->un.sun_path);
+ vsocket->socket_fd = -1;
+ return -2;
+ }
+ return 0;
+}
+
static void *
-vhost_user_client_reconnect(void *arg __rte_unused)
+vhost_user_reconnect(void *arg __rte_unused)
{
int ret;
struct vhost_user_reconnect *reconn, *next;
@@ -466,7 +578,8 @@ vhost_user_client_reconnect(void *arg __rte_unused)
ret = vhost_user_connect_nonblock(reconn->fd,
(struct sockaddr *)&reconn->un,
- sizeof(reconn->un));
+ sizeof(reconn->un),
+ !reconn->vsocket->is_broker);
if (ret == -2) {
close(reconn->fd);
VHOST_LOG_CONFIG(ERR,
@@ -478,8 +591,26 @@ vhost_user_client_reconnect(void *arg __rte_unused)
continue;
VHOST_LOG_CONFIG(INFO,
- "%s: connected\n", reconn->vsocket->path);
- vhost_user_add_connection(reconn->fd, reconn->vsocket);
+ "%s: connected\n",
+ reconn->vsocket->un.sun_path);
+
+ if (reconn->vsocket->is_broker) {
+ struct vhost_user_socket *vsocket;
+
+ vsocket = reconn->vsocket;
+ if (vhost_user_start_broker_connection(
+ reconn->fd, vsocket)) {
+ if (recreate_unix_socket(vsocket)) {
+ goto remove_fd;
+ } else {
+ reconn->fd = vsocket->socket_fd;
+ continue;
+ }
+ }
+ } else {
+ vhost_user_add_connection(reconn->fd,
+ reconn->vsocket);
+ }
remove_fd:
TAILQ_REMOVE(&reconn_list.head, reconn, next);
free(reconn);
@@ -505,7 +636,7 @@ vhost_user_reconnect_init(void)
TAILQ_INIT(&reconn_list.head);
ret = rte_ctrl_thread_create(&reconn_tid, "vhost_reconn", NULL,
- vhost_user_client_reconnect, NULL);
+ vhost_user_reconnect, NULL);
if (ret != 0) {
VHOST_LOG_CONFIG(ERR, "failed to create reconnect thread");
if (pthread_mutex_destroy(&reconn_list.mutex)) {
@@ -518,18 +649,25 @@ vhost_user_reconnect_init(void)
}
static int
-vhost_user_start_client(struct vhost_user_socket *vsocket)
+vhost_user_start(struct vhost_user_socket *vsocket)
{
int ret;
int fd = vsocket->socket_fd;
- const char *path = vsocket->path;
+ const char *path = vsocket->un.sun_path;
struct vhost_user_reconnect *reconn;
ret = vhost_user_connect_nonblock(fd, (struct sockaddr *)&vsocket->un,
- sizeof(vsocket->un));
+ sizeof(vsocket->un),
+ !vsocket->is_broker);
if (ret == 0) {
- vhost_user_add_connection(fd, vsocket);
- return 0;
+ if (!vsocket->is_broker) {
+ vhost_user_add_connection(fd, vsocket);
+ return 0;
+ } else if (vhost_user_start_broker_connection(fd, vsocket)) {
+ ret = recreate_unix_socket(vsocket);
+ } else {
+ return 0;
+ }
}
VHOST_LOG_CONFIG(WARNING,
@@ -822,6 +960,11 @@ rte_vhost_driver_get_queue_num(const char *path, uint32_t *queue_num)
static void
vhost_user_socket_mem_free(struct vhost_user_socket *vsocket)
{
+ if (vsocket && vsocket->broker_key) {
+ free(vsocket->broker_key);
+ vsocket->broker_key = NULL;
+ }
+
if (vsocket && vsocket->path) {
free(vsocket->path);
vsocket->path = NULL;
@@ -946,15 +1089,50 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
#endif
}
- if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
+ if ((flags & RTE_VHOST_USER_SOCKETPAIR_BROKER) != 0) {
+#ifndef RTE_LIBRTE_VHOST_SOCKETPAIR_BROKER
+ VHOST_LOG_CONFIG(ERR,
+ "SocketPair Broker requested but not compiled\n");
+ ret = -1;
+ goto out_mutex;
+#endif
+ char *broker_key = strstr(vsocket->path, ",broker-key=");
+
+ if (!broker_key || !broker_key[12]) {
+ VHOST_LOG_CONFIG(ERR,
+ "Connection to SocketPair Broker requested but"
+ " key is not provided: %s\n", vsocket->path);
+ ret = -1;
+ goto out_mutex;
+ }
+ vsocket->is_broker = true;
+ vsocket->broker_key = strdup(broker_key + 12);
+ if (vsocket->broker_key == NULL) {
+ VHOST_LOG_CONFIG(ERR,
+ "error: failed to copy broker key\n");
+ ret = -1;
+ goto out_mutex;
+ }
+ }
+
+ if ((flags & RTE_VHOST_USER_CLIENT) == 0)
+ vsocket->is_server = true;
+
+ if (!vsocket->is_server || vsocket->is_broker) {
vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
+ if (vsocket->is_broker && !vsocket->reconnect) {
+ VHOST_LOG_CONFIG(ERR,
+ "SocketPair Broker with NO_RECONNECT "
+ "is not supported\n");
+ ret = -1;
+ goto out_mutex;
+ }
if (vsocket->reconnect && reconn_tid == 0) {
if (vhost_user_reconnect_init() != 0)
goto out_mutex;
}
- } else {
- vsocket->is_server = true;
}
+
ret = create_unix_socket(vsocket);
if (ret < 0) {
goto out_mutex;
@@ -1052,7 +1230,23 @@ rte_vhost_driver_unregister(const char *path)
}
pthread_mutex_unlock(&vsocket->conn_mutex);
- if (vsocket->is_server) {
+ if (vsocket->reconnect) {
+ if (vhost_user_remove_reconnect(vsocket)) {
+ /*
+ * reconn->fd is a socket_fd for
+ * client and broker connections and
+ * it's closed now.
+ */
+ vsocket->socket_fd = -1;
+ }
+ }
+
+ /*
+ * socket_fd is still valid for server connection
+ * or broker connection that is currently connected
+ * to the broker.
+ */
+ if (vsocket->socket_fd != -1) {
/*
* If r/wcb is executing, release vhost_user's
* mutex lock, and try again since the r/wcb
@@ -1063,13 +1257,12 @@ rte_vhost_driver_unregister(const char *path)
pthread_mutex_unlock(&vhost_user.mutex);
goto again;
}
-
close(vsocket->socket_fd);
- unlink(path);
- } else if (vsocket->reconnect) {
- vhost_user_remove_reconnect(vsocket);
}
+ if (vsocket->is_server && !vsocket->is_broker)
+ unlink(path);
+
pthread_mutex_destroy(&vsocket->conn_mutex);
vhost_user_socket_mem_free(vsocket);
@@ -1152,8 +1345,8 @@ rte_vhost_driver_start(const char *path)
}
}
- if (vsocket->is_server)
+ if (vsocket->is_server && !vsocket->is_broker)
return vhost_user_start_server(vsocket);
else
- return vhost_user_start_client(vsocket);
+ return vhost_user_start(vsocket);
}
--
2.26.2
* [dpdk-dev] [RFC 3/4] net/vhost: add support for SocketPair Broker
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 2/4] vhost: add support for SocketPair Broker Ilya Maximets
@ 2021-03-17 20:25 ` Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 4/4] net/virtio: " Ilya Maximets
` (3 subsequent siblings)
6 siblings, 0 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-17 20:25 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova,
Ilya Maximets
New configuration option "broker-key" to identify that "iface" points
to the socket of SocketPair Broker and provide the pairing key.
Some functions inside rte_eth_vhost.c are using socket path as a
unique identifier, but that is not true. Simply concatinating
key to iface name to avoid big code refactoring. And vhost library
already expects the socket path in this format.
Ex.:
--vdev="eth_vhost0,iface=./one.socket,broker-key=mykey,client=1"
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
doc/guides/nics/vhost.rst | 5 ++++
drivers/net/vhost/rte_eth_vhost.c | 42 +++++++++++++++++++++++++------
2 files changed, 39 insertions(+), 8 deletions(-)
diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index ee802ec4a8..1e6b8464c7 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -34,6 +34,11 @@ The user can specify below arguments in `--vdev` option.
It is used to specify a path to connect to a QEMU virtio-net device.
+#. ``broker-key``:
+
+ It is used to specify that ``iface`` points to a SocketPair Broker and
+ value is used as a pairing key.
+
#. ``queues``:
It is used to specify the number of queues virtio-net device has.
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index d198fc8a8e..1b0ca47b47 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -27,6 +27,7 @@ RTE_LOG_REGISTER(vhost_logtype, pmd.net.vhost, NOTICE);
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
#define ETH_VHOST_IFACE_ARG "iface"
+#define ETH_VHOST_BROKER_KEY_ARG "broker-key"
#define ETH_VHOST_QUEUES_ARG "queues"
#define ETH_VHOST_CLIENT_ARG "client"
#define ETH_VHOST_IOMMU_SUPPORT "iommu-support"
@@ -38,6 +39,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
static const char *valid_arguments[] = {
ETH_VHOST_IFACE_ARG,
+ ETH_VHOST_BROKER_KEY_ARG,
ETH_VHOST_QUEUES_ARG,
ETH_VHOST_CLIENT_ARG,
ETH_VHOST_IOMMU_SUPPORT,
@@ -104,6 +106,7 @@ struct vhost_queue {
struct pmd_internal {
rte_atomic32_t dev_attached;
char *iface_name;
+ char *broker_key;
uint64_t flags;
uint64_t disable_flags;
uint16_t max_queues;
@@ -1403,14 +1406,15 @@ static const struct eth_dev_ops ops = {
static int
eth_dev_vhost_create(struct rte_vdev_device *dev, char *iface_name,
- int16_t queues, const unsigned int numa_node, uint64_t flags,
- uint64_t disable_flags)
+ char *broker_key, int16_t queues, const unsigned int numa_node,
+ uint64_t flags, uint64_t disable_flags)
{
const char *name = rte_vdev_device_name(dev);
struct rte_eth_dev_data *data;
struct pmd_internal *internal = NULL;
struct rte_eth_dev *eth_dev = NULL;
struct rte_ether_addr *eth_addr = NULL;
+ int iface_name_len, name_len;
VHOST_LOG(INFO, "Creating VHOST-USER backend on numa socket %u\n",
numa_node);
@@ -1434,11 +1438,22 @@ eth_dev_vhost_create(struct rte_vdev_device *dev, char *iface_name,
* - and point eth_dev structure to new eth_dev_data structure
*/
internal = eth_dev->data->dev_private;
- internal->iface_name = rte_malloc_socket(name, strlen(iface_name) + 1,
+
+ iface_name_len = strlen(iface_name);
+ name_len = iface_name_len;
+ if (broker_key)
+ name_len += strlen(broker_key) + 12;
+
+ internal->iface_name = rte_malloc_socket(name, name_len + 1,
0, numa_node);
if (internal->iface_name == NULL)
goto error;
+
strcpy(internal->iface_name, iface_name);
+ if (broker_key) {
+ strcpy(internal->iface_name + iface_name_len, ",broker-key=");
+ strcpy(internal->iface_name + iface_name_len + 12, broker_key);
+ }
data->nb_rx_queues = queues;
data->nb_tx_queues = queues;
@@ -1471,14 +1486,14 @@ eth_dev_vhost_create(struct rte_vdev_device *dev, char *iface_name,
}
static inline int
-open_iface(const char *key __rte_unused, const char *value, void *extra_args)
+open_str(const char *key __rte_unused, const char *value, void *extra_args)
{
- const char **iface_name = extra_args;
+ const char **str = extra_args;
if (value == NULL)
return -1;
- *iface_name = value;
+ *str = value;
return 0;
}
@@ -1504,6 +1519,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
struct rte_kvargs *kvlist = NULL;
int ret = 0;
char *iface_name;
+ char *broker_key = NULL;
uint16_t queues;
uint64_t flags = 0;
uint64_t disable_flags = 0;
@@ -1540,7 +1556,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
if (rte_kvargs_count(kvlist, ETH_VHOST_IFACE_ARG) == 1) {
ret = rte_kvargs_process(kvlist, ETH_VHOST_IFACE_ARG,
- &open_iface, &iface_name);
+ &open_str, &iface_name);
if (ret < 0)
goto out_free;
} else {
@@ -1548,6 +1564,16 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
goto out_free;
}
+ if (rte_kvargs_count(kvlist, ETH_VHOST_BROKER_KEY_ARG) == 1) {
+ ret = rte_kvargs_process(kvlist, ETH_VHOST_BROKER_KEY_ARG,
+ &open_str, &broker_key);
+ if (ret < 0)
+ goto out_free;
+
+ if (broker_key)
+ flags |= RTE_VHOST_USER_SOCKETPAIR_BROKER;
+ }
+
if (rte_kvargs_count(kvlist, ETH_VHOST_QUEUES_ARG) == 1) {
ret = rte_kvargs_process(kvlist, ETH_VHOST_QUEUES_ARG,
&open_int, &queues);
@@ -1625,7 +1651,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
if (dev->device.numa_node == SOCKET_ID_ANY)
dev->device.numa_node = rte_socket_id();
- ret = eth_dev_vhost_create(dev, iface_name, queues,
+ ret = eth_dev_vhost_create(dev, iface_name, broker_key, queues,
dev->device.numa_node, flags, disable_flags);
if (ret == -1)
VHOST_LOG(ERR, "Failed to create %s\n", name);
--
2.26.2
* [dpdk-dev] [RFC 4/4] net/virtio: add support for SocketPair Broker
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
` (2 preceding siblings ...)
2021-03-17 20:25 ` [dpdk-dev] [RFC 3/4] net/vhost: " Ilya Maximets
@ 2021-03-17 20:25 ` Ilya Maximets
2021-03-18 17:52 ` [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Stefan Hajnoczi
` (2 subsequent siblings)
6 siblings, 0 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-17 20:25 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova,
Ilya Maximets
Add a new configuration option, 'broker-key', to specify that the socket
path is actually a path to the SocketPair Broker's socket. The value
of the 'broker-key' argument will be used as the broker key, so the
broker will be able to identify which vhost device this virtio-user
port should be paired with.
Ex.:
--vdev="net_virtio_user,path=broker.socket,broker-key=<key>,server=1"
libspbroker is needed for the build.
A few tricks had to be implemented to trigger interrupts by alarm for
the case where no sockets are open at all, i.e. the broker is
disconnected and the vhost connection is dead. OTOH, this infrastructure
might be used later to implement client reconnection for virtio-user.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
doc/guides/nics/virtio.rst | 5 +
drivers/net/virtio/meson.build | 6 +
drivers/net/virtio/virtio_user/vhost_user.c | 118 +++++++++++++++++-
.../net/virtio/virtio_user/virtio_user_dev.c | 92 ++++++++++----
.../net/virtio/virtio_user/virtio_user_dev.h | 4 +-
drivers/net/virtio/virtio_user_ethdev.c | 30 ++++-
6 files changed, 227 insertions(+), 28 deletions(-)
diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 02e74a6e77..e3ca5b57f8 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -376,6 +376,11 @@ Below devargs are supported by the virtio-user vdev:
It is used to specify a path to connect to vhost backend.
+#. ``broker-key``:
+
+ It is used to specify that ``path`` points to a SocketPair Broker, value
+ is used as a pairing key.
+
#. ``mac``:
It is used to specify the MAC address.
diff --git a/drivers/net/virtio/meson.build b/drivers/net/virtio/meson.build
index d595cfdcab..aada19d8da 100644
--- a/drivers/net/virtio/meson.build
+++ b/drivers/net/virtio/meson.build
@@ -7,6 +7,12 @@ if is_windows
subdir_done()
endif
+spbroker_dep = dependency('spbroker', required: false)
+if spbroker_dep.found()
+ dpdk_conf.set('VIRTIO_SOCKETPAIR_BROKER', 1)
+ ext_deps += spbroker_dep
+endif
+
sources += files('virtio.c',
'virtio_ethdev.c',
'virtio_pci_ethdev.c',
diff --git a/drivers/net/virtio/virtio_user/vhost_user.c b/drivers/net/virtio/virtio_user/vhost_user.c
index 25f74c625b..9d87cda2f5 100644
--- a/drivers/net/virtio/virtio_user/vhost_user.c
+++ b/drivers/net/virtio/virtio_user/vhost_user.c
@@ -11,6 +11,11 @@
#include <string.h>
#include <errno.h>
+#ifdef VIRTIO_SOCKETPAIR_BROKER
+#include <socketpair-broker/helper.h>
+#endif
+
+
#include <rte_alarm.h>
#include <rte_string_fns.h>
#include <rte_fbarray.h>
@@ -21,6 +26,7 @@
struct vhost_user_data {
int vhostfd;
int listenfd;
+ int brokerfd;
uint64_t protocol_features;
};
@@ -790,12 +796,78 @@ vhost_user_server_disconnect(struct virtio_user_dev *dev)
return 0;
}
+static void
+vhost_user_update_broker_fd(struct virtio_user_dev *dev __rte_unused)
+{
+#ifdef VIRTIO_SOCKETPAIR_BROKER
+ struct vhost_user_data *data = dev->backend_data;
+ char *err = NULL;
+ int flags;
+
+ if (data->brokerfd != -1)
+ return;
+
+ data->brokerfd = sp_broker_connect(dev->path, false, &err);
+ if (data->brokerfd < 0) {
+ PMD_DRV_LOG(WARNING, "failed to connect to broker: %s", err);
+ free(err);
+ data->brokerfd = -1;
+ return;
+ }
+
+ if (sp_broker_send_get_pair(data->brokerfd, dev->broker_key,
+ dev->is_server, &err)) {
+ PMD_DRV_LOG(WARNING,
+ "failed to send GET_PAIR request: %s", err);
+ free(err);
+ close(data->brokerfd);
+ data->brokerfd = -1;
+ return;
+ }
+
+ flags = fcntl(data->brokerfd, F_GETFL);
+ if (fcntl(data->brokerfd, F_SETFL, flags | O_NONBLOCK) == -1) {
+ PMD_DRV_LOG(ERR, "error setting O_NONBLOCK flag");
+ close(data->brokerfd);
+ data->brokerfd = -1;
+ return;
+ }
+#endif
+}
+
static int
vhost_user_server_reconnect(struct virtio_user_dev *dev)
{
struct vhost_user_data *data = dev->backend_data;
int fd;
+#ifdef VIRTIO_SOCKETPAIR_BROKER
+ if (dev->broker_key) {
+ char *err;
+
+ vhost_user_update_broker_fd(dev);
+ if (data->brokerfd == -1)
+ return -1;
+
+ fd = sp_broker_receive_set_pair(data->brokerfd, &err);
+ if (fd < 0) {
+ PMD_DRV_LOG(DEBUG,
+ "failed to receive SET_PAIR: %s", err);
+ free(err);
+ /*
+ * Unfortunately, since connection is non-blocking
+ * we can't reliably say if the other end is dead for
+ * a case where it didn't close the socket gracefully.
+ * Closing current connection to re-establish later.
+ */
+ close(data->brokerfd);
+ data->brokerfd = -1;
+ return -1;
+ }
+ data->vhostfd = fd;
+ return 0;
+ }
+#endif
fd = accept(data->listenfd, NULL, NULL);
if (fd < 0)
return -1;
@@ -832,6 +904,29 @@ vhost_user_setup(struct virtio_user_dev *dev)
data->vhostfd = -1;
data->listenfd = -1;
+ data->brokerfd = -1;
+
+ if (dev->broker_key) {
+#ifdef VIRTIO_SOCKETPAIR_BROKER
+ char *err;
+
+ fd = sp_broker_get_pair(dev->path, dev->broker_key,
+ dev->is_server, &err);
+ if (fd < 0) {
+ PMD_DRV_LOG(ERR,
+ "virtio-user failed to connect to broker: %s",
+ err);
+ free(err);
+ goto err_data;
+ }
+ data->vhostfd = fd;
+ return 0;
+#else
+ PMD_DRV_LOG(ERR, "virtio-user broker connection requested "
+ "but not compiled");
+ goto err_data;
+#endif
+ }
fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0) {
@@ -890,6 +985,11 @@ vhost_user_destroy(struct virtio_user_dev *dev)
data->listenfd = -1;
}
+ if (data->brokerfd >= 0) {
+ close(data->brokerfd);
+ data->brokerfd = -1;
+ }
+
free(data);
dev->backend_data = NULL;
@@ -983,8 +1083,22 @@ vhost_user_get_intr_fd(struct virtio_user_dev *dev)
{
struct vhost_user_data *data = dev->backend_data;
- if (dev->is_server && data->vhostfd == -1)
- return data->listenfd;
+ if (data->vhostfd == -1) {
+ if (dev->broker_key) {
+ vhost_user_update_broker_fd(dev);
+ return data->brokerfd;
+ }
+ if (dev->is_server)
+ return data->listenfd;
+ } else if (data->brokerfd != -1) {
+ /*
+ * Broker not needed anymore and we need to close the socket
+ * because other end will be closed by the broker anyway.
+ */
+ PMD_DRV_LOG(DEBUG, "Closing broker fd: %d", data->brokerfd);
+ close(data->brokerfd);
+ data->brokerfd = -1;
+ }
return data->vhostfd;
}
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.c b/drivers/net/virtio/virtio_user/virtio_user_dev.c
index 68cece42d3..8b53af0724 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.c
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.c
@@ -414,11 +414,16 @@ virtio_user_mem_event_cb(enum rte_mem_event type __rte_unused,
static int
virtio_user_dev_setup(struct virtio_user_dev *dev)
{
- if (dev->is_server) {
- if (dev->backend_type != VIRTIO_USER_BACKEND_VHOST_USER) {
- PMD_DRV_LOG(ERR, "Server mode only supports vhost-user!");
- return -1;
- }
+ if (dev->is_server &&
+ dev->backend_type != VIRTIO_USER_BACKEND_VHOST_USER) {
+ PMD_DRV_LOG(ERR, "Server mode only supports vhost-user!");
+ return -1;
+ }
+
+ if (dev->broker_key &&
+ dev->backend_type != VIRTIO_USER_BACKEND_VHOST_USER) {
+ PMD_DRV_LOG(ERR, "Broker connection only supports vhost-user!");
+ return -1;
}
switch (dev->backend_type) {
@@ -483,10 +488,10 @@ virtio_user_dev_setup(struct virtio_user_dev *dev)
1ULL << VIRTIO_F_RING_PACKED)
int
-virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
- int cq, int queue_size, const char *mac, char **ifname,
- int server, int mrg_rxbuf, int in_order, int packed_vq,
- enum virtio_user_backend_type backend_type)
+virtio_user_dev_init(struct virtio_user_dev *dev, char *path, char **broker_key,
+ int queues, int cq, int queue_size, const char *mac,
+ char **ifname, int server, int mrg_rxbuf, int in_order,
+ int packed_vq, enum virtio_user_backend_type backend_type)
{
uint64_t backend_features;
int i;
@@ -511,6 +516,11 @@ virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
parse_mac(dev, mac);
+ if (*broker_key) {
+ dev->broker_key = *broker_key;
+ *broker_key = NULL;
+ }
+
if (*ifname) {
dev->ifname = *ifname;
*ifname = NULL;
@@ -595,6 +605,10 @@ virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
void
virtio_user_dev_uninit(struct virtio_user_dev *dev)
{
+ struct rte_eth_dev *eth_dev = &rte_eth_devices[dev->port_id];
+
+ rte_eal_alarm_cancel(virtio_interrupt_handler, eth_dev);
+
virtio_user_stop_device(dev);
rte_mem_event_callback_unregister(VIRTIO_USER_MEM_EVENT_CLB_NAME, dev);
@@ -603,9 +617,11 @@ virtio_user_dev_uninit(struct virtio_user_dev *dev)
free(dev->ifname);
- if (dev->is_server)
+ if (dev->is_server && !dev->broker_key)
unlink(dev->path);
+ free(dev->broker_key);
+
dev->ops->destroy(dev);
}
@@ -929,21 +945,44 @@ virtio_user_dev_delayed_intr_reconfig_handler(void *param)
struct virtio_user_dev *dev = param;
struct rte_eth_dev *eth_dev = &rte_eth_devices[dev->port_id];
- PMD_DRV_LOG(DEBUG, "Unregistering intr fd: %d",
- eth_dev->intr_handle->fd);
+ if (rte_intr_disable(eth_dev->intr_handle) < 0) {
+ PMD_DRV_LOG(ERR, "interrupt disable failed");
+ return;
+ }
- if (rte_intr_callback_unregister(eth_dev->intr_handle,
- virtio_interrupt_handler,
- eth_dev) != 1)
- PMD_DRV_LOG(ERR, "interrupt unregister failed");
+ if (eth_dev->intr_handle->fd != -1) {
+ PMD_DRV_LOG(DEBUG, "Unregistering intr fd: %d",
+ eth_dev->intr_handle->fd);
+
+ if (rte_intr_callback_unregister(eth_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev) != 1)
+ PMD_DRV_LOG(ERR, "interrupt unregister failed");
+ }
eth_dev->intr_handle->fd = dev->ops->get_intr_fd(dev);
- PMD_DRV_LOG(DEBUG, "Registering intr fd: %d", eth_dev->intr_handle->fd);
+ rte_eal_alarm_cancel(virtio_interrupt_handler, eth_dev);
+ if (eth_dev->intr_handle->fd == -1) {
+ PMD_DRV_LOG(DEBUG, "Scheduling interrupt as alarm");
+ /*
+ * There is no interrupt source. Setting alarm
+ * instead to try to re-connect later.
+ */
+ rte_eal_alarm_set(US_PER_S, virtio_interrupt_handler,
+ (void *)eth_dev);
+ return;
+ }
+
+ if (eth_dev->intr_handle->fd != -1) {
+ PMD_DRV_LOG(DEBUG, "Registering intr fd: %d",
+ eth_dev->intr_handle->fd);
- if (rte_intr_callback_register(eth_dev->intr_handle,
- virtio_interrupt_handler, eth_dev))
- PMD_DRV_LOG(ERR, "interrupt register failed");
+ if (rte_intr_callback_register(eth_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev))
+ PMD_DRV_LOG(ERR, "interrupt register failed");
+ }
if (rte_intr_enable(eth_dev->intr_handle) < 0)
PMD_DRV_LOG(ERR, "interrupt enable failed");
@@ -963,6 +1002,15 @@ virtio_user_dev_server_reconnect(struct virtio_user_dev *dev)
if (dev->ops->server_reconnect(dev)) {
PMD_DRV_LOG(ERR, "(%s) Reconnect callback call failed", dev->path);
+ if (dev->broker_key) {
+ /*
+ * We might need to re-establish broker connection, so
+ * we need to re-configure interrupts for it.
+ */
+ rte_eal_alarm_set(1,
+ virtio_user_dev_delayed_intr_reconfig_handler,
+ (void *)dev);
+ }
return -1;
}
@@ -1010,10 +1058,6 @@ virtio_user_dev_server_reconnect(struct virtio_user_dev *dev)
}
}
if (eth_dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC) {
- if (rte_intr_disable(eth_dev->intr_handle) < 0) {
- PMD_DRV_LOG(ERR, "interrupt disable failed");
- return -1;
- }
/*
* This function can be called from the interrupt handler, so
* we can't unregister interrupt handler here. Setting
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.h b/drivers/net/virtio/virtio_user/virtio_user_dev.h
index 15d177ccd2..f1d75c76d3 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.h
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.h
@@ -47,6 +47,7 @@ struct virtio_user_dev {
uint8_t mac_addr[RTE_ETHER_ADDR_LEN];
char path[PATH_MAX];
char *ifname;
+ char *broker_key;
union {
struct vring vrings[VIRTIO_MAX_VIRTQUEUES];
@@ -65,7 +66,8 @@ struct virtio_user_dev {
int virtio_user_dev_set_features(struct virtio_user_dev *dev);
int virtio_user_start_device(struct virtio_user_dev *dev);
int virtio_user_stop_device(struct virtio_user_dev *dev);
-int virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
+int virtio_user_dev_init(struct virtio_user_dev *dev, char *path,
+ char **broker_key, int queues,
int cq, int queue_size, const char *mac, char **ifname,
int server, int mrg_rxbuf, int in_order,
int packed_vq,
diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c
index 1810a54694..bace903658 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -315,6 +315,8 @@ static const char *valid_args[] = {
VIRTIO_USER_ARG_SPEED,
#define VIRTIO_USER_ARG_VECTORIZED "vectorized"
VIRTIO_USER_ARG_VECTORIZED,
+#define VIRTIO_USER_ARG_BROKER_KEY "broker-key"
+ VIRTIO_USER_ARG_BROKER_KEY,
NULL
};
@@ -467,6 +469,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *vdev)
uint64_t packed_vq = 0;
uint64_t vectorized = 0;
char *path = NULL;
+ char *broker_key = NULL;
char *ifname = NULL;
char *mac_addr = NULL;
int ret = -1;
@@ -578,6 +581,28 @@ virtio_user_pmd_probe(struct rte_vdev_device *vdev)
}
}
+ if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_BROKER_KEY) == 1) {
+#ifndef VIRTIO_SOCKETPAIR_BROKER
+ PMD_INIT_LOG(ERR,
+ "arg %s requested but support for SocketPair Broker "
+ "is not compiled",
+ VIRTIO_USER_ARG_BROKER_KEY);
+ goto end;
+#endif
+ if (backend_type != VIRTIO_USER_BACKEND_VHOST_USER) {
+ PMD_INIT_LOG(ERR,
+ "arg %s applies only to vhost-user backend",
+ VIRTIO_USER_ARG_BROKER_KEY);
+ goto end;
+ }
+ if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_BROKER_KEY,
+ &get_string_arg, &broker_key) < 0) {
+ PMD_INIT_LOG(ERR, "error to parse %s",
+ VIRTIO_USER_ARG_BROKER_KEY);
+ goto end;
+ }
+ }
+
if (rte_kvargs_count(kvlist, VIRTIO_USER_ARG_CQ_NUM) == 1) {
if (rte_kvargs_process(kvlist, VIRTIO_USER_ARG_CQ_NUM,
&get_integer_arg, &cq) < 0) {
@@ -645,7 +670,7 @@ virtio_user_pmd_probe(struct rte_vdev_device *vdev)
dev = eth_dev->data->dev_private;
hw = &dev->hw;
- if (virtio_user_dev_init(dev, path, queues, cq,
+ if (virtio_user_dev_init(dev, path, &broker_key, queues, cq,
queue_size, mac_addr, &ifname, server_mode,
mrg_rxbuf, in_order, packed_vq, backend_type) < 0) {
PMD_INIT_LOG(ERR, "virtio_user_dev_init fails");
@@ -682,6 +707,8 @@ virtio_user_pmd_probe(struct rte_vdev_device *vdev)
rte_kvargs_free(kvlist);
if (path)
free(path);
+ if (broker_key)
+ free(broker_key);
if (mac_addr)
free(mac_addr);
if (ifname)
@@ -777,6 +804,7 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
"queue_size=<int> "
"queues=<int> "
"iface=<string> "
+ "broker_key=<string> "
"server=<0|1> "
"mrg_rxbuf=<0|1> "
"in_order=<0|1> "
--
2.26.2
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
` (3 preceding siblings ...)
2021-03-17 20:25 ` [dpdk-dev] [RFC 4/4] net/virtio: " Ilya Maximets
@ 2021-03-18 17:52 ` Stefan Hajnoczi
2021-03-18 19:47 ` Ilya Maximets
2021-03-19 14:39 ` Stefan Hajnoczi
2023-06-30 3:45 ` Stephen Hemminger
6 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-18 17:52 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
Hi,
Some questions to understand the problems that SocketPair Broker solves:
> Even more configuration tricks required in order to share some sockets
> between different containers and not only with the host, e.g. to
> create service chains.
How does SocketPair Broker solve this? I guess the idea is that
SocketPair Broker must be started before other containers. That way
applications don't need to sleep and reconnect when a socket isn't
available yet.
On the other hand, the SocketPair Broker might be unavailable (OOM
killer, crash, etc), so applications still need to sleep and reconnect
to the broker itself. I'm not sure the problem has actually been solved
unless there is a reason why the broker is always guaranteed to be
available?
> And some housekeeping usually required for applications in case the
> socket server terminated abnormally and socket files left on a file
> system:
> "failed to bind to vhu: Address already in use; remove it and try again"
QEMU avoids this by unlinking before binding. The drawback is that users
might accidentally hijack an existing listen socket, but that can be
solved with a pidfile.
> Additionally, all applications (system and user's!) should follow
> naming conventions and place socket files in particular location on a
> file system to make things work.
Does SocketPair Broker solve this? Applications now need to use a naming
convention for keys, so it seems like this issue has not been
eliminated.
> This patch-set aims to eliminate most of the inconveniences by
> leveraging an infrastructure service provided by a SocketPair Broker.
I don't understand yet why this is useful for vhost-user, where the
creation of the vhost-user device backend and its use by a VMM are
closely managed by one piece of software:
1. Unlink the socket path.
2. Create, bind, and listen on the socket path.
3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
RPC, spawn a process, etc) and pass in the listen fd.
4. In the meantime the VMM can open the socket path and call connect(2).
As soon as the vhost-user device backend calls accept(2) the
connection will proceed (there is no need for sleeping).
This approach works across containers without a broker.
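For reference, steps 1 and 2 above in plain POSIX look roughly like this
(a sketch with an example path; error handling elided):

  /* Sketch of steps 1-2 above; error handling elided. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  static int
  make_vhost_listen_fd(const char *path)
  {
      struct sockaddr_un un = { .sun_family = AF_UNIX };
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);

      strncpy(un.sun_path, path, sizeof(un.sun_path) - 1);
      unlink(path);                                  /* step 1 */
      bind(fd, (struct sockaddr *)&un, sizeof(un));  /* step 2 */
      listen(fd, 1);
      /* step 3: pass 'fd' to the vhost-user device backend. */
      return fd;
  }
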
BTW what is the security model of the broker? Unlike pathname UNIX
domain sockets there is no ownership permission check.
Stefan
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-18 17:52 ` [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Stefan Hajnoczi
@ 2021-03-18 19:47 ` Ilya Maximets
2021-03-18 20:14 ` Ilya Maximets
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-18 19:47 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> Hi,
> Some questions to understand the problems that SocketPair Broker solves:
>
>> Even more configuration tricks required in order to share some sockets
>> between different containers and not only with the host, e.g. to
>> create service chains.
>
> How does SocketPair Broker solve this? I guess the idea is that
> SocketPair Broker must be started before other containers. That way
> applications don't need to sleep and reconnect when a socket isn't
> available yet.
>
> On the other hand, the SocketPair Broker might be unavailable (OOM
> killer, crash, etc), so applications still need to sleep and reconnect
> to the broker itself. I'm not sure the problem has actually been solved
> unless there is a reason why the broker is always guaranteed to be
> available?
Hi, Stefan. Thanks for your feedback!
The idea is to have the SocketPair Broker running right from the
boot of the host. If it uses systemd socket-based service
activation, the socket should persist while systemd is alive, IIUC.
OOM, crash or restart of the broker should not affect the existence
of the socket, and systemd will spawn the service if it's not running
for any reason, without losing incoming connections.
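For example, socket activation could look roughly like the sketch below
(unit names and paths are hypothetical, and it assumes the daemon can
take the listening socket from systemd via sd_listen_fds()):

  # one-socket.socket (hypothetical)
  [Socket]
  ListenStream=/run/one-socket/one.socket
  SocketMode=0666

  [Install]
  WantedBy=sockets.target

  # one-socket.service (hypothetical)
  [Service]
  ExecStart=/usr/bin/one-socket
  Restart=always
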
>
>> And some housekeeping usually required for applications in case the
>> socket server terminated abnormally and socket files left on a file
>> system:
>> "failed to bind to vhu: Address already in use; remove it and try again"
>
> QEMU avoids this by unlinking before binding. The drawback is that users
> might accidentally hijack an existing listen socket, but that can be
> solved with a pidfile.
How exactly could this be solved with a pidfile? And what if a
different application tries to create a socket on the same path?
E.g. QEMU creates a socket (started in server mode) and the user
accidentally creates a dpdkvhostuser port in Open vSwitch instead of
dpdkvhostuserclient. The rte_vhost library will then try to bind
to an existing socket file and will fail, and subsequently port
creation in OVS will fail. We can't allow OVS to unlink files, because
that would give OVS users the ability to unlink random sockets that OVS
has access to, and we also have no idea whether it was QEMU that created
the file, a virtio-user application or someone else.
There are probably ways to detect whether any live process has this
socket open, but that sounds like too much for this purpose, and I'm
not sure it's even possible if the actual user is in a different
container.
So I don't see a good, reliable way to detect these conditions. It
falls on the shoulders of higher-level management software or the user
to clean these socket files up before adding ports.
>
>> Additionally, all applications (system and user's!) should follow
>> naming conventions and place socket files in particular location on a
>> file system to make things work.
>
> Does SocketPair Broker solve this? Applications now need to use a naming
> convention for keys, so it seems like this issue has not been
> eliminated.
The key is an arbitrary sequence of bytes, so it's hard to call it a
naming convention. But applications do need to know the keys, you're
right. And to be careful I said "eliminates most of the inconveniences". :)
>
>> This patch-set aims to eliminate most of the inconveniences by
>> leveraging an infrastructure service provided by a SocketPair Broker.
>
> I don't understand yet why this is useful for vhost-user, where the
> creation of the vhost-user device backend and its use by a VMM are
> closely managed by one piece of software:
>
> 1. Unlink the socket path.
> 2. Create, bind, and listen on the socket path.
> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> RPC, spawn a process, etc) and pass in the listen fd.
> 4. In the meantime the VMM can open the socket path and call connect(2).
> As soon as the vhost-user device backend calls accept(2) the
> connection will proceed (there is no need for sleeping).
>
> This approach works across containers without a broker.
Not sure if I fully understood the question here, but anyway:
this approach works fine if you know which application to run.
In the case of a k8s cluster, it might be a random DPDK application
with virtio-user ports running inside a container that wants to
have a network connection. Also, this application needs to run
virtio-user in server mode, otherwise a restart of OVS will
require a restart of the application. So, you basically need to
rely on a third-party application to create a socket with the right
name and in the correct location that is shared with the host, so
OVS can find it and connect.
In the VM world everything is much simpler, since you have
libvirt and QEMU to take care of all of this, and they are also
under full control of the management software and the system
administrator.
In the case of a container with a "random" DPDK application inside,
there is no such entity that can help. Of course, some solution
might be implemented in the docker/podman daemon to create and manage
outward-facing sockets for an application inside the container,
but that is not available today AFAIK and I'm not sure it ever will be.
>
> BTW what is the security model of the broker? Unlike pathname UNIX
> domain sockets there is no ownership permission check.
I thought about this. Yes, we have to allow connections to this
socket from a wide group of applications. That might be a problem.
However, the two applications need to know the (at most 1024 byte)
key in order to connect to each other. This might be considered a
sufficient security model as long as these keys are not predictable.
Suggestions on how to make this more secure are welcome.
If it's really necessary to completely isolate some connections from
others, one more broker could be started. But I'm not sure what the
use case for that would be.
The broker itself closes its side of the socketpair, so the
connection between the two applications is direct and should be
secure as long as the kernel doesn't allow other processes to
intercept data on arbitrary unix sockets.
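For context, the pairing step in the broker boils down to roughly the
following (a simplified sketch; the real One Socket daemon adds the
protocol framing, policies and error handling):

    /* Simplified sketch: create a connected pair and send one end to
     * each client over its broker connection via SCM_RIGHTS. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static int send_fd(int conn, int fd)
    {
        char data = 0;
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        char ctrl[CMSG_SPACE(sizeof fd)];
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof ctrl,
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof fd);
        memcpy(CMSG_DATA(cmsg), &fd, sizeof fd);
        return sendmsg(conn, &msg, 0) < 0 ? -1 : 0;
    }

    static int pair_clients(int client_a, int client_b)
    {
        int sp[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) < 0)
            return -1;
        /* Each client gets one side; the broker keeps nothing. */
        send_fd(client_a, sp[0]);
        send_fd(client_b, sp[1]);
        close(sp[0]);
        close(sp[1]);
        return 0;
    }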
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-18 19:47 ` Ilya Maximets
@ 2021-03-18 20:14 ` Ilya Maximets
2021-03-19 14:16 ` Stefan Hajnoczi
2021-03-19 8:51 ` Marc-André Lureau
2021-03-19 14:05 ` Stefan Hajnoczi
2 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-18 20:14 UTC (permalink / raw)
To: Ilya Maximets, Stefan Hajnoczi
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On 3/18/21 8:47 PM, Ilya Maximets wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>> Hi,
>> Some questions to understand the problems that SocketPair Broker solves:
>>
>>> Even more configuration tricks required in order to share some sockets
>>> between different containers and not only with the host, e.g. to
>>> create service chains.
>>
>> How does SocketPair Broker solve this? I guess the idea is that
>> SocketPair Broker must be started before other containers. That way
>> applications don't need to sleep and reconnect when a socket isn't
>> available yet.
>>
>> On the other hand, the SocketPair Broker might be unavailable (OOM
>> killer, crash, etc), so applications still need to sleep and reconnect
>> to the broker itself. I'm not sure the problem has actually been solved
>> unless there is a reason why the broker is always guaranteed to be
>> available?
>
> Hi, Stefan. Thanks for your feedback!
>
> The idea is to have the SocketPair Broker running right from the
> boot of the host. If it will use a systemd socket-based service
> activation, the socket should persist while systemd is alive, IIUC.
> OOM, crash and restart of the broker should not affect existence
> of the socket and systemd will spawn a service if it's not running
> for any reason without losing incoming connections.
>
>>
>>> And some housekeeping usually required for applications in case the
>>> socket server terminated abnormally and socket files left on a file
>>> system:
>>> "failed to bind to vhu: Address already in use; remove it and try again"
>>
>> QEMU avoids this by unlinking before binding. The drawback is that users
>> might accidentally hijack an existing listen socket, but that can be
>> solved with a pidfile.
>
> How exactly this could be solved with a pidfile? And what if this is
> a different application that tries to create a socket on a same path?
> e.g. QEMU creates a socket (started in a server mode) and user
> accidentally created dpdkvhostuser port in Open vSwitch instead of
> dpdkvhostuserclient. This way rte_vhost library will try to bind
> to an existing socket file and will fail. Subsequently port creation
> in OVS will fail. We can't allow OVS to unlink files because this
> way OVS users will have ability to unlink random sockets that OVS has
> access to and we also has no idea if it's a QEMU that created a file
> or it was a virtio-user application or someone else.
> There are, probably, ways to detect if there is any alive process that
> has this socket open, but that sounds like too much for this purpose,
> also I'm not sure if it's possible if actual user is in a different
> container.
> So I don't see a good reliable way to detect these conditions. This
> falls on shoulders of a higher level management software or a user to
> clean these socket files up before adding ports.
>
>>
>>> Additionally, all applications (system and user's!) should follow
>>> naming conventions and place socket files in particular location on a
>>> file system to make things work.
>>
>> Does SocketPair Broker solve this? Applications now need to use a naming
>> convention for keys, so it seems like this issue has not been
>> eliminated.
>
> Key is an arbitrary sequence of bytes, so it's hard to call it a naming
> convention. But they need to know keys, you're right. And to be
> careful I said "eliminates most of the inconveniences". :)
>
>>
>>> This patch-set aims to eliminate most of the inconveniences by
>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>
>> I don't understand yet why this is useful for vhost-user, where the
>> creation of the vhost-user device backend and its use by a VMM are
>> closely managed by one piece of software:
>>
>> 1. Unlink the socket path.
>> 2. Create, bind, and listen on the socket path.
>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>> RPC, spawn a process, etc) and pass in the listen fd.
>> 4. In the meantime the VMM can open the socket path and call connect(2).
>> As soon as the vhost-user device backend calls accept(2) the
>> connection will proceed (there is no need for sleeping).
>>
>> This approach works across containers without a broker.
>
> Not sure if I fully understood a question here, but anyway.
>
> This approach works fine if you know what application to run.
> In case of a k8s cluster, it might be a random DPDK application
> with virtio-user ports running inside a container and want to
> have a network connection. Also, this application needs to run
> virtio-user in server mode, otherwise restart of the OVS will
> require restart of the application. So, you basically need to
> rely on a third-party application to create a socket with a right
> name and in a correct location that is shared with a host, so
> OVS can find it and connect.
>
> In a VM world everything is much more simple, since you have
> a libvirt and QEMU that will take care of all of these stuff
> and which are also under full control of management software
> and a system administrator.
> In case of a container with a "random" DPDK application inside
> there is no such entity that can help. Of course, some solution
> might be implemented in docker/podman daemon to create and manage
> outside-looking sockets for an application inside the container,
> but that is not available today AFAIK and I'm not sure if it
> ever will.
>
>>
>> BTW what is the security model of the broker? Unlike pathname UNIX
>> domain sockets there is no ownership permission check.
>
> I thought about this. Yes, we should allow connection to this socket
> for a wide group of applications. That might be a problem.
> However, 2 applications need to know the 1024 (at most) byte key in
> order to connect to each other. This might be considered as a
> sufficient security model in case these keys are not predictable.
> Suggestions on how to make this more secure are welcome.
Digging more into unix sockets, I think the broker might use
SO_PEERCRED to identify at least the uid and gid of a client.
This way we can implement policies, e.g. one client might
request to be paired only with clients from the same group or
from the same user.
This is actually a great extension for the SocketPair Broker Protocol.
We might even use SO_PEERSEC to enforce even stricter policies
based on the SELinux context.
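Roughly, such a check could look like this (a sketch of a possible
policy hook; nothing like this is implemented in the broker yet, and
the policy shown is just an example):

    /* Sketch: check a client's credentials with SO_PEERCRED before
     * deciding whether it may be paired. */
    #define _GNU_SOURCE
    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static bool peer_allowed(int client_fd, uid_t want_uid, gid_t want_gid)
    {
        struct ucred cred;
        socklen_t len = sizeof cred;

        if (getsockopt(client_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
            return false;
        /* Example policy: only pair clients from the same user or group. */
        return cred.uid == want_uid || cred.gid == want_gid;
    }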
>
> If it's really necessary to completely isolate some connections
> from other ones, one more broker could be started. But I'm not
> sure what the case it should be.
>
> Broker itself closes the socketpair on its side, so the connection
> between 2 applications is direct and should be secure as far as
> kernel doesn't allow other system processes to intercept data on
> arbitrary unix sockets.
>
> Best regards, Ilya Maximets.
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-18 19:47 ` Ilya Maximets
2021-03-18 20:14 ` Ilya Maximets
@ 2021-03-19 8:51 ` Marc-André Lureau
2021-03-19 11:25 ` Ilya Maximets
2021-03-19 14:05 ` Stefan Hajnoczi
2 siblings, 1 reply; 38+ messages in thread
From: Marc-André Lureau @ 2021-03-19 8:51 UTC (permalink / raw)
To: Ilya Maximets
Cc: Stefan Hajnoczi, Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno,
Julia Suvorova, Daniel Berrange
Hi
On Thu, Mar 18, 2021 at 11:47 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > Hi,
> > Some questions to understand the problems that SocketPair Broker solves:
> >
> >> Even more configuration tricks required in order to share some sockets
> >> between different containers and not only with the host, e.g. to
> >> create service chains.
> >
> > How does SocketPair Broker solve this? I guess the idea is that
> > SocketPair Broker must be started before other containers. That way
> > applications don't need to sleep and reconnect when a socket isn't
> > available yet.
> >
> > On the other hand, the SocketPair Broker might be unavailable (OOM
> > killer, crash, etc), so applications still need to sleep and reconnect
> > to the broker itself. I'm not sure the problem has actually been solved
> > unless there is a reason why the broker is always guaranteed to be
> > available?
>
> Hi, Stefan. Thanks for your feedback!
>
> The idea is to have the SocketPair Broker running right from the
> boot of the host. If it will use a systemd socket-based service
> activation, the socket should persist while systemd is alive, IIUC.
> OOM, crash and restart of the broker should not affect existence
> of the socket and systemd will spawn a service if it's not running
> for any reason without losing incoming connections.
>
>
Since the solution relies on systemd, why not use DBus to perform
authentication, service discovery and set up the socketpair between
peers? You don't need an extra service broker in this case.
When the org.foo service shows up, call org.foo.Connect() to return
the fd of the client end (or throw an error, etc.).
I don't think establishing a socketpair connection between process
peers sharing some ID, without any other context, is going to be so
useful. The relation is usually not symmetrical, and you usually
have associated setup/configuration details.
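To be a bit more concrete about the DBus suggestion above, the client
side would look roughly like this (a GDBus sketch; "org.foo" and the
Connect() method are placeholders, not an existing API):

    /* Rough sketch: call a hypothetical org.foo.Connect() method and
     * receive the client-end fd in the reply's fd list. */
    #include <gio/gio.h>
    #include <gio/gunixfdlist.h>

    static int get_fd_from_service(GError **error)
    {
        GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SESSION, NULL, error);
        GUnixFDList *fds = NULL;
        GVariant *reply;
        int fd = -1;

        if (!bus)
            return -1;
        reply = g_dbus_connection_call_with_unix_fd_list_sync(
            bus, "org.foo", "/org/foo", "org.foo", "Connect",
            NULL, NULL, G_DBUS_CALL_FLAGS_NONE, -1,
            NULL, &fds, NULL, error);
        if (reply) {
            if (fds)
                fd = g_unix_fd_list_get(fds, 0, error); /* dup()-ed fd */
            g_variant_unref(reply);
        }
        g_clear_object(&fds);
        g_object_unref(bus);
        return fd;
    }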
--
Marc-André Lureau
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 8:51 ` Marc-André Lureau
@ 2021-03-19 11:25 ` Ilya Maximets
0 siblings, 0 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-19 11:25 UTC (permalink / raw)
To: Marc-André Lureau, Ilya Maximets
Cc: Stefan Hajnoczi, Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno,
Julia Suvorova, Daniel Berrange
On 3/19/21 9:51 AM, Marc-André Lureau wrote:
> Hi
>
> On Thu, Mar 18, 2021 at 11:47 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > Hi,
> > Some questions to understand the problems that SocketPair Broker solves:
> >
> >> Even more configuration tricks required in order to share some sockets
> >> between different containers and not only with the host, e.g. to
> >> create service chains.
> >
> > How does SocketPair Broker solve this? I guess the idea is that
> > SocketPair Broker must be started before other containers. That way
> > applications don't need to sleep and reconnect when a socket isn't
> > available yet.
> >
> > On the other hand, the SocketPair Broker might be unavailable (OOM
> > killer, crash, etc), so applications still need to sleep and reconnect
> > to the broker itself. I'm not sure the problem has actually been solved
> > unless there is a reason why the broker is always guaranteed to be
> > available?
>
> Hi, Stefan. Thanks for your feedback!
>
> The idea is to have the SocketPair Broker running right from the
> boot of the host. If it will use a systemd socket-based service
> activation, the socket should persist while systemd is alive, IIUC.
> OOM, crash and restart of the broker should not affect existence
> of the socket and systemd will spawn a service if it's not running
> for any reason without losing incoming connections.
>
>
> Since the solution relies on systemd, why not use DBus to perform authentication, service discovery and setup the socketpair between peers? You don't need an extra service broker in this case.
>
> When the org.foo service shows up, call org.foo.Connect() to return the fd of the client end (or throw an error etc)
Yes, that's a possible solution. And I even thought about running
the SocketPair Broker as a dbus service (it's in the 'todo' list
with a question mark). However, there are a few issues with a
DBus-based solution:
1. I'd like to not tie the solution to systemd just because it's
not always required. There are cases where you don't really
need a persistent and never-changing socket file. For example,
the dpdk implementation for vhost and virtio-user in this patch
set will work just fine, i.e. it will re-connect as soon as the
socket is available.
Also, socket-based activation is a cool feature, but it's not the
only way to make a socket file appear before starting a certain
application.
2. It should be easier for a developer of an existing client-server
application to just use a different socket, compared to learning
how to use DBus and how to integrate it into the application.
Especially, it's much easier to use another socket if they want
to keep the traditional way of connecting as an alternative to
connecting via a SocketPair Broker.
3. Unclear security implications. I tried to research how to use
the host DBus from a container and I didn't find a good, convenient
way to do that. Please point me to the correct documentation if
I missed something. The solutions I managed to google include
mounting /run/user/<user-id> or the dbus sessions directory into
the container and copying dbus configuration files.
Some articles also point out that communication is only possible
from privileged containers. To be clear, I know very little
about DBus, so any pointers on how to use it in a convenient way
from inside a container will be appreciated.
>
> I don't think establishing socketpair connection between process peers sharing some ID, without any other context, is going to be so useful. The relation is usually not symmetrical, and you usually have associated setup/configuration details.
There is an "operation mode" that user can specify in a GET_PAIR
request to the SocketPair Broker. It might be NONE, CLIENT or
SERVER. Broker will pair users that provided NONE together as
they likely want to have a symmetrical connection. And it will
pair users that declared themselves as CLIENTs with users that
specified SEVER. This is to ensure that in asymmetrical connections
there will be no two "clients" or two "servers".
See:
https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker-proto-spec.rst
If you have any idea what else could be added to the protocol to
make it better, I'd love to hear.
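Roughly, the pairing rule boils down to this (illustration only; see
the spec above for the actual message layout and field names):

    /* Sketch of the pairing rule described above. */
    enum spb_mode { SPB_MODE_NONE, SPB_MODE_CLIENT, SPB_MODE_SERVER };

    static int modes_compatible(enum spb_mode a, enum spb_mode b)
    {
        if (a == SPB_MODE_NONE && b == SPB_MODE_NONE)
            return 1;                      /* symmetrical connection */
        return (a == SPB_MODE_CLIENT && b == SPB_MODE_SERVER) ||
               (a == SPB_MODE_SERVER && b == SPB_MODE_CLIENT);
    }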
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-18 19:47 ` Ilya Maximets
2021-03-18 20:14 ` Ilya Maximets
2021-03-19 8:51 ` Marc-André Lureau
@ 2021-03-19 14:05 ` Stefan Hajnoczi
2021-03-19 15:29 ` Ilya Maximets
2 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-19 14:05 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> And some housekeeping usually required for applications in case the
> >> socket server terminated abnormally and socket files left on a file
> >> system:
> >> "failed to bind to vhu: Address already in use; remove it and try again"
> >
> > QEMU avoids this by unlinking before binding. The drawback is that users
> > might accidentally hijack an existing listen socket, but that can be
> > solved with a pidfile.
>
> How exactly this could be solved with a pidfile?
A pidfile prevents two instances of the same service from running at the
same time.
The same effect can be achieved by the container orchestrator, systemd,
etc too because it refuses to run the same service twice.
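For example, the usual pidfile pattern is just this (a minimal
sketch; the flock(2) lock disappears with the process, unlike a stale
socket file):

    /* Minimal pidfile sketch: the lock is held for the lifetime of
     * the process and vanishes with it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/file.h>
    #include <unistd.h>

    static int take_pidfile(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_CLOEXEC, 0644);
        char buf[32];

        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX | LOCK_NB) < 0) {  /* another instance is alive */
            close(fd);
            return -1;
        }
        snprintf(buf, sizeof buf, "%ld\n", (long)getpid());
        ftruncate(fd, 0);
        write(fd, buf, strlen(buf));
        return fd;                               /* keep open to hold the lock */
    }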
> And what if this is
> a different application that tries to create a socket on a same path?
> e.g. QEMU creates a socket (started in a server mode) and user
> accidentally created dpdkvhostuser port in Open vSwitch instead of
> dpdkvhostuserclient. This way rte_vhost library will try to bind
> to an existing socket file and will fail. Subsequently port creation
> in OVS will fail. We can't allow OVS to unlink files because this
> way OVS users will have ability to unlink random sockets that OVS has
> access to and we also has no idea if it's a QEMU that created a file
> or it was a virtio-user application or someone else.
If rte_vhost unlinks the socket then the user will find that networking
doesn't work. They can either hot unplug the QEMU vhost-user-net device
or restart QEMU, depending on whether they need to keep the guest
running or not. This is a misconfiguration that is recoverable.
Regarding letting OVS unlink files, I agree that it shouldn't if this
creates a security issue. I don't know the security model of OVS.
> There are, probably, ways to detect if there is any alive process that
> has this socket open, but that sounds like too much for this purpose,
> also I'm not sure if it's possible if actual user is in a different
> container.
> So I don't see a good reliable way to detect these conditions. This
> falls on shoulders of a higher level management software or a user to
> clean these socket files up before adding ports.
Does OVS always run in the same net namespace (pod) as the DPDK
application? If yes, then abstract AF_UNIX sockets can be used. Abstract
AF_UNIX sockets don't have a filesystem path and the socket address
disappears when there is no process listening anymore.
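For reference, a minimal sketch of listening on an abstract socket
(Linux-specific):

    /* Abstract AF_UNIX socket: sun_path starts with a NUL byte, no
     * filesystem entry is created, and the address vanishes once no
     * process listens on it anymore. */
    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int listen_abstract(const char *name)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        socklen_t len;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        /* addr.sun_path[0] == '\0' marks the abstract namespace. */
        strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
        len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
        if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 1) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }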
> >> This patch-set aims to eliminate most of the inconveniences by
> >> leveraging an infrastructure service provided by a SocketPair Broker.
> >
> > I don't understand yet why this is useful for vhost-user, where the
> > creation of the vhost-user device backend and its use by a VMM are
> > closely managed by one piece of software:
> >
> > 1. Unlink the socket path.
> > 2. Create, bind, and listen on the socket path.
> > 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> > RPC, spawn a process, etc) and pass in the listen fd.
> > 4. In the meantime the VMM can open the socket path and call connect(2).
> > As soon as the vhost-user device backend calls accept(2) the
> > connection will proceed (there is no need for sleeping).
> >
> > This approach works across containers without a broker.
>
> Not sure if I fully understood a question here, but anyway.
>
> This approach works fine if you know what application to run.
> In case of a k8s cluster, it might be a random DPDK application
> with virtio-user ports running inside a container and want to
> have a network connection. Also, this application needs to run
> virtio-user in server mode, otherwise restart of the OVS will
> require restart of the application. So, you basically need to
> rely on a third-party application to create a socket with a right
> name and in a correct location that is shared with a host, so
> OVS can find it and connect.
>
> In a VM world everything is much more simple, since you have
> a libvirt and QEMU that will take care of all of these stuff
> and which are also under full control of management software
> and a system administrator.
> In case of a container with a "random" DPDK application inside
> there is no such entity that can help. Of course, some solution
> might be implemented in docker/podman daemon to create and manage
> outside-looking sockets for an application inside the container,
> but that is not available today AFAIK and I'm not sure if it
> ever will.
Wait, when you say there is no entity like management software or a
system administrator, then how does OVS know to instantiate the new
port? I guess something still needs to invoke ovs-ctl add-port?
Can you describe the steps used today (without the broker) for
instantiating a new DPDK app container and connecting it to OVS?
Although my interest is in the vhost-user protocol I think it's
necessary to understand the OVS requirements here and I know little
about them.
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-18 20:14 ` Ilya Maximets
@ 2021-03-19 14:16 ` Stefan Hajnoczi
2021-03-19 15:37 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-19 14:16 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> > On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> BTW what is the security model of the broker? Unlike pathname UNIX
> >> domain sockets there is no ownership permission check.
> >
> > I thought about this. Yes, we should allow connection to this socket
> > for a wide group of applications. That might be a problem.
> > However, 2 applications need to know the 1024 (at most) byte key in
> > order to connect to each other. This might be considered as a
> > sufficient security model in case these keys are not predictable.
> > Suggestions on how to make this more secure are welcome.
>
> Digging more into unix sockets, I think that broker might use
> SO_PEERCRED to identify at least a uid and gid of a client.
> This way we can implement policies, e.g. one client might
> request to pair it only with clients from the same group or
> from the same user.
>
> This is actually a great extension for the SocketPair Broker Protocol.
>
> Might even use SO_PEERSEC to enforce even stricter policies
> based on selinux context.
Some piece of software or an administrator would need to understand the
pid/uid/gid mappings used by specific containers in order to configure
security policies in the broker like "app1 is allowed to connect to
app2's socket". This is probably harder than it looks (and DBus already
has everything to do this and more).
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
` (4 preceding siblings ...)
2021-03-18 17:52 ` [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Stefan Hajnoczi
@ 2021-03-19 14:39 ` Stefan Hajnoczi
2021-03-19 16:11 ` Ilya Maximets
2023-06-30 3:45 ` Stephen Hemminger
6 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-19 14:39 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
Hi Ilya,
By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
is restarted then existing vhost-user connections drop with an error but
QEMU could attempt to reconnect to the UNIX domain socket which the new
OVS instance will set up.
Why is it impossible to reconnect when OVS owns the listen socket?
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 14:05 ` Stefan Hajnoczi
@ 2021-03-19 15:29 ` Ilya Maximets
2021-03-19 17:21 ` Stefan Hajnoczi
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-19 15:29 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> And some housekeeping usually required for applications in case the
>>>> socket server terminated abnormally and socket files left on a file
>>>> system:
>>>> "failed to bind to vhu: Address already in use; remove it and try again"
>>>
>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>> might accidentally hijack an existing listen socket, but that can be
>>> solved with a pidfile.
>>
>> How exactly this could be solved with a pidfile?
>
> A pidfile prevents two instances of the same service from running at the
> same time.
>
> The same effect can be achieved by the container orchestrator, systemd,
> etc too because it refuses to run the same service twice.
Sure. I understand that. My point was that these could be 2 different
applications and they might not know which process to look for.
>
>> And what if this is
>> a different application that tries to create a socket on a same path?
>> e.g. QEMU creates a socket (started in a server mode) and user
>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>> dpdkvhostuserclient. This way rte_vhost library will try to bind
>> to an existing socket file and will fail. Subsequently port creation
>> in OVS will fail. We can't allow OVS to unlink files because this
>> way OVS users will have ability to unlink random sockets that OVS has
>> access to and we also has no idea if it's a QEMU that created a file
>> or it was a virtio-user application or someone else.
>
> If rte_vhost unlinks the socket then the user will find that networking
> doesn't work. They can either hot unplug the QEMU vhost-user-net device
> or restart QEMU, depending on whether they need to keep the guest
> running or not. This is a misconfiguration that is recoverable.
True, it's recoverable, but at a high cost. A restart of a VM is
rarely desirable. And the application inside the guest might not do
well after a hot re-plug of a device that it actively used. I'd
expect a DPDK application that runs inside a guest on some
virtio-net device to crash after this kind of manipulation.
Especially if it uses an older version of DPDK.
>
> Regarding letting OVS unlink files, I agree that it shouldn't if this
> create a security issue. I don't know the security model of OVS.
In general, the privileges of the ovs-vswitchd daemon might be
completely different from the privileges required to invoke control
utilities or to access the configuration database. So, yes, we
should not allow that.
>
>> There are, probably, ways to detect if there is any alive process that
>> has this socket open, but that sounds like too much for this purpose,
>> also I'm not sure if it's possible if actual user is in a different
>> container.
>> So I don't see a good reliable way to detect these conditions. This
>> falls on shoulders of a higher level management software or a user to
>> clean these socket files up before adding ports.
>
> Does OVS always run in the same net namespace (pod) as the DPDK
> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
> AF_UNIX sockets don't have a filesystem path and the socket address
> disappears when there is no process listening anymore.
OVS is usually started right on the host in the main network
namespace. If it's started in a pod, it runs in a separate container
but configured with host networking. Applications almost exclusively
run in separate pods.
>
>>>> This patch-set aims to eliminate most of the inconveniences by
>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>
>>> I don't understand yet why this is useful for vhost-user, where the
>>> creation of the vhost-user device backend and its use by a VMM are
>>> closely managed by one piece of software:
>>>
>>> 1. Unlink the socket path.
>>> 2. Create, bind, and listen on the socket path.
>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>> RPC, spawn a process, etc) and pass in the listen fd.
>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>> As soon as the vhost-user device backend calls accept(2) the
>>> connection will proceed (there is no need for sleeping).
>>>
>>> This approach works across containers without a broker.
>>
>> Not sure if I fully understood a question here, but anyway.
>>
>> This approach works fine if you know what application to run.
>> In case of a k8s cluster, it might be a random DPDK application
>> with virtio-user ports running inside a container and want to
>> have a network connection. Also, this application needs to run
>> virtio-user in server mode, otherwise restart of the OVS will
>> require restart of the application. So, you basically need to
>> rely on a third-party application to create a socket with a right
>> name and in a correct location that is shared with a host, so
>> OVS can find it and connect.
>>
>> In a VM world everything is much more simple, since you have
>> a libvirt and QEMU that will take care of all of these stuff
>> and which are also under full control of management software
>> and a system administrator.
>> In case of a container with a "random" DPDK application inside
>> there is no such entity that can help. Of course, some solution
>> might be implemented in docker/podman daemon to create and manage
>> outside-looking sockets for an application inside the container,
>> but that is not available today AFAIK and I'm not sure if it
>> ever will.
>
> Wait, when you say there is no entity like management software or a
> system administrator, then how does OVS know to instantiate the new
> port? I guess something still needs to invoke ovs-ctl add-port?
I didn't mean that there is no application that configures
everything. Of course, there is. I mean that there is no entity
that abstracts all that socket machinery from the user's
application running inside the container. QEMU hides all the
details of the connection to the vhost backend and presents the
device as a PCI network device to the guest kernel. So, the
application inside the VM doesn't have to care that there is
actually a socket connected to OVS, which implements the backend
and forwards traffic somewhere. For the application it's just a
usual network interface.
But in the container world, the application has to handle all of
that itself by creating a virtio-user device that connects to some
socket, which has OVS on the other side.
>
> Can you describe the steps used today (without the broker) for
> instantiating a new DPDK app container and connecting it to OVS?
> Although my interest is in the vhost-user protocol I think it's
> necessary to understand the OVS requirements here and I know little
> about them.
I might get some details wrong since I last worked with k8s and CNI
plugins ~1.5 years ago, but the basic scheme looks something like
this:
1. The user decides to start a new pod and requests k8s to do that
via cmdline tools or some API calls.
2. The k8s scheduler looks for available resources by asking the
resource manager plugins, finds an appropriate physical host and
asks the kubelet daemon local to that node to launch a new pod
there.
3. kubelet asks the local CNI plugin to allocate network resources
and annotate the pod with the required mount points, devices that
need to be passed in and environment variables.
(This is, IIRC, a gRPC connection. It might be multus-cni
or kuryr-kubernetes or any other CNI plugin. The CNI plugin is
usually deployed as a system DaemonSet, so it runs in a
separate pod.)
4. Assuming that the vhost-user connection is requested in server
mode, the CNI plugin will:
4.1 create a directory for the vhost-user socket.
4.2 add this directory to the pod annotations as a mount point.
4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
by connecting to ovsdb-server directly via JSON-RPC.
It will set the port type to dpdkvhostuserclient and specify
the socket path as a path inside the directory it created.
(OVS will create the port and rte_vhost will enter the
re-connection loop since the socket does not exist yet.)
4.4 set up the socket file location as an environment variable in
the pod annotations.
4.5 report success to kubelet.
5. kubelet will finish all other preparations and resource
allocations and will ask docker/podman to start a container
with all the mount points, devices and environment variables from
the pod annotation.
6. docker/podman starts a container.
It's worth mentioning here that in many cases the initial process
of a container is not the actual application that will use the
vhost-user connection, but likely a shell that will invoke
the actual application.
7. The application starts inside the container, checks the
environment variables (actually, checking of environment variables
usually happens in a shell script that invokes the application with
the correct arguments) and creates a net_virtio_user port in server
mode. At this point the socket file will be created.
(Since we're running a third-party application inside the container
we can only assume that it will do what is written here; it's
the responsibility of the application developer to do the right
thing.)
8. OVS successfully re-connects to the newly created socket in the
shared directory and the vhost-user protocol establishes network
connectivity.
As you can see, there are way too many entities and communication
methods involved. So, passing a pre-opened file descriptor from the
CNI all the way down to the application is not as easy as it is in
the QEMU+libvirt case.
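For reference, step 7 above typically boils down to something like
the following on the application side (a rough sketch; the
VHOST_SOCK_PATH variable name and the fallback socket path are just
examples, not a fixed convention):

    /* Sketch: a containerized DPDK app creating a net_virtio_user
     * port in server mode from an EAL --vdev argument. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <rte_eal.h>

    int main(void)
    {
        const char *sock = getenv("VHOST_SOCK_PATH"); /* e.g. set by the CNI plugin */
        char vdev[256];
        char *argv[] = { "app", "--no-pci", "--in-memory", vdev };

        snprintf(vdev, sizeof vdev,
                 "--vdev=net_virtio_user0,path=%s,server=1,queues=1",
                 sock ? sock : "/var/run/vhostuser/vhu0.sock");
        if (rte_eal_init(4, argv) < 0)
            return 1;
        /* ... the usual port setup and packet processing follow ... */
        return 0;
    }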
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 14:16 ` Stefan Hajnoczi
@ 2021-03-19 15:37 ` Ilya Maximets
2021-03-19 16:01 ` Stefan Hajnoczi
2021-03-19 16:02 ` Marc-André Lureau
0 siblings, 2 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-19 15:37 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
>> On 3/18/21 8:47 PM, Ilya Maximets wrote:
>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> BTW what is the security model of the broker? Unlike pathname UNIX
>>>> domain sockets there is no ownership permission check.
>>>
>>> I thought about this. Yes, we should allow connection to this socket
>>> for a wide group of applications. That might be a problem.
>>> However, 2 applications need to know the 1024 (at most) byte key in
>>> order to connect to each other. This might be considered as a
>>> sufficient security model in case these keys are not predictable.
>>> Suggestions on how to make this more secure are welcome.
>>
>> Digging more into unix sockets, I think that broker might use
>> SO_PEERCRED to identify at least a uid and gid of a client.
>> This way we can implement policies, e.g. one client might
>> request to pair it only with clients from the same group or
>> from the same user.
>>
>> This is actually a great extension for the SocketPair Broker Protocol.
>>
>> Might even use SO_PEERSEC to enforce even stricter policies
>> based on selinux context.
>
> Some piece of software or an administrator would need to understand the
> pid/uid/gid mappings used by specific containers in order to configure
> security policies in the broker like "app1 is allowed to connect to
> app2's socket". This is probably harder than it looks (and DBus already
> has everything to do this and more).
AFAIU, none of the orchestration solutions configure different
access rights for sockets right now. So it probably should not be a
big problem for current setups.
I'd expect the pid/uid/gid to be mapped to the host namespace if
SO_PEERCRED is requested from it. An interesting thing to check,
though.
For DBus, as I mentioned in the other reply, IIUC, it will require
mounting /run/user/<user-id> or parts of it and some other stuff
into the container in order to make it work. It will also probably
require running containers in privileged mode, which will wipe out
most of the security.
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 15:37 ` Ilya Maximets
@ 2021-03-19 16:01 ` Stefan Hajnoczi
2021-03-19 16:02 ` Marc-André Lureau
1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-19 16:01 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On Fri, Mar 19, 2021 at 04:37:01PM +0100, Ilya Maximets wrote:
> On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> >>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> BTW what is the security model of the broker? Unlike pathname UNIX
> >>>> domain sockets there is no ownership permission check.
> >>>
> >>> I thought about this. Yes, we should allow connection to this socket
> >>> for a wide group of applications. That might be a problem.
> >>> However, 2 applications need to know the 1024 (at most) byte key in
> >>> order to connect to each other. This might be considered as a
> >>> sufficient security model in case these keys are not predictable.
> >>> Suggestions on how to make this more secure are welcome.
> >>
> >> Digging more into unix sockets, I think that broker might use
> >> SO_PEERCRED to identify at least a uid and gid of a client.
> >> This way we can implement policies, e.g. one client might
> >> request to pair it only with clients from the same group or
> >> from the same user.
> >>
> >> This is actually a great extension for the SocketPair Broker Protocol.
> >>
> >> Might even use SO_PEERSEC to enforce even stricter policies
> >> based on selinux context.
> >
> > Some piece of software or an administrator would need to understand the
> > pid/uid/gid mappings used by specific containers in order to configure
> > security policies in the broker like "app1 is allowed to connect to
> > app2's socket". This is probably harder than it looks (and DBus already
> > has everything to do this and more).
>
> AFAIU, neither of orchestration solutions configures different access
> rights for sockets right now. So, it, probably, should not be a big
> problem for current setups.
>
> I'd expect pid/uid/gid being mapped to host namespace if SO_PEERCRED
> requested from it. Interesting thing to check, though.
>
> For DBus, as I mentioned in the other reply, IIUC, it will require
> mounting /run/user/*<user-id>* or its bits and some other stuff to the
> container in order to make it work. Also it will, probably, require
> running containers in privileged mode which will wipe out most of the
> security.
Flatpak has sandboxed D-Bus for its application containers:
https://docs.flatpak.org/en/latest/sandbox-permissions.html
"Limited access to the session D-Bus instance - an app can only own its
own name on the bus."
I don't know how it works.
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 15:37 ` Ilya Maximets
2021-03-19 16:01 ` Stefan Hajnoczi
@ 2021-03-19 16:02 ` Marc-André Lureau
1 sibling, 0 replies; 38+ messages in thread
From: Marc-André Lureau @ 2021-03-19 16:02 UTC (permalink / raw)
To: Ilya Maximets
Cc: Stefan Hajnoczi, Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno,
Julia Suvorova, Daniel Berrange
Hi
On Fri, Mar 19, 2021 at 7:37 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> >>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> BTW what is the security model of the broker? Unlike pathname UNIX
> >>>> domain sockets there is no ownership permission check.
> >>>
> >>> I thought about this. Yes, we should allow connection to this socket
> >>> for a wide group of applications. That might be a problem.
> >>> However, 2 applications need to know the 1024 (at most) byte key in
> >>> order to connect to each other. This might be considered as a
> >>> sufficient security model in case these keys are not predictable.
> >>> Suggestions on how to make this more secure are welcome.
> >>
> >> Digging more into unix sockets, I think that broker might use
> >> SO_PEERCRED to identify at least a uid and gid of a client.
> >> This way we can implement policies, e.g. one client might
> >> request to pair it only with clients from the same group or
> >> from the same user.
> >>
> >> This is actually a great extension for the SocketPair Broker Protocol.
> >>
> >> Might even use SO_PEERSEC to enforce even stricter policies
> >> based on selinux context.
> >
> > Some piece of software or an administrator would need to understand the
> > pid/uid/gid mappings used by specific containers in order to configure
> > security policies in the broker like "app1 is allowed to connect to
> > app2's socket". This is probably harder than it looks (and DBus already
> > has everything to do this and more).
>
> AFAIU, neither of orchestration solutions configures different access
> rights for sockets right now. So, it, probably, should not be a big
> problem for current setups.
>
> I'd expect pid/uid/gid being mapped to host namespace if SO_PEERCRED
> requested from it. Interesting thing to check, though.
>
> For DBus, as I mentioned in the other reply, IIUC, it will require
> mounting /run/user/*<user-id>* or its bits and some other stuff to the
> container in order to make it work. Also it will, probably, require
> running containers in privileged mode which will wipe out most of the
> security.
>
Right, if you need to communicate across namespaces, then it becomes less
common.
However, having a DBus socket (as a private bus) exposed in the NS isn't
going to be any different than having the broker socket exposed, unless I
am missing something.
You'll have the same issues discussed earlier about uid mapping, for
peercred authentication to work.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 14:39 ` Stefan Hajnoczi
@ 2021-03-19 16:11 ` Ilya Maximets
2021-03-19 16:45 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-19 16:11 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
> Hi Ilya,
> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
> is restarted then existing vhost-user connections drop with an error but
> QEMU could attempt to reconnect to the UNIX domain socket which the new
> OVS instance will set up.
>
> Why is it impossible to reconnect when OVS owns the listen socket?
Well, AFAIK, qemu reconnects client connections only:
``reconnect`` sets the timeout for reconnecting on non-server
sockets when the remote end goes away. qemu will delay this many
seconds and then attempt to reconnect. Zero disables reconnecting,
and is the default.
I'm not sure about the exact reason. It was historically this way.
For me it doesn't make much sense. I mean, you're right that it's
just a socket, so it should not matter who listens and who connects.
If reconnection is possible in one direction, it should be possible
in the opposite direction too.
dpdkvhostuser was deprecated just to scare users and force them to
migrate to dpdkvhostuserclient, to avoid constant bug reports like:
"OVS service restarted and network is lost now".
BTW, virtio-user ports in DPDK don't support re-connection in client
mode either.
BTW2, with a SocketPair Broker it might be cheaper to implement
server reconnection in QEMU, because all connections in this case
are client connections, i.e. both ends connect() to the broker.
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 16:11 ` Ilya Maximets
@ 2021-03-19 16:45 ` Ilya Maximets
2021-03-24 20:56 ` Maxime Coquelin
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-19 16:45 UTC (permalink / raw)
To: Ilya Maximets, Stefan Hajnoczi
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
On 3/19/21 5:11 PM, Ilya Maximets wrote:
> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>> Hi Ilya,
>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>> is restarted then existing vhost-user connections drop with an error but
>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>> OVS instance will set up.
>>
>> Why is it impossible to reconnect when OVS owns the listen socket?
>
> Well, AFAIK, qemu reconnects client connections only:
>
> ``reconnect`` sets the timeout for reconnecting on non-server
> sockets when the remote end goes away. qemu will delay this many
> seconds and then attempt to reconnect. Zero disables reconnecting,
> and is the default.
>
> I'm not sure about exact reason. It was historically this way.
> For me it doesn't make much sense. I mean, you're right that it's
> just a socket, so it should not matter who listens and who connects.
> If reconnection is possible in one direction, it should be possible
> in the opposite direction too.
Sorry, my thought slipped. :) Yes, QEMU supports re-connection
for client sockets. So, in theory, dpdkvhostuser ports should work
after re-connection. And that would be nice. I don't remember
right now why this doesn't work... Maybe the vhost-user parts in
QEMU don't handle this case. I need to dig some more into that and
refresh my memory. It was so long ago...
Maxime, do you remember?
>
> dpdkvhostuser was deprecated just to scare users and force them to
> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>
> "OVS service restarted and network is lost now".
>
> BTW, virtio-user ports in DPDK doesn't support re-connection in client
> mode too.
This is still true, though. virtio-user in client mode doesn't reconnect.
>
> BTW2, with SocketPair Broker it might be cheaper to implement server
> reconnection in QEMU because all connections in these case are client
> connections, i.e. both ends will connect() to a broker.
>
> Best regards, Ilya Maximets.
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 15:29 ` Ilya Maximets
@ 2021-03-19 17:21 ` Stefan Hajnoczi
2021-03-23 17:57 ` Adrian Moreno
0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-19 17:21 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Julia Suvorova,
Marc-André Lureau, Daniel Berrange
On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> And some housekeeping usually required for applications in case the
> >>>> socket server terminated abnormally and socket files left on a file
> >>>> system:
> >>>> "failed to bind to vhu: Address already in use; remove it and try again"
> >>>
> >>> QEMU avoids this by unlinking before binding. The drawback is that users
> >>> might accidentally hijack an existing listen socket, but that can be
> >>> solved with a pidfile.
> >>
> >> How exactly this could be solved with a pidfile?
> >
> > A pidfile prevents two instances of the same service from running at the
> > same time.
> >
> > The same effect can be achieved by the container orchestrator, systemd,
> > etc too because it refuses to run the same service twice.
>
> Sure. I understand that. My point was that these could be 2 different
> applications and they might not know which process to look for.
>
> >
> >> And what if this is
> >> a different application that tries to create a socket on a same path?
> >> e.g. QEMU creates a socket (started in a server mode) and user
> >> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >> dpdkvhostuserclient. This way rte_vhost library will try to bind
> >> to an existing socket file and will fail. Subsequently port creation
> >> in OVS will fail. We can't allow OVS to unlink files because this
> >> way OVS users will have ability to unlink random sockets that OVS has
> >> access to and we also has no idea if it's a QEMU that created a file
> >> or it was a virtio-user application or someone else.
> >
> > If rte_vhost unlinks the socket then the user will find that networking
> > doesn't work. They can either hot unplug the QEMU vhost-user-net device
> > or restart QEMU, depending on whether they need to keep the guest
> > running or not. This is a misconfiguration that is recoverable.
>
> True, it's recoverable, but with a high cost. Restart of a VM is rarely
> desirable. And the application inside the guest might not feel itself
> well after hot re-plug of a device that it actively used. I'd expect
> a DPDK application that runs inside a guest on some virtio-net device
> to crash after this kind of manipulations. Especially, if it uses some
> older versions of DPDK.
This unlink issue is probably something we think differently about.
There are many ways for users to misconfigure things when working with
system tools. If it's possible to catch misconfigurations that is
preferrable. In this case it's just the way pathname AF_UNIX domain
sockets work and IMO it's better not to have problems starting the
service due to stale files than to insist on preventing
misconfigurations. QEMU and DPDK do this differently and both seem to be
successful, so ¯\_(ツ)_/¯.
> >
> > Regarding letting OVS unlink files, I agree that it shouldn't if this
> > create a security issue. I don't know the security model of OVS.
>
> In general privileges of a ovs-vswitchd daemon might be completely
> different from privileges required to invoke control utilities or
> to access the configuration database. SO, yes, we should not allow
> that.
That can be locked down by restricting the socket path to a file beneath
/var/run/ovs/vhost-user/.
> >
> >> There are, probably, ways to detect if there is any alive process that
> >> has this socket open, but that sounds like too much for this purpose,
> >> also I'm not sure if it's possible if actual user is in a different
> >> container.
> >> So I don't see a good reliable way to detect these conditions. This
> >> falls on shoulders of a higher level management software or a user to
> >> clean these socket files up before adding ports.
> >
> > Does OVS always run in the same net namespace (pod) as the DPDK
> > application? If yes, then abstract AF_UNIX sockets can be used. Abstract
> > AF_UNIX sockets don't have a filesystem path and the socket address
> > disappears when there is no process listening anymore.
>
> OVS is usually started right on the host in a main network namespace.
> In case it's started in a pod, it will run in a separate container but
> configured with a host network. Applications almost exclusively runs
> in separate pods.
Okay.
> >>>> This patch-set aims to eliminate most of the inconveniences by
> >>>> leveraging an infrastructure service provided by a SocketPair Broker.
> >>>
> >>> I don't understand yet why this is useful for vhost-user, where the
> >>> creation of the vhost-user device backend and its use by a VMM are
> >>> closely managed by one piece of software:
> >>>
> >>> 1. Unlink the socket path.
> >>> 2. Create, bind, and listen on the socket path.
> >>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>> RPC, spawn a process, etc) and pass in the listen fd.
> >>> 4. In the meantime the VMM can open the socket path and call connect(2).
> >>> As soon as the vhost-user device backend calls accept(2) the
> >>> connection will proceed (there is no need for sleeping).
> >>>
> >>> This approach works across containers without a broker.
> >>
> >> Not sure if I fully understood a question here, but anyway.
> >>
> >> This approach works fine if you know what application to run.
> >> In case of a k8s cluster, it might be a random DPDK application
> >> with virtio-user ports running inside a container and want to
> >> have a network connection. Also, this application needs to run
> >> virtio-user in server mode, otherwise restart of the OVS will
> >> require restart of the application. So, you basically need to
> >> rely on a third-party application to create a socket with a right
> >> name and in a correct location that is shared with a host, so
> >> OVS can find it and connect.
> >>
> >> In a VM world everything is much more simple, since you have
> >> a libvirt and QEMU that will take care of all of these stuff
> >> and which are also under full control of management software
> >> and a system administrator.
> >> In case of a container with a "random" DPDK application inside
> >> there is no such entity that can help. Of course, some solution
> >> might be implemented in docker/podman daemon to create and manage
> >> outside-looking sockets for an application inside the container,
> >> but that is not available today AFAIK and I'm not sure if it
> >> ever will.
> >
> > Wait, when you say there is no entity like management software or a
> > system administrator, then how does OVS know to instantiate the new
> > port? I guess something still needs to invoke ovs-ctl add-port?
>
> I didn't mean that there is no any application that configures
> everything. Of course, there is. I mean that there is no such
> entity that abstracts all that socket machinery from the user's
> application that runs inside the container. QEMU hides all the
> details of the connection to vhost backend and presents the device
> as a PCI device with a network interface wrapping from the guest
> kernel. So, the application inside VM shouldn't care what actually
> there is a socket connected to OVS that implements backend and
> forward traffic somewhere. For the application it's just a usual
> network interface.
> But in case of a container world, application should handle all
> that by creating a virtio-user device that will connect to some
> socket, that has an OVS on the other side.
>
> >
> > Can you describe the steps used today (without the broker) for
> > instantiating a new DPDK app container and connecting it to OVS?
> > Although my interest is in the vhost-user protocol I think it's
> > necessary to understand the OVS requirements here and I know little
> > about them.
>
> I might describe some things wrong since I worked with k8s and CNI
> plugins last time ~1.5 years ago, but the basic schema will look
> something like this:
>
> 1. user decides to start a new pod and requests k8s to do that
> via cmdline tools or some API calls.
>
> 2. k8s scheduler looks for available resources asking resource
> manager plugins, finds an appropriate physical host and asks
> local to that node kubelet daemon to launch a new pod there.
>
> 3. kubelet asks local CNI plugin to allocate network resources
> and annotate the pod with required mount points, devices that
> needs to be passed in and environment variables.
> (this is, IIRC, a gRPC connection. It might be a multus-cni
> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
> usually deployed as a system DaemonSet, so it runs in a
> separate pod.
>
> 4. Assuming that vhost-user connection requested in server mode.
> CNI plugin will:
> 4.1 create a directory for a vhost-user socket.
> 4.2 add this directory to pod annotations as a mount point.
> 4.3 create a port in OVS by invoking 'ovs-vsctl port-add' or
> by connecting to ovsdb-server by JSONRPC directly.
> It will set port type as dpdkvhostuserclient and specify
> socket-path as a path inside the directory it created.
> (OVS will create a port and rte_vhost will enter the
> re-connection loop since socket does not exist yet.)
> 4.4 Set up socket file location as environment variable in
> pod annotations.
> 4.5 report success to kubelet.
>
> 5. kubelet will finish all other preparations and resource
> allocations and will ask docker/podman to start a container
> with all mount points, devices and environment variables from
> the pod annotation.
>
> 6. docker/podman starts a container.
> Need to mention here that in many cases initial process of
> a container is not the actual application that will use a
> vhost-user connection, but likely a shell that will invoke
> the actual application.
>
> 7. Application starts inside the container, checks the environment
> variables (actually, checking of environment variables usually
> happens in a shell script that invokes the application with
> correct arguments) and creates a net_virtio_user port in server
> mode. At this point socket file will be created.
> (since we're running third-party application inside the container
> we can only assume that it will do what is written here, it's
> a responsibility of an application developer to do the right
> thing.)
>
> 8. OVS successfully re-connects to the newly created socket in a
> shared directory and vhost-user protocol establishes the network
> connectivity.
>
> As you can see, there are way too many entities and communication
> methods involved. So, passing a pre-opened file descriptor from
> CNI all the way down to application is not that easy as it is in
> case of QEMU+LibVirt.
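(To make steps 4.3 and 7 above a bit more concrete, here is a rough
sketch of the commands involved. Bridge, port and socket path names
are placeholders, and the option syntax is quoted from memory of the
OVS and DPDK docs, so treat it as illustration only.)
# Step 4.3 (CNI side): add a dpdkvhostuserclient port; rte_vhost will
# keep re-connecting until the socket appears.
$ ovs-vsctl add-port br0 vhu-pod1 -- set Interface vhu-pod1 \
      type=dpdkvhostuserclient \
      options:vhost-server-path=/var/run/vhu/pod1/vhu-pod1.sock
# Step 7 (inside the container): the application creates that socket
# by running virtio-user in server mode, e.g. with testpmd:
$ dpdk-testpmd --no-pci --in-memory --single-file-segments \
      --vdev=net_virtio_user0,path=/var/run/vhu/pod1/vhu-pod1.sock,server=1 \
      -- -i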
File descriptor passing isn't necessary if OVS owns the listen socket
and the application container is the one that connects. That's why I
asked why dpdkvhostuser was deprecated in another email. The benefit of
doing this would be that the application container can instantly connect
to OVS without a sleep loop.
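For contrast with the dpdkvhostuserclient command sketched above, the
deprecated dpdkvhostuser type reverses the roles: OVS creates and
listens on the socket and the container only has to connect. Again a
sketch with placeholder names:
# OVS acts as the vhost-user server; the socket appears under the OVS
# run directory and the application connects to it.
$ ovs-vsctl add-port br0 vhu0 -- set Interface vhu0 type=dpdkvhostuser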
I still don't get the attraction of the broker idea. The pros:
+ Overcomes the issue with stale UNIX domain socket files
+ Eliminates the re-connect sleep loop
Neutral:
* vhost-user UNIX domain socket directory container volume is replaced
by broker UNIX domain socket bind mount
* UNIX domain socket naming conflicts become broker key naming conflicts
The cons:
- Requires running a new service on the host with potential security
issues
- Requires support in third-party applications, QEMU, and DPDK/OVS
- The old code must be kept for compatibility with non-broker
configurations, especially since third-party applications may not
support the broker. Developers and users will have to learn about both
options and decide which one to use.
This seems like a modest improvement for the complexity and effort
involved. The same pros can be achieved by:
* Adding unlink(2) to rte_vhost (or applications can add rm -f
$PATH_TO_SOCKET to their docker-entrypoint.sh; see the sketch after
this list). The disadvantage is that it doesn't catch a
misconfiguration where the user launches two processes with the same
socket path.
* Reversing the direction of the client/server relationship to
eliminate the re-connect sleep loop at startup. I'm unsure whether
this is possible.
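A minimal sketch of the entrypoint approach from the first bullet
above; PATH_TO_SOCKET and the wrapped application are placeholders:
#!/bin/sh
# docker-entrypoint.sh (sketch): remove a stale vhost-user socket left
# over from a previous run, then hand control to the real application.
rm -f "$PATH_TO_SOCKET"
exec "$@"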
That said, the broker idea doesn't affect the vhost-user protocol itself
and is more of an OVS/DPDK topic. I may just not be familiar enough with
OVS/DPDK to understand the benefits of the approach.
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 17:21 ` Stefan Hajnoczi
@ 2021-03-23 17:57 ` Adrian Moreno
2021-03-23 18:27 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Adrian Moreno @ 2021-03-23 17:57 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Julia Suvorova,
Marc-André Lureau, Daniel Berrange, Billy McFall
On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>> And some housekeeping usually required for applications in case the
>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>> system:
>>>>>> "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>
>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>> solved with a pidfile.
>>>>
>>>> How exactly this could be solved with a pidfile?
>>>
>>> A pidfile prevents two instances of the same service from running at the
>>> same time.
>>>
>>> The same effect can be achieved by the container orchestrator, systemd,
>>> etc too because it refuses to run the same service twice.
>>
>> Sure. I understand that. My point was that these could be 2 different
>> applications and they might not know which process to look for.
>>
>>>
>>>> And what if this is
>>>> a different application that tries to create a socket on a same path?
>>>> e.g. QEMU creates a socket (started in a server mode) and user
>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
>>>> to an existing socket file and will fail. Subsequently port creation
>>>> in OVS will fail. We can't allow OVS to unlink files because this
>>>> way OVS users will have ability to unlink random sockets that OVS has
>>>> access to and we also has no idea if it's a QEMU that created a file
>>>> or it was a virtio-user application or someone else.
>>>
>>> If rte_vhost unlinks the socket then the user will find that networking
>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
>>> or restart QEMU, depending on whether they need to keep the guest
>>> running or not. This is a misconfiguration that is recoverable.
>>
>> True, it's recoverable, but with a high cost. Restart of a VM is rarely
>> desirable. And the application inside the guest might not feel itself
>> well after hot re-plug of a device that it actively used. I'd expect
>> a DPDK application that runs inside a guest on some virtio-net device
>> to crash after this kind of manipulations. Especially, if it uses some
>> older versions of DPDK.
>
> This unlink issue is probably something we think differently about.
> There are many ways for users to misconfigure things when working with
> system tools. If it's possible to catch misconfigurations that is
> preferrable. In this case it's just the way pathname AF_UNIX domain
> sockets work and IMO it's better not to have problems starting the
> service due to stale files than to insist on preventing
> misconfigurations. QEMU and DPDK do this differently and both seem to be
> successful, so ¯\_(ツ)_/¯.
>
>>>
>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>> create a security issue. I don't know the security model of OVS.
>>
>> In general privileges of a ovs-vswitchd daemon might be completely
>> different from privileges required to invoke control utilities or
>> to access the configuration database. SO, yes, we should not allow
>> that.
>
> That can be locked down by restricting the socket path to a file beneath
> /var/run/ovs/vhost-user/.
>
>>>
>>>> There are, probably, ways to detect if there is any alive process that
>>>> has this socket open, but that sounds like too much for this purpose,
>>>> also I'm not sure if it's possible if actual user is in a different
>>>> container.
>>>> So I don't see a good reliable way to detect these conditions. This
>>>> falls on shoulders of a higher level management software or a user to
>>>> clean these socket files up before adding ports.
>>>
>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>> disappears when there is no process listening anymore.
>>
>> OVS is usually started right on the host in a main network namespace.
>> In case it's started in a pod, it will run in a separate container but
>> configured with a host network. Applications almost exclusively runs
>> in separate pods.
>
> Okay.
>
>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>
>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>> closely managed by one piece of software:
>>>>>
>>>>> 1. Unlink the socket path.
>>>>> 2. Create, bind, and listen on the socket path.
>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>> RPC, spawn a process, etc) and pass in the listen fd.
>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>>> As soon as the vhost-user device backend calls accept(2) the
>>>>> connection will proceed (there is no need for sleeping).
>>>>>
>>>>> This approach works across containers without a broker.
>>>>
>>>> Not sure if I fully understood a question here, but anyway.
>>>>
>>>> This approach works fine if you know what application to run.
>>>> In case of a k8s cluster, it might be a random DPDK application
>>>> with virtio-user ports running inside a container and want to
>>>> have a network connection. Also, this application needs to run
>>>> virtio-user in server mode, otherwise restart of the OVS will
>>>> require restart of the application. So, you basically need to
>>>> rely on a third-party application to create a socket with a right
>>>> name and in a correct location that is shared with a host, so
>>>> OVS can find it and connect.
>>>>
>>>> In a VM world everything is much more simple, since you have
>>>> a libvirt and QEMU that will take care of all of these stuff
>>>> and which are also under full control of management software
>>>> and a system administrator.
>>>> In case of a container with a "random" DPDK application inside
>>>> there is no such entity that can help. Of course, some solution
>>>> might be implemented in docker/podman daemon to create and manage
>>>> outside-looking sockets for an application inside the container,
>>>> but that is not available today AFAIK and I'm not sure if it
>>>> ever will.
>>>
>>> Wait, when you say there is no entity like management software or a
>>> system administrator, then how does OVS know to instantiate the new
>>> port? I guess something still needs to invoke ovs-ctl add-port?
>>
>> I didn't mean that there is no any application that configures
>> everything. Of course, there is. I mean that there is no such
>> entity that abstracts all that socket machinery from the user's
>> application that runs inside the container. QEMU hides all the
>> details of the connection to vhost backend and presents the device
>> as a PCI device with a network interface wrapping from the guest
>> kernel. So, the application inside VM shouldn't care what actually
>> there is a socket connected to OVS that implements backend and
>> forward traffic somewhere. For the application it's just a usual
>> network interface.
>> But in case of a container world, application should handle all
>> that by creating a virtio-user device that will connect to some
>> socket, that has an OVS on the other side.
>>
>>>
>>> Can you describe the steps used today (without the broker) for
>>> instantiating a new DPDK app container and connecting it to OVS?
>>> Although my interest is in the vhost-user protocol I think it's
>>> necessary to understand the OVS requirements here and I know little
>>> about them.
>>>> I might describe some things wrong since I worked with k8s and CNI
>> plugins last time ~1.5 years ago, but the basic schema will look
>> something like this:
>>
>> 1. user decides to start a new pod and requests k8s to do that
>> via cmdline tools or some API calls.
>>
>> 2. k8s scheduler looks for available resources asking resource
>> manager plugins, finds an appropriate physical host and asks
>> local to that node kubelet daemon to launch a new pod there.
>>
When the CNI is called, the pod has already been created, i.e., a PodID exists
and so does an associated network namespace. Therefore, anything that is part of
the runtime spec, such as mount points or devices, can no longer be modified at
this point.
That's why the Device Plugin API is used to modify the Pod's spec before the CNI
chain is called.
>> 3. kubelet asks local CNI plugin to allocate network resources
>> and annotate the pod with required mount points, devices that
>> needs to be passed in and environment variables.
>> (this is, IIRC, a gRPC connection. It might be a multus-cni
>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
>> usually deployed as a system DaemonSet, so it runs in a
>> separate pod.
>>
>> 4. Assuming that vhost-user connection requested in server mode.
>> CNI plugin will:
>> 4.1 create a directory for a vhost-user socket.
>> 4.2 add this directory to pod annotations as a mount point.
I believe this is not possible; the CNI plugin would have to inspect the pod's
spec or otherwise determine an existing mount point where the socket should be
created.
+Billy might give more insights on this.
>> 4.3 create a port in OVS by invoking 'ovs-vsctl port-add' or
>> by connecting to ovsdb-server by JSONRPC directly.
>> It will set port type as dpdkvhostuserclient and specify
>> socket-path as a path inside the directory it created.
>> (OVS will create a port and rte_vhost will enter the
>> re-connection loop since socket does not exist yet.)
>> 4.4 Set up socket file location as environment variable in
>> pod annotations.
>> 4.5 report success to kubelet.
>>
Since the CNI cannot modify the pod's mounts, it has to rely on a Device Plugin
or other external entity that can inject the mount point before the pod is created.
However, there is another use case that might be relevant: dynamic attachment of
network interfaces. In this case the CNI cannot work in collaboration with a
Device Plugin or "mount-point injector", and an existing mount point has to be used.
Also, some form of notification mechanism has to exist to tell the workload that a
new socket is ready.
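One possible (purely illustrative) form such a notification could
take: if the new socket is created on the OVS/CNI side of the shared
directory, the workload can simply block until it shows up, e.g. with
inotify-tools. The directory name is a placeholder:
# Wait until something is created in the shared vhost-user directory:
$ inotifywait -e create /var/run/vhu/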
>> 5. kubelet will finish all other preparations and resource
>> allocations and will ask docker/podman to start a container
>> with all mount points, devices and environment variables from
>> the pod annotation.
>>
>> 6. docker/podman starts a container.
>> Need to mention here that in many cases initial process of
>> a container is not the actual application that will use a
>> vhost-user connection, but likely a shell that will invoke
>> the actual application.
>>
>> 7. Application starts inside the container, checks the environment
>> variables (actually, checking of environment variables usually
>> happens in a shell script that invokes the application with
>> correct arguments) and creates a net_virtio_user port in server
>> mode. At this point socket file will be created.
>> (since we're running third-party application inside the container
>> we can only assume that it will do what is written here, it's
>> a responsibility of an application developer to do the right
>> thing.)
>>
>> 8. OVS successfully re-connects to the newly created socket in a
>> shared directory and vhost-user protocol establishes the network
>> connectivity.
>>
>> As you can wee, there are way too many entities and communication
>> methods involved. So, passing a pre-opened file descriptor from
>> CNI all the way down to application is not that easy as it is in
>> case of QEMU+LibVirt.
>
> File descriptor passing isn't necessary if OVS owns the listen socket
> and the application container is the one who connects. That's why I
> asked why dpdkvhostuser was deprecated in another email. The benefit of
> doing this would be that the application container can instantly connect
> to OVS without a sleep loop.
>
> I still don't get the attraction of the broker idea. The pros:
> + Overcomes the issue with stale UNIX domain socket files
> + Eliminates the re-connect sleep loop
>
> Neutral:
> * vhost-user UNIX domain socket directory container volume is replaced
> by broker UNIX domain socket bind mount
> * UNIX domain socket naming conflicts become broker key naming conflicts
>
> The cons:
> - Requires running a new service on the host with potential security
> issues
> - Requires support in third-party applications, QEMU, and DPDK/OVS
> - The old code must be kept for compatibility with non-broker
> configurations, especially since third-party applications may not
> support the broker. Developers and users will have to learn about both
> options and decide which one to use.
>
> This seems like a modest improvement for the complexity and effort
> involved. The same pros can be achieved by:
> * Adding unlink(2) to rte_vhost (or applications can add rm -f
> $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
> it doesn't catch a misconfiguration where the user launches two
> processes with the same socket path.
> * Reversing the direction of the client/server relationship to
> eliminate the re-connect sleep loop at startup. I'm unsure whether
> this is possible.
>
> That said, the broker idea doesn't affect the vhost-user protocol itself
> and is more of an OVS/DPDK topic. I may just not be familiar enough with
> OVS/DPDK to understand the benefits of the approach.
>
> Stefan
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-23 17:57 ` Adrian Moreno
@ 2021-03-23 18:27 ` Ilya Maximets
2021-03-23 20:54 ` Billy McFall
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-23 18:27 UTC (permalink / raw)
To: Adrian Moreno, Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Julia Suvorova,
Marc-André Lureau, Daniel Berrange, Billy McFall
On 3/23/21 6:57 PM, Adrian Moreno wrote:
>
>
> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>> And some housekeeping usually required for applications in case the
>>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>>> system:
>>>>>>> "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>>
>>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>> solved with a pidfile.
>>>>>
>>>>> How exactly this could be solved with a pidfile?
>>>>
>>>> A pidfile prevents two instances of the same service from running at the
>>>> same time.
>>>>
>>>> The same effect can be achieved by the container orchestrator, systemd,
>>>> etc too because it refuses to run the same service twice.
>>>
>>> Sure. I understand that. My point was that these could be 2 different
>>> applications and they might not know which process to look for.
>>>
>>>>
>>>>> And what if this is
>>>>> a different application that tries to create a socket on a same path?
>>>>> e.g. QEMU creates a socket (started in a server mode) and user
>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
>>>>> to an existing socket file and will fail. Subsequently port creation
>>>>> in OVS will fail. We can't allow OVS to unlink files because this
>>>>> way OVS users will have ability to unlink random sockets that OVS has
>>>>> access to and we also has no idea if it's a QEMU that created a file
>>>>> or it was a virtio-user application or someone else.
>>>>
>>>> If rte_vhost unlinks the socket then the user will find that networking
>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
>>>> or restart QEMU, depending on whether they need to keep the guest
>>>> running or not. This is a misconfiguration that is recoverable.
>>>
>>> True, it's recoverable, but with a high cost. Restart of a VM is rarely
>>> desirable. And the application inside the guest might not feel itself
>>> well after hot re-plug of a device that it actively used. I'd expect
>>> a DPDK application that runs inside a guest on some virtio-net device
>>> to crash after this kind of manipulations. Especially, if it uses some
>>> older versions of DPDK.
>>
>> This unlink issue is probably something we think differently about.
>> There are many ways for users to misconfigure things when working with
>> system tools. If it's possible to catch misconfigurations that is
>> preferrable. In this case it's just the way pathname AF_UNIX domain
>> sockets work and IMO it's better not to have problems starting the
>> service due to stale files than to insist on preventing
>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>> successful, so ¯\_(ツ)_/¯.
>>
>>>>
>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>> create a security issue. I don't know the security model of OVS.
>>>
>>> In general privileges of a ovs-vswitchd daemon might be completely
>>> different from privileges required to invoke control utilities or
>>> to access the configuration database. SO, yes, we should not allow
>>> that.
>>
>> That can be locked down by restricting the socket path to a file beneath
>> /var/run/ovs/vhost-user/.
>>
>>>>
>>>>> There are, probably, ways to detect if there is any alive process that
>>>>> has this socket open, but that sounds like too much for this purpose,
>>>>> also I'm not sure if it's possible if actual user is in a different
>>>>> container.
>>>>> So I don't see a good reliable way to detect these conditions. This
>>>>> falls on shoulders of a higher level management software or a user to
>>>>> clean these socket files up before adding ports.
>>>>
>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>> disappears when there is no process listening anymore.
>>>
>>> OVS is usually started right on the host in a main network namespace.
>>> In case it's started in a pod, it will run in a separate container but
>>> configured with a host network. Applications almost exclusively runs
>>> in separate pods.
>>
>> Okay.
>>
>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>>
>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>> closely managed by one piece of software:
>>>>>>
>>>>>> 1. Unlink the socket path.
>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>> RPC, spawn a process, etc) and pass in the listen fd.
>>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>>>> As soon as the vhost-user device backend calls accept(2) the
>>>>>> connection will proceed (there is no need for sleeping).
>>>>>>
>>>>>> This approach works across containers without a broker.
>>>>>
>>>>> Not sure if I fully understood a question here, but anyway.
>>>>>
>>>>> This approach works fine if you know what application to run.
>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>> with virtio-user ports running inside a container and want to
>>>>> have a network connection. Also, this application needs to run
>>>>> virtio-user in server mode, otherwise restart of the OVS will
>>>>> require restart of the application. So, you basically need to
>>>>> rely on a third-party application to create a socket with a right
>>>>> name and in a correct location that is shared with a host, so
>>>>> OVS can find it and connect.
>>>>>
>>>>> In a VM world everything is much more simple, since you have
>>>>> a libvirt and QEMU that will take care of all of these stuff
>>>>> and which are also under full control of management software
>>>>> and a system administrator.
>>>>> In case of a container with a "random" DPDK application inside
>>>>> there is no such entity that can help. Of course, some solution
>>>>> might be implemented in docker/podman daemon to create and manage
>>>>> outside-looking sockets for an application inside the container,
>>>>> but that is not available today AFAIK and I'm not sure if it
>>>>> ever will.
>>>>
>>>> Wait, when you say there is no entity like management software or a
>>>> system administrator, then how does OVS know to instantiate the new
>>>> port? I guess something still needs to invoke ovs-ctl add-port?
>>>
>>> I didn't mean that there is no any application that configures
>>> everything. Of course, there is. I mean that there is no such
>>> entity that abstracts all that socket machinery from the user's
>>> application that runs inside the container. QEMU hides all the
>>> details of the connection to vhost backend and presents the device
>>> as a PCI device with a network interface wrapping from the guest
>>> kernel. So, the application inside VM shouldn't care what actually
>>> there is a socket connected to OVS that implements backend and
>>> forward traffic somewhere. For the application it's just a usual
>>> network interface.
>>> But in case of a container world, application should handle all
>>> that by creating a virtio-user device that will connect to some
>>> socket, that has an OVS on the other side.
>>>
>>>>
>>>> Can you describe the steps used today (without the broker) for
>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>> Although my interest is in the vhost-user protocol I think it's
>>>> necessary to understand the OVS requirements here and I know little
>>>> about them.
>>>>> I might describe some things wrong since I worked with k8s and CNI
>>> plugins last time ~1.5 years ago, but the basic schema will look
>>> something like this:
>>>
>>> 1. user decides to start a new pod and requests k8s to do that
>>> via cmdline tools or some API calls.
>>>
>>> 2. k8s scheduler looks for available resources asking resource
>>> manager plugins, finds an appropriate physical host and asks
>>> local to that node kubelet daemon to launch a new pod there.
>>>
>
> When the CNI is called, the pod has already been created, i.e: a PodID exists
> and so does an associated network namespace. Therefore, everything that has to
> do with the runtime spec such as mountpoints or devices cannot be modified by
> this time.
>
> That's why the Device Plugin API is used to modify the Pod's spec before the CNI
> chain is called.
>
>>> 3. kubelet asks local CNI plugin to allocate network resources
>>> and annotate the pod with required mount points, devices that
>>> needs to be passed in and environment variables.
>>> (this is, IIRC, a gRPC connection. It might be a multus-cni
>>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
>>> usually deployed as a system DaemonSet, so it runs in a
>>> separate pod.
>>>
>>> 4. Assuming that vhost-user connection requested in server mode.
>>> CNI plugin will:
>>> 4.1 create a directory for a vhost-user socket.
>>> 4.2 add this directory to pod annotations as a mount point.
>
> I believe this is not possible, it would have to inspect the pod's spec or
> otherwise determine an existing mount point where the socket should be created.
Uff. Yes, you're right. Thanks for your clarification.
I mixed up CNI and Device Plugin here.
CNI itself is not able to annotate new resources to the pod, i.e.
create new mounts or something like this. And I don't recall any
vhost-user device plugins. Are there any? There is an SR-IOV device
plugin, but its purpose is to allocate and pass PCI devices, not to
create mounts for vhost-user.
So, IIUC, right now the user must create the directory and specify
a mount point in the pod spec file, or pass the whole /var/run/openvswitch
or something like this, right?
Looking at userspace-cni-network-plugin, it actually just parses
annotations to find the shared directory and fails if there is
none:
https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
And the examples suggest specifying a directory to mount:
https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
It looks like this is done by hand.
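For what it's worth, a minimal sketch of doing that by hand (podman is
just an example runtime, the image name is a placeholder, and other
resources a DPDK application needs, such as hugepages, are omitted):
# Share the whole OVS run directory with the container:
$ podman run -v /var/run/openvswitch:/var/run/openvswitch \
      my-dpdk-app-image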
>
> +Billy might give more insights on this
>
>>> 4.3 create a port in OVS by invoking 'ovs-vsctl port-add' or
>>> by connecting to ovsdb-server by JSONRPC directly.
>>> It will set port type as dpdkvhostuserclient and specify
>>> socket-path as a path inside the directory it created.
>>> (OVS will create a port and rte_vhost will enter the
>>> re-connection loop since socket does not exist yet.)
>>> 4.4 Set up socket file location as environment variable in
>>> pod annotations.
>>> 4.5 report success to kubelet.
>>>
>
> Since the CNI cannot modify the pod's mounts it has to rely on a Device Plugin
> or other external entity that can inject the mount point before the pod is created.
>
> However, there is another usecase that might be relevant: dynamic attachment of
> network interfaces. In this case the CNI cannot work in collaboration with a
> Device Plugin or "mount-point injector" and an existing mount point has to be used.
> Also, some form of notification mechanism has to exist to tell the workload a
> new socket is ready.
>
>>> 5. kubelet will finish all other preparations and resource
>>> allocations and will ask docker/podman to start a container
>>> with all mount points, devices and environment variables from
>>> the pod annotation.
>>>
>>> 6. docker/podman starts a container.
>>> Need to mention here that in many cases initial process of
>>> a container is not the actual application that will use a
>>> vhost-user connection, but likely a shell that will invoke
>>> the actual application.
>>>
>>> 7. Application starts inside the container, checks the environment
>>> variables (actually, checking of environment variables usually
>>> happens in a shell script that invokes the application with
>>> correct arguments) and creates a net_virtio_user port in server
>>> mode. At this point socket file will be created.
>>> (since we're running third-party application inside the container
>>> we can only assume that it will do what is written here, it's
>>> a responsibility of an application developer to do the right
>>> thing.)
>>>
>>> 8. OVS successfully re-connects to the newly created socket in a
>>> shared directory and vhost-user protocol establishes the network
>>> connectivity.
>>>
>>> As you can wee, there are way too many entities and communication
>>> methods involved. So, passing a pre-opened file descriptor from
>>> CNI all the way down to application is not that easy as it is in
>>> case of QEMU+LibVirt.
>>
>> File descriptor passing isn't necessary if OVS owns the listen socket
>> and the application container is the one who connects. That's why I
>> asked why dpdkvhostuser was deprecated in another email. The benefit of
>> doing this would be that the application container can instantly connect
>> to OVS without a sleep loop.
>>
>> I still don't get the attraction of the broker idea. The pros:
>> + Overcomes the issue with stale UNIX domain socket files
>> + Eliminates the re-connect sleep loop
>>
>> Neutral:
>> * vhost-user UNIX domain socket directory container volume is replaced
>> by broker UNIX domain socket bind mount
>> * UNIX domain socket naming conflicts become broker key naming conflicts
>>
>> The cons:
>> - Requires running a new service on the host with potential security
>> issues
>> - Requires support in third-party applications, QEMU, and DPDK/OVS
>> - The old code must be kept for compatibility with non-broker
>> configurations, especially since third-party applications may not
>> support the broker. Developers and users will have to learn about both
>> options and decide which one to use.
>>
>> This seems like a modest improvement for the complexity and effort
>> involved. The same pros can be achieved by:
>> * Adding unlink(2) to rte_vhost (or applications can add rm -f
>> $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
>> it doesn't catch a misconfiguration where the user launches two
>> processes with the same socket path.
>> * Reversing the direction of the client/server relationship to
>> eliminate the re-connect sleep loop at startup. I'm unsure whether
>> this is possible.
>>
>> That said, the broker idea doesn't affect the vhost-user protocol itself
>> and is more of an OVS/DPDK topic. I may just not be familiar enough with
>> OVS/DPDK to understand the benefits of the approach.
>>
>> Stefan
>>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-23 18:27 ` Ilya Maximets
@ 2021-03-23 20:54 ` Billy McFall
2021-03-24 12:05 ` Stefan Hajnoczi
0 siblings, 1 reply; 38+ messages in thread
From: Billy McFall @ 2021-03-23 20:54 UTC (permalink / raw)
To: Ilya Maximets
Cc: Adrian Moreno, Stefan Hajnoczi, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >
> >
> > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>> system:
> >>>>>>> "failed to bind to vhu: Address already in use; remove it and try
> again"
> >>>>>>
> >>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> users
> >>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>> solved with a pidfile.
> >>>>>
> >>>>> How exactly this could be solved with a pidfile?
> >>>>
> >>>> A pidfile prevents two instances of the same service from running at
> the
> >>>> same time.
> >>>>
> >>>> The same effect can be achieved by the container orchestrator,
> systemd,
> >>>> etc too because it refuses to run the same service twice.
> >>>
> >>> Sure. I understand that. My point was that these could be 2 different
> >>> applications and they might not know which process to look for.
> >>>
> >>>>
> >>>>> And what if this is
> >>>>> a different application that tries to create a socket on a same path?
> >>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
> >>>>> to an existing socket file and will fail. Subsequently port creation
> >>>>> in OVS will fail. We can't allow OVS to unlink files because this
> >>>>> way OVS users will have ability to unlink random sockets that OVS has
> >>>>> access to and we also has no idea if it's a QEMU that created a file
> >>>>> or it was a virtio-user application or someone else.
> >>>>
> >>>> If rte_vhost unlinks the socket then the user will find that
> networking
> >>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> device
> >>>> or restart QEMU, depending on whether they need to keep the guest
> >>>> running or not. This is a misconfiguration that is recoverable.
> >>>
> >>> True, it's recoverable, but with a high cost. Restart of a VM is
> rarely
> >>> desirable. And the application inside the guest might not feel itself
> >>> well after hot re-plug of a device that it actively used. I'd expect
> >>> a DPDK application that runs inside a guest on some virtio-net device
> >>> to crash after this kind of manipulations. Especially, if it uses some
> >>> older versions of DPDK.
> >>
> >> This unlink issue is probably something we think differently about.
> >> There are many ways for users to misconfigure things when working with
> >> system tools. If it's possible to catch misconfigurations that is
> >> preferrable. In this case it's just the way pathname AF_UNIX domain
> >> sockets work and IMO it's better not to have problems starting the
> >> service due to stale files than to insist on preventing
> >> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >> successful, so ¯\_(ツ)_/¯.
> >>
> >>>>
> >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>> create a security issue. I don't know the security model of OVS.
> >>>
> >>> In general privileges of a ovs-vswitchd daemon might be completely
> >>> different from privileges required to invoke control utilities or
> >>> to access the configuration database. SO, yes, we should not allow
> >>> that.
> >>
> >> That can be locked down by restricting the socket path to a file beneath
> >> /var/run/ovs/vhost-user/.
> >>
> >>>>
> >>>>> There are, probably, ways to detect if there is any alive process
> that
> >>>>> has this socket open, but that sounds like too much for this purpose,
> >>>>> also I'm not sure if it's possible if actual user is in a different
> >>>>> container.
> >>>>> So I don't see a good reliable way to detect these conditions. This
> >>>>> falls on shoulders of a higher level management software or a user to
> >>>>> clean these socket files up before adding ports.
> >>>>
> >>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>> application? If yes, then abstract AF_UNIX sockets can be used.
> Abstract
> >>>> AF_UNIX sockets don't have a filesystem path and the socket address
> >>>> disappears when there is no process listening anymore.
> >>>
> >>> OVS is usually started right on the host in a main network namespace.
> >>> In case it's started in a pod, it will run in a separate container but
> >>> configured with a host network. Applications almost exclusively runs
> >>> in separate pods.
> >>
> >> Okay.
> >>
> >>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>> leveraging an infrastructure service provided by a SocketPair
> Broker.
> >>>>>>
> >>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>> closely managed by one piece of software:
> >>>>>>
> >>>>>> 1. Unlink the socket path.
> >>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>> RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>> 4. In the meantime the VMM can open the socket path and call
> connect(2).
> >>>>>> As soon as the vhost-user device backend calls accept(2) the
> >>>>>> connection will proceed (there is no need for sleeping).
> >>>>>>
> >>>>>> This approach works across containers without a broker.
> >>>>>
> >>>>> Not sure if I fully understood a question here, but anyway.
> >>>>>
> >>>>> This approach works fine if you know what application to run.
> >>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>> with virtio-user ports running inside a container and want to
> >>>>> have a network connection. Also, this application needs to run
> >>>>> virtio-user in server mode, otherwise restart of the OVS will
> >>>>> require restart of the application. So, you basically need to
> >>>>> rely on a third-party application to create a socket with a right
> >>>>> name and in a correct location that is shared with a host, so
> >>>>> OVS can find it and connect.
> >>>>>
> >>>>> In a VM world everything is much more simple, since you have
> >>>>> a libvirt and QEMU that will take care of all of these stuff
> >>>>> and which are also under full control of management software
> >>>>> and a system administrator.
> >>>>> In case of a container with a "random" DPDK application inside
> >>>>> there is no such entity that can help. Of course, some solution
> >>>>> might be implemented in docker/podman daemon to create and manage
> >>>>> outside-looking sockets for an application inside the container,
> >>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>> ever will.
> >>>>
> >>>> Wait, when you say there is no entity like management software or a
> >>>> system administrator, then how does OVS know to instantiate the new
> >>>> port? I guess something still needs to invoke ovs-ctl add-port?
> >>>
> >>> I didn't mean that there is no any application that configures
> >>> everything. Of course, there is. I mean that there is no such
> >>> entity that abstracts all that socket machinery from the user's
> >>> application that runs inside the container. QEMU hides all the
> >>> details of the connection to vhost backend and presents the device
> >>> as a PCI device with a network interface wrapping from the guest
> >>> kernel. So, the application inside VM shouldn't care what actually
> >>> there is a socket connected to OVS that implements backend and
> >>> forward traffic somewhere. For the application it's just a usual
> >>> network interface.
> >>> But in case of a container world, application should handle all
> >>> that by creating a virtio-user device that will connect to some
> >>> socket, that has an OVS on the other side.
> >>>
> >>>>
> >>>> Can you describe the steps used today (without the broker) for
> >>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>> Although my interest is in the vhost-user protocol I think it's
> >>>> necessary to understand the OVS requirements here and I know little
> >>>> about them.
> >>>>> I might describe some things wrong since I worked with k8s and CNI
> >>> plugins last time ~1.5 years ago, but the basic schema will look
> >>> something like this:
> >>>
> >>> 1. user decides to start a new pod and requests k8s to do that
> >>> via cmdline tools or some API calls.
> >>>
> >>> 2. k8s scheduler looks for available resources asking resource
> >>> manager plugins, finds an appropriate physical host and asks
> >>> local to that node kubelet daemon to launch a new pod there.
> >>>
> >
> > When the CNI is called, the pod has already been created, i.e: a PodID
> exists
> > and so does an associated network namespace. Therefore, everything that
> has to
> > do with the runtime spec such as mountpoints or devices cannot be
> modified by
> > this time.
> >
> > That's why the Device Plugin API is used to modify the Pod's spec before
> the CNI
> > chain is called.
> >
> >>> 3. kubelet asks local CNI plugin to allocate network resources
> >>> and annotate the pod with required mount points, devices that
> >>> needs to be passed in and environment variables.
> >>> (this is, IIRC, a gRPC connection. It might be a multus-cni
> >>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
> >>> usually deployed as a system DaemonSet, so it runs in a
> >>> separate pod.
> >>>
> >>> 4. Assuming that vhost-user connection requested in server mode.
> >>> CNI plugin will:
> >>> 4.1 create a directory for a vhost-user socket.
> >>> 4.2 add this directory to pod annotations as a mount point.
> >
> > I believe this is not possible, it would have to inspect the pod's spec
> or
> > otherwise determine an existing mount point where the socket should be
> created.
>
> Uff. Yes, you're right. Thanks for your clarification.
> I mixed up CNI and Device Plugin here.
>
> CNI itself is not able to annotate new resources to the pod, i.e.
> create new mounts or something like this. And I don't recall any
> vhost-user device plugins. Is there any? There is an SR-IOV device
> plugin, but its purpose is to allocate and pass PCI devices, not create
> mounts for vhost-user.
>
> So, IIUC, right now user must create the directory and specify
> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> or something like this, right?
>
> Looking at userspace-cni-network-plugin, it actually just parses
> annotations to find the shared directory and fails if there is
> no any:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>
> And examples suggests to specify a directory to mount:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>
> Looks like this is done by user's hands.
>
Yes, I am one of the primary authors of Userspace CNI. Currently, the
directory is created by hand. The long-term thought was to have a mutating
webhook/admission controller inject a directory into the pod spec. Not sure
if it has changed, but I think when I was originally doing this work, OVS
only let you choose the directory at install time, so it had to be
something like /var/run/openvswitch/. You could choose the socket file name
and maybe a subdirectory off the main directory, but not the full path.
One of the issues I was trying to solve was making sure ContainerA couldn't
see ContainerB's socket files. That's where the admission controller could
create a unique subdirectory for each container under
/var/run/openvswitch/. But this was more of a PoC CNI, and other work items
always took precedence, so that work was never completed.
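For reference, the OVS knob in question is other_config:vhost-sock-dir,
which has to point at a subdirectory of the OVS run directory. A hedged
sketch (the subdirectory name is arbitrary and changing the setting may
require restarting ovs-vswitchd):
# Put vhost-user (server mode) sockets into a per-container
# subdirectory of the OVS run directory:
$ ovs-vsctl --no-wait set Open_vSwitch . \
      other_config:vhost-sock-dir=container-a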
Billy
>
> > +Billy might give more insights on this
> >
> >>> 4.3 create a port in OVS by invoking 'ovs-vsctl port-add' or
> >>> by connecting to ovsdb-server by JSONRPC directly.
> >>> It will set port type as dpdkvhostuserclient and specify
> >>> socket-path as a path inside the directory it created.
> >>> (OVS will create a port and rte_vhost will enter the
> >>> re-connection loop since socket does not exist yet.)
> >>> 4.4 Set up socket file location as environment variable in
> >>> pod annotations.
> >>> 4.5 report success to kubelet.
> >>>
> >
> > Since the CNI cannot modify the pod's mounts it has to rely on a Device
> Plugin
> > or other external entity that can inject the mount point before the pod
> is created.
> >
> > However, there is another usecase that might be relevant: dynamic
> attachment of
> > network interfaces. In this case the CNI cannot work in collaboration
> with a
> > Device Plugin or "mount-point injector" and an existing mount point has
> to be used.
> > Also, some form of notification mechanism has to exist to tell the
> workload a
> > new socket is ready.
> >
> >>> 5. kubelet will finish all other preparations and resource
> >>> allocations and will ask docker/podman to start a container
> >>> with all mount points, devices and environment variables from
> >>> the pod annotation.
> >>>
> >>> 6. docker/podman starts a container.
> >>> Need to mention here that in many cases initial process of
> >>> a container is not the actual application that will use a
> >>> vhost-user connection, but likely a shell that will invoke
> >>> the actual application.
> >>>
> >>> 7. Application starts inside the container, checks the environment
> >>> variables (actually, checking of environment variables usually
> >>> happens in a shell script that invokes the application with
> >>> correct arguments) and creates a net_virtio_user port in server
> >>> mode. At this point socket file will be created.
> >>> (since we're running third-party application inside the container
> >>> we can only assume that it will do what is written here, it's
> >>> a responsibility of an application developer to do the right
> >>> thing.)
> >>>
> >>> 8. OVS successfully re-connects to the newly created socket in a
> >>> shared directory and vhost-user protocol establishes the network
> >>> connectivity.
> >>>
> >>> As you can wee, there are way too many entities and communication
> >>> methods involved. So, passing a pre-opened file descriptor from
> >>> CNI all the way down to application is not that easy as it is in
> >>> case of QEMU+LibVirt.
> >>
> >> File descriptor passing isn't necessary if OVS owns the listen socket
> >> and the application container is the one who connects. That's why I
> >> asked why dpdkvhostuser was deprecated in another email. The benefit of
> >> doing this would be that the application container can instantly connect
> >> to OVS without a sleep loop.
> >>
> >> I still don't get the attraction of the broker idea. The pros:
> >> + Overcomes the issue with stale UNIX domain socket files
> >> + Eliminates the re-connect sleep loop
> >>
> >> Neutral:
> >> * vhost-user UNIX domain socket directory container volume is replaced
> >> by broker UNIX domain socket bind mount
> >> * UNIX domain socket naming conflicts become broker key naming conflicts
> >>
> >> The cons:
> >> - Requires running a new service on the host with potential security
> >> issues
> >> - Requires support in third-party applications, QEMU, and DPDK/OVS
> >> - The old code must be kept for compatibility with non-broker
> >> configurations, especially since third-party applications may not
> >> support the broker. Developers and users will have to learn about both
> >> options and decide which one to use.
> >>
> >> This seems like a modest improvement for the complexity and effort
> >> involved. The same pros can be achieved by:
> >> * Adding unlink(2) to rte_vhost (or applications can add rm -f
> >> $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
> >> it doesn't catch a misconfiguration where the user launches two
> >> processes with the same socket path.
> >> * Reversing the direction of the client/server relationship to
> >> eliminate the re-connect sleep loop at startup. I'm unsure whether
> >> this is possible.
> >>
> >> That said, the broker idea doesn't affect the vhost-user protocol itself
> >> and is more of an OVS/DPDK topic. I may just not be familiar enough with
> >> OVS/DPDK to understand the benefits of the approach.
> >>
> >> Stefan
> >>
> >
>
>
--
*Billy McFall*
Networking Group
CTO Office
*Red Hat*
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-23 20:54 ` Billy McFall
@ 2021-03-24 12:05 ` Stefan Hajnoczi
2021-03-24 13:11 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-24 12:05 UTC (permalink / raw)
To: Billy McFall
Cc: Ilya Maximets, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> > On 3/23/21 6:57 PM, Adrian Moreno wrote:
> > >
> > >
> > > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> > >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> > >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> > >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > >>>>>>> And some housekeeping usually required for applications in case the
> > >>>>>>> socket server terminated abnormally and socket files left on a file
> > >>>>>>> system:
> > >>>>>>> "failed to bind to vhu: Address already in use; remove it and try
> > again"
> > >>>>>>
> > >>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> > users
> > >>>>>> might accidentally hijack an existing listen socket, but that can be
> > >>>>>> solved with a pidfile.
> > >>>>>
> > >>>>> How exactly this could be solved with a pidfile?
> > >>>>
> > >>>> A pidfile prevents two instances of the same service from running at
> > the
> > >>>> same time.
> > >>>>
> > >>>> The same effect can be achieved by the container orchestrator,
> > systemd,
> > >>>> etc too because it refuses to run the same service twice.
> > >>>
> > >>> Sure. I understand that. My point was that these could be 2 different
> > >>> applications and they might not know which process to look for.
> > >>>
> > >>>>
> > >>>>> And what if this is
> > >>>>> a different application that tries to create a socket on a same path?
> > >>>>> e.g. QEMU creates a socket (started in a server mode) and user
> > >>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> > >>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
> > >>>>> to an existing socket file and will fail. Subsequently port creation
> > >>>>> in OVS will fail. We can't allow OVS to unlink files because this
> > >>>>> way OVS users will have ability to unlink random sockets that OVS has
> > >>>>> access to and we also has no idea if it's a QEMU that created a file
> > >>>>> or it was a virtio-user application or someone else.
> > >>>>
> > >>>> If rte_vhost unlinks the socket then the user will find that
> > networking
> > >>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> > device
> > >>>> or restart QEMU, depending on whether they need to keep the guest
> > >>>> running or not. This is a misconfiguration that is recoverable.
> > >>>
> > >>> True, it's recoverable, but with a high cost. Restart of a VM is
> > rarely
> > >>> desirable. And the application inside the guest might not feel itself
> > >>> well after hot re-plug of a device that it actively used. I'd expect
> > >>> a DPDK application that runs inside a guest on some virtio-net device
> > >>> to crash after this kind of manipulations. Especially, if it uses some
> > >>> older versions of DPDK.
> > >>
> > >> This unlink issue is probably something we think differently about.
> > >> There are many ways for users to misconfigure things when working with
> > >> system tools. If it's possible to catch misconfigurations that is
> > >> preferrable. In this case it's just the way pathname AF_UNIX domain
> > >> sockets work and IMO it's better not to have problems starting the
> > >> service due to stale files than to insist on preventing
> > >> misconfigurations. QEMU and DPDK do this differently and both seem to be
> > >> successful, so ¯\_(ツ)_/¯.
> > >>
> > >>>>
> > >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> > >>>> create a security issue. I don't know the security model of OVS.
> > >>>
> > >>> In general privileges of a ovs-vswitchd daemon might be completely
> > >>> different from privileges required to invoke control utilities or
> > >>> to access the configuration database. SO, yes, we should not allow
> > >>> that.
> > >>
> > >> That can be locked down by restricting the socket path to a file beneath
> > >> /var/run/ovs/vhost-user/.
> > >>
> > >>>>
> > >>>>> There are, probably, ways to detect if there is any alive process
> > that
> > >>>>> has this socket open, but that sounds like too much for this purpose,
> > >>>>> also I'm not sure if it's possible if actual user is in a different
> > >>>>> container.
> > >>>>> So I don't see a good reliable way to detect these conditions. This
> > >>>>> falls on shoulders of a higher level management software or a user to
> > >>>>> clean these socket files up before adding ports.
> > >>>>
> > >>>> Does OVS always run in the same net namespace (pod) as the DPDK
> > >>>> application? If yes, then abstract AF_UNIX sockets can be used.
> > Abstract
> > >>>> AF_UNIX sockets don't have a filesystem path and the socket address
> > >>>> disappears when there is no process listening anymore.
> > >>>
> > >>> OVS is usually started right on the host in a main network namespace.
> > >>> In case it's started in a pod, it will run in a separate container but
> > >>> configured with a host network. Applications almost exclusively runs
> > >>> in separate pods.
> > >>
> > >> Okay.
> > >>
> > >>>>>>> This patch-set aims to eliminate most of the inconveniences by
> > >>>>>>> leveraging an infrastructure service provided by a SocketPair
> > Broker.
> > >>>>>>
> > >>>>>> I don't understand yet why this is useful for vhost-user, where the
> > >>>>>> creation of the vhost-user device backend and its use by a VMM are
> > >>>>>> closely managed by one piece of software:
> > >>>>>>
> > >>>>>> 1. Unlink the socket path.
> > >>>>>> 2. Create, bind, and listen on the socket path.
> > >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> > >>>>>> RPC, spawn a process, etc) and pass in the listen fd.
> > >>>>>> 4. In the meantime the VMM can open the socket path and call
> > connect(2).
> > >>>>>> As soon as the vhost-user device backend calls accept(2) the
> > >>>>>> connection will proceed (there is no need for sleeping).
> > >>>>>>
> > >>>>>> This approach works across containers without a broker.
> > >>>>>
> > >>>>> Not sure if I fully understood a question here, but anyway.
> > >>>>>
> > >>>>> This approach works fine if you know what application to run.
> > >>>>> In case of a k8s cluster, it might be a random DPDK application
> > >>>>> with virtio-user ports running inside a container and want to
> > >>>>> have a network connection. Also, this application needs to run
> > >>>>> virtio-user in server mode, otherwise restart of the OVS will
> > >>>>> require restart of the application. So, you basically need to
> > >>>>> rely on a third-party application to create a socket with a right
> > >>>>> name and in a correct location that is shared with a host, so
> > >>>>> OVS can find it and connect.
> > >>>>>
> > >>>>> In a VM world everything is much more simple, since you have
> > >>>>> a libvirt and QEMU that will take care of all of these stuff
> > >>>>> and which are also under full control of management software
> > >>>>> and a system administrator.
> > >>>>> In case of a container with a "random" DPDK application inside
> > >>>>> there is no such entity that can help. Of course, some solution
> > >>>>> might be implemented in docker/podman daemon to create and manage
> > >>>>> outside-looking sockets for an application inside the container,
> > >>>>> but that is not available today AFAIK and I'm not sure if it
> > >>>>> ever will.
> > >>>>
> > >>>> Wait, when you say there is no entity like management software or a
> > >>>> system administrator, then how does OVS know to instantiate the new
> > >>>> port? I guess something still needs to invoke ovs-ctl add-port?
> > >>>
> > >>> I didn't mean that there is no any application that configures
> > >>> everything. Of course, there is. I mean that there is no such
> > >>> entity that abstracts all that socket machinery from the user's
> > >>> application that runs inside the container. QEMU hides all the
> > >>> details of the connection to vhost backend and presents the device
> > >>> as a PCI device with a network interface wrapping from the guest
> > >>> kernel. So, the application inside VM shouldn't care what actually
> > >>> there is a socket connected to OVS that implements backend and
> > >>> forward traffic somewhere. For the application it's just a usual
> > >>> network interface.
> > >>> But in case of a container world, application should handle all
> > >>> that by creating a virtio-user device that will connect to some
> > >>> socket, that has an OVS on the other side.
> > >>>
> > >>>>
> > >>>> Can you describe the steps used today (without the broker) for
> > >>>> instantiating a new DPDK app container and connecting it to OVS?
> > >>>> Although my interest is in the vhost-user protocol I think it's
> > >>>> necessary to understand the OVS requirements here and I know little
> > >>>> about them.
> > >>>>> I might describe some things wrong since I worked with k8s and CNI
> > >>> plugins last time ~1.5 years ago, but the basic schema will look
> > >>> something like this:
> > >>>
> > >>> 1. user decides to start a new pod and requests k8s to do that
> > >>> via cmdline tools or some API calls.
> > >>>
> > >>> 2. k8s scheduler looks for available resources asking resource
> > >>> manager plugins, finds an appropriate physical host and asks
> > >>> local to that node kubelet daemon to launch a new pod there.
> > >>>
> > >
> > > When the CNI is called, the pod has already been created, i.e: a PodID
> > exists
> > > and so does an associated network namespace. Therefore, everything that
> > has to
> > > do with the runtime spec such as mountpoints or devices cannot be
> > modified by
> > > this time.
> > >
> > > That's why the Device Plugin API is used to modify the Pod's spec before
> > the CNI
> > > chain is called.
> > >
> > >>> 3. kubelet asks local CNI plugin to allocate network resources
> > >>> and annotate the pod with required mount points, devices that
> > >>> needs to be passed in and environment variables.
> > >>> (this is, IIRC, a gRPC connection. It might be a multus-cni
> > >>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
> > >>> usually deployed as a system DaemonSet, so it runs in a
> > >>> separate pod.
> > >>>
> > >>> 4. Assuming that vhost-user connection requested in server mode.
> > >>> CNI plugin will:
> > >>> 4.1 create a directory for a vhost-user socket.
> > >>> 4.2 add this directory to pod annotations as a mount point.
> > >
> > > I believe this is not possible, it would have to inspect the pod's spec
> > or
> > > otherwise determine an existing mount point where the socket should be
> > created.
> >
> > Uff. Yes, you're right. Thanks for your clarification.
> > I mixed up CNI and Device Plugin here.
> >
> > CNI itself is not able to annotate new resources to the pod, i.e.
> > create new mounts or something like this. And I don't recall any
> > vhost-user device plugins. Is there any? There is an SR-IOV device
> > plugin, but its purpose is to allocate and pass PCI devices, not create
> > mounts for vhost-user.
> >
> > So, IIUC, right now user must create the directory and specify
> > a mount point in a pod spec file or pass the whole /var/run/openvswitch
> > or something like this, right?
> >
> > Looking at userspace-cni-network-plugin, it actually just parses
> > annotations to find the shared directory and fails if there is
> > no any:
> >
> > https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >
> > And examples suggests to specify a directory to mount:
> >
> > https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >
> > Looks like this is done by user's hands.
> >
> > Yes, I am one of the primary authors of Userspace CNI. Currently, the
> directory is by hand. Long term thought was to have a mutating
> webhook/admission controller inject a directory into the podspec. Not sure
> if it has changed, but I think when I was originally doing this work, OvS
> only lets you choose the directory at install time, so it has to be
> something like /var/run/openvswitch/. You can choose the socketfile name
> and maybe a subdirectory off the main directory, but not the full path.
>
> One of the issues I was trying to solve was making sure ContainerA couldn't
> see ContainerB's socketfiles. That's where the admission controller could
> create a unique subdirectory for each container under
> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> always took precedence so that work never completed.
If the CNI plugin has access to the container's network namespace, could
it create an abstract AF_UNIX listen socket?
That way the application inside the container could connect to an
AF_UNIX socket and there is no need to manage container volumes.
I'm not familiar with Open vSwitch internals, so I'm not sure if there
is a sane way of passing the listen socket fd into ovs-vswitchd from the
CNI plugin?
The steps:
1. CNI plugin enters the container's network namespace and opens an
abstract AF_UNIX listen socket (a minimal sketch follows the list).
2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
add-port step. Instead of using type=dpdkvhostuserclient
options:vhost-server-path=/tmp/dpdkvhostclient0, it creates a
dpdkvhostuser server with the listen fd.
3. When the container starts, it connects to the abstract AF_UNIX
socket. The abstract socket name is provided to the container at
startup time in an environment variable. The name is unique, at least
to the pod, so that multiple containers in the pod can run vhost-user
applications.
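To make step 1 concrete, here is a minimal sketch of how the CNI plugin
could open such an abstract AF_UNIX listen socket after entering the
container's network namespace (the function and the name handling are
illustrative only, not an existing API):

#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Open a listening socket in the abstract AF_UNIX namespace.
 * 'name' is the abstract name without the leading NUL byte. */
static int
abstract_listen_socket(const char *name)
{
    struct sockaddr_un addr;
    socklen_t len;
    int fd;

    if (strlen(name) > sizeof(addr.sun_path) - 2)
        return -1;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* A leading NUL byte selects the abstract namespace: no file on
     * disk, and the address vanishes when the last holder exits. */
    strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
    len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);

    if (bind(fd, (struct sockaddr *)&addr, len) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

The fd returned here is what step 2 would hand over to OVS, and no file
cleanup is ever needed because nothing is created on the filesystem.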
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-24 12:05 ` Stefan Hajnoczi
@ 2021-03-24 13:11 ` Ilya Maximets
2021-03-24 15:07 ` Stefan Hajnoczi
2021-03-25 9:35 ` Stefan Hajnoczi
0 siblings, 2 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-24 13:11 UTC (permalink / raw)
To: Stefan Hajnoczi, Billy McFall
Cc: Ilya Maximets, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>
>>>>
>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>>>>> And some housekeeping usually required for applications in case the
>>>>>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>>>>>> system:
>>>>>>>>>> "failed to bind to vhu: Address already in use; remove it and try
>>> again"
>>>>>>>>>
>>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
>>> users
>>>>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>>>>> solved with a pidfile.
>>>>>>>>
>>>>>>>> How exactly this could be solved with a pidfile?
>>>>>>>
>>>>>>> A pidfile prevents two instances of the same service from running at
>>> the
>>>>>>> same time.
>>>>>>>
>>>>>>> The same effect can be achieved by the container orchestrator,
>>> systemd,
>>>>>>> etc too because it refuses to run the same service twice.
>>>>>>
>>>>>> Sure. I understand that. My point was that these could be 2 different
>>>>>> applications and they might not know which process to look for.
>>>>>>
>>>>>>>
>>>>>>>> And what if this is
>>>>>>>> a different application that tries to create a socket on a same path?
>>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
>>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>>>>>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
>>>>>>>> to an existing socket file and will fail. Subsequently port creation
>>>>>>>> in OVS will fail. We can't allow OVS to unlink files because this
>>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
>>>>>>>> access to and we also has no idea if it's a QEMU that created a file
>>>>>>>> or it was a virtio-user application or someone else.
>>>>>>>
>>>>>>> If rte_vhost unlinks the socket then the user will find that
>>> networking
>>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
>>> device
>>>>>>> or restart QEMU, depending on whether they need to keep the guest
>>>>>>> running or not. This is a misconfiguration that is recoverable.
>>>>>>
>>>>>> True, it's recoverable, but with a high cost. Restart of a VM is
>>> rarely
>>>>>> desirable. And the application inside the guest might not feel itself
>>>>>> well after hot re-plug of a device that it actively used. I'd expect
>>>>>> a DPDK application that runs inside a guest on some virtio-net device
>>>>>> to crash after this kind of manipulations. Especially, if it uses some
>>>>>> older versions of DPDK.
>>>>>
>>>>> This unlink issue is probably something we think differently about.
>>>>> There are many ways for users to misconfigure things when working with
>>>>> system tools. If it's possible to catch misconfigurations that is
>>>>> preferrable. In this case it's just the way pathname AF_UNIX domain
>>>>> sockets work and IMO it's better not to have problems starting the
>>>>> service due to stale files than to insist on preventing
>>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>>>>> successful, so ¯\_(ツ)_/¯.
>>>>>
>>>>>>>
>>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>>>>> create a security issue. I don't know the security model of OVS.
>>>>>>
>>>>>> In general privileges of a ovs-vswitchd daemon might be completely
>>>>>> different from privileges required to invoke control utilities or
>>>>>> to access the configuration database. SO, yes, we should not allow
>>>>>> that.
>>>>>
>>>>> That can be locked down by restricting the socket path to a file beneath
>>>>> /var/run/ovs/vhost-user/.
>>>>>
>>>>>>>
>>>>>>>> There are, probably, ways to detect if there is any alive process
>>> that
>>>>>>>> has this socket open, but that sounds like too much for this purpose,
>>>>>>>> also I'm not sure if it's possible if actual user is in a different
>>>>>>>> container.
>>>>>>>> So I don't see a good reliable way to detect these conditions. This
>>>>>>>> falls on shoulders of a higher level management software or a user to
>>>>>>>> clean these socket files up before adding ports.
>>>>>>>
>>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
>>> Abstract
>>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>>>>> disappears when there is no process listening anymore.
>>>>>>
>>>>>> OVS is usually started right on the host in a main network namespace.
>>>>>> In case it's started in a pod, it will run in a separate container but
>>>>>> configured with a host network. Applications almost exclusively runs
>>>>>> in separate pods.
>>>>>
>>>>> Okay.
>>>>>
>>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
>>> Broker.
>>>>>>>>>
>>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>>>>> closely managed by one piece of software:
>>>>>>>>>
>>>>>>>>> 1. Unlink the socket path.
>>>>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>>>> RPC, spawn a process, etc) and pass in the listen fd.
>>>>>>>>> 4. In the meantime the VMM can open the socket path and call
>>> connect(2).
>>>>>>>>> As soon as the vhost-user device backend calls accept(2) the
>>>>>>>>> connection will proceed (there is no need for sleeping).
>>>>>>>>>
>>>>>>>>> This approach works across containers without a broker.
>>>>>>>>
>>>>>>>> Not sure if I fully understood a question here, but anyway.
>>>>>>>>
>>>>>>>> This approach works fine if you know what application to run.
>>>>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>>>>> with virtio-user ports running inside a container and want to
>>>>>>>> have a network connection. Also, this application needs to run
>>>>>>>> virtio-user in server mode, otherwise restart of the OVS will
>>>>>>>> require restart of the application. So, you basically need to
>>>>>>>> rely on a third-party application to create a socket with a right
>>>>>>>> name and in a correct location that is shared with a host, so
>>>>>>>> OVS can find it and connect.
>>>>>>>>
>>>>>>>> In a VM world everything is much more simple, since you have
>>>>>>>> a libvirt and QEMU that will take care of all of these stuff
>>>>>>>> and which are also under full control of management software
>>>>>>>> and a system administrator.
>>>>>>>> In case of a container with a "random" DPDK application inside
>>>>>>>> there is no such entity that can help. Of course, some solution
>>>>>>>> might be implemented in docker/podman daemon to create and manage
>>>>>>>> outside-looking sockets for an application inside the container,
>>>>>>>> but that is not available today AFAIK and I'm not sure if it
>>>>>>>> ever will.
>>>>>>>
>>>>>>> Wait, when you say there is no entity like management software or a
>>>>>>> system administrator, then how does OVS know to instantiate the new
>>>>>>> port? I guess something still needs to invoke ovs-ctl add-port?
>>>>>>
>>>>>> I didn't mean that there is no any application that configures
>>>>>> everything. Of course, there is. I mean that there is no such
>>>>>> entity that abstracts all that socket machinery from the user's
>>>>>> application that runs inside the container. QEMU hides all the
>>>>>> details of the connection to vhost backend and presents the device
>>>>>> as a PCI device with a network interface wrapping from the guest
>>>>>> kernel. So, the application inside VM shouldn't care what actually
>>>>>> there is a socket connected to OVS that implements backend and
>>>>>> forward traffic somewhere. For the application it's just a usual
>>>>>> network interface.
>>>>>> But in case of a container world, application should handle all
>>>>>> that by creating a virtio-user device that will connect to some
>>>>>> socket, that has an OVS on the other side.
>>>>>>
>>>>>>>
>>>>>>> Can you describe the steps used today (without the broker) for
>>>>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>>>>> Although my interest is in the vhost-user protocol I think it's
>>>>>>> necessary to understand the OVS requirements here and I know little
>>>>>>> about them.
>>>>>>>> I might describe some things wrong since I worked with k8s and CNI
>>>>>> plugins last time ~1.5 years ago, but the basic schema will look
>>>>>> something like this:
>>>>>>
>>>>>> 1. user decides to start a new pod and requests k8s to do that
>>>>>> via cmdline tools or some API calls.
>>>>>>
>>>>>> 2. k8s scheduler looks for available resources asking resource
>>>>>> manager plugins, finds an appropriate physical host and asks
>>>>>> local to that node kubelet daemon to launch a new pod there.
>>>>>>
>>>>
>>>> When the CNI is called, the pod has already been created, i.e: a PodID
>>> exists
>>>> and so does an associated network namespace. Therefore, everything that
>>> has to
>>>> do with the runtime spec such as mountpoints or devices cannot be
>>> modified by
>>>> this time.
>>>>
>>>> That's why the Device Plugin API is used to modify the Pod's spec before
>>> the CNI
>>>> chain is called.
>>>>
>>>>>> 3. kubelet asks local CNI plugin to allocate network resources
>>>>>> and annotate the pod with required mount points, devices that
>>>>>> needs to be passed in and environment variables.
>>>>>> (this is, IIRC, a gRPC connection. It might be a multus-cni
>>>>>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
>>>>>> usually deployed as a system DaemonSet, so it runs in a
>>>>>> separate pod.
>>>>>>
>>>>>> 4. Assuming that vhost-user connection requested in server mode.
>>>>>> CNI plugin will:
>>>>>> 4.1 create a directory for a vhost-user socket.
>>>>>> 4.2 add this directory to pod annotations as a mount point.
>>>>
>>>> I believe this is not possible, it would have to inspect the pod's spec
>>> or
>>>> otherwise determine an existing mount point where the socket should be
>>> created.
>>>
>>> Uff. Yes, you're right. Thanks for your clarification.
>>> I mixed up CNI and Device Plugin here.
>>>
>>> CNI itself is not able to annotate new resources to the pod, i.e.
>>> create new mounts or something like this. And I don't recall any
>>> vhost-user device plugins. Is there any? There is an SR-IOV device
>>> plugin, but its purpose is to allocate and pass PCI devices, not create
>>> mounts for vhost-user.
>>>
>>> So, IIUC, right now user must create the directory and specify
>>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
>>> or something like this, right?
>>>
>>> Looking at userspace-cni-network-plugin, it actually just parses
>>> annotations to find the shared directory and fails if there is
>>> no any:
>>>
>>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>>>
>>> And examples suggests to specify a directory to mount:
>>>
>>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>>>
>>> Looks like this is done by user's hands.
>>>
>>> Yes, I am one of the primary authors of Userspace CNI. Currently, the
>> directory is by hand. Long term thought was to have a mutating
>> webhook/admission controller inject a directory into the podspec. Not sure
>> if it has changed, but I think when I was originally doing this work, OvS
>> only lets you choose the directory at install time, so it has to be
>> something like /var/run/openvswitch/. You can choose the socketfile name
>> and maybe a subdirectory off the main directory, but not the full path.
>>
>> One of the issues I was trying to solve was making sure ContainerA couldn't
>> see ContainerB's socketfiles. That's where the admission controller could
>> create a unique subdirectory for each container under
>> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
>> always took precedence so that work never completed.
>
> If the CNI plugin has access to the container's network namespace, could
> it create an abstract AF_UNIX listen socket?
>
> That way the application inside the container could connect to an
> AF_UNIX socket and there is no need to manage container volumes.
>
> I'm not familiar with the Open VSwitch, so I'm not sure if there is a
> sane way of passing the listen socket fd into ovswitchd from the CNI
> plugin?
>
> The steps:
> 1. CNI plugin enters container's network namespace and opens an abstract
> AF_UNIX listen socket.
> 2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
> add-port step. Instead of using type=dpdkvhostuserclient
> options:vhost-server-path=/tmp/dpdkvhostclient0 it instead create a
> dpdkvhostuser server with the listen fd.
For this step you will need a side channel, i.e. a separate unix socket
created by ovs-vswitchd (most likely created by rte_vhost on the
rte_vhost_driver_register() call).
The problem is that ovs-vsctl talks to ovsdb-server and adds the new
port as just a new row in the 'interface' table of the database.
ovs-vswitchd receives the update from the database and creates the
actual port. All of this communication goes over JSON-RPC, so passing
fds is not an option.
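For illustration of what such a side channel would have to do: a file
descriptor can only be handed over as SCM_RIGHTS ancillary data on an
AF_UNIX socket, which JSON-RPC cannot carry. A minimal sketch, with
purely illustrative names rather than any existing OVS or rte_vhost API:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send 'listen_fd' to the peer on 'channel_fd' (an AF_UNIX socket)
 * as SCM_RIGHTS ancillary data. */
static int
send_listen_fd(int channel_fd, int listen_fd)
{
    char dummy = 'F';
    struct iovec iov = { .iov_base = &dummy, .iov_len = sizeof(dummy) };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

    return sendmsg(channel_fd, &msg, 0) < 0 ? -1 : 0;
}

The receiving side would use recvmsg() and extract the fd from the
matching SCM_RIGHTS control message.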
> 3. When the container starts, it connects to the abstract AF_UNIX
> socket. The abstract socket name is provided to the container at
> startup time in an environment variable. The name is unique, at least
> to the pod, so that multiple containers in the pod can run vhost-user
> applications.
A few more problems with this solution:
- We still want to run the application inside the container in server
mode, because the virtio-user PMD in client mode doesn't support
re-connection.
- How do we get this fd again after an OVS restart? CNI will not be
invoked at this point to pass a new fd.
- If the application closes the connection for any reason (restart, some
reconfiguration internal to the application) and OVS is re-started at
the same time, the abstract socket will be gone. A persistent daemon
is needed to hold it.
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-24 13:11 ` Ilya Maximets
@ 2021-03-24 15:07 ` Stefan Hajnoczi
2021-03-25 9:35 ` Stefan Hajnoczi
1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-24 15:07 UTC (permalink / raw)
To: Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>>>>> system:
> >>>>>>>>>> "failed to bind to vhu: Address already in use; remove it and try
> >>> again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> >>> users
> >>>>>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>>>>> solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly this could be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at
> >>> the
> >>>>>>> same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator,
> >>> systemd,
> >>>>>>> etc too because it refuses to run the same service twice.
> >>>>>>
> >>>>>> Sure. I understand that. My point was that these could be 2 different
> >>>>>> applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>
> >>>>>>>> And what if this is
> >>>>>>>> a different application that tries to create a socket on a same path?
> >>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
> >>>>>>>> to an existing socket file and will fail. Subsequently port creation
> >>>>>>>> in OVS will fail. We can't allow OVS to unlink files because this
> >>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
> >>>>>>>> access to and we also has no idea if it's a QEMU that created a file
> >>>>>>>> or it was a virtio-user application or someone else.
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that
> >>> networking
> >>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> >>> device
> >>>>>>> or restart QEMU, depending on whether they need to keep the guest
> >>>>>>> running or not. This is a misconfiguration that is recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but with a high cost. Restart of a VM is
> >>> rarely
> >>>>>> desirable. And the application inside the guest might not feel itself
> >>>>>> well after hot re-plug of a device that it actively used. I'd expect
> >>>>>> a DPDK application that runs inside a guest on some virtio-net device
> >>>>>> to crash after this kind of manipulations. Especially, if it uses some
> >>>>>> older versions of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about.
> >>>>> There are many ways for users to misconfigure things when working with
> >>>>> system tools. If it's possible to catch misconfigurations that is
> >>>>> preferrable. In this case it's just the way pathname AF_UNIX domain
> >>>>> sockets work and IMO it's better not to have problems starting the
> >>>>> service due to stale files than to insist on preventing
> >>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >>>>> successful, so ¯\_(ツ)_/¯.
> >>>>>
> >>>>>>>
> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>>>>> create a security issue. I don't know the security model of OVS.
> >>>>>>
> >>>>>> In general privileges of a ovs-vswitchd daemon might be completely
> >>>>>> different from privileges required to invoke control utilities or
> >>>>>> to access the configuration database. SO, yes, we should not allow
> >>>>>> that.
> >>>>>
> >>>>> That can be locked down by restricting the socket path to a file beneath
> >>>>> /var/run/ovs/vhost-user/.
> >>>>>
> >>>>>>>
> >>>>>>>> There are, probably, ways to detect if there is any alive process
> >>> that
> >>>>>>>> has this socket open, but that sounds like too much for this purpose,
> >>>>>>>> also I'm not sure if it's possible if actual user is in a different
> >>>>>>>> container.
> >>>>>>>> So I don't see a good reliable way to detect these conditions. This
> >>>>>>>> falls on shoulders of a higher level management software or a user to
> >>>>>>>> clean these socket files up before adding ports.
> >>>>>>>
> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
> >>> Abstract
> >>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address
> >>>>>>> disappears when there is no process listening anymore.
> >>>>>>
> >>>>>> OVS is usually started right on the host in a main network namespace.
> >>>>>> In case it's started in a pod, it will run in a separate container but
> >>>>>> configured with a host network. Applications almost exclusively runs
> >>>>>> in separate pods.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
> >>> Broker.
> >>>>>>>>>
> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>>>>> closely managed by one piece of software:
> >>>>>>>>>
> >>>>>>>>> 1. Unlink the socket path.
> >>>>>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>>>> RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>>>>> 4. In the meantime the VMM can open the socket path and call
> >>> connect(2).
> >>>>>>>>> As soon as the vhost-user device backend calls accept(2) the
> >>>>>>>>> connection will proceed (there is no need for sleeping).
> >>>>>>>>>
> >>>>>>>>> This approach works across containers without a broker.
> >>>>>>>>
> >>>>>>>> Not sure if I fully understood a question here, but anyway.
> >>>>>>>>
> >>>>>>>> This approach works fine if you know what application to run.
> >>>>>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>>>>> with virtio-user ports running inside a container and want to
> >>>>>>>> have a network connection. Also, this application needs to run
> >>>>>>>> virtio-user in server mode, otherwise restart of the OVS will
> >>>>>>>> require restart of the application. So, you basically need to
> >>>>>>>> rely on a third-party application to create a socket with a right
> >>>>>>>> name and in a correct location that is shared with a host, so
> >>>>>>>> OVS can find it and connect.
> >>>>>>>>
> >>>>>>>> In a VM world everything is much more simple, since you have
> >>>>>>>> a libvirt and QEMU that will take care of all of these stuff
> >>>>>>>> and which are also under full control of management software
> >>>>>>>> and a system administrator.
> >>>>>>>> In case of a container with a "random" DPDK application inside
> >>>>>>>> there is no such entity that can help. Of course, some solution
> >>>>>>>> might be implemented in docker/podman daemon to create and manage
> >>>>>>>> outside-looking sockets for an application inside the container,
> >>>>>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>>>>> ever will.
> >>>>>>>
> >>>>>>> Wait, when you say there is no entity like management software or a
> >>>>>>> system administrator, then how does OVS know to instantiate the new
> >>>>>>> port? I guess something still needs to invoke ovs-ctl add-port?
> >>>>>>
> >>>>>> I didn't mean that there is no any application that configures
> >>>>>> everything. Of course, there is. I mean that there is no such
> >>>>>> entity that abstracts all that socket machinery from the user's
> >>>>>> application that runs inside the container. QEMU hides all the
> >>>>>> details of the connection to vhost backend and presents the device
> >>>>>> as a PCI device with a network interface wrapping from the guest
> >>>>>> kernel. So, the application inside VM shouldn't care what actually
> >>>>>> there is a socket connected to OVS that implements backend and
> >>>>>> forward traffic somewhere. For the application it's just a usual
> >>>>>> network interface.
> >>>>>> But in case of a container world, application should handle all
> >>>>>> that by creating a virtio-user device that will connect to some
> >>>>>> socket, that has an OVS on the other side.
> >>>>>>
> >>>>>>>
> >>>>>>> Can you describe the steps used today (without the broker) for
> >>>>>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>>>>> Although my interest is in the vhost-user protocol I think it's
> >>>>>>> necessary to understand the OVS requirements here and I know little
> >>>>>>> about them.
> >>>>>>>> I might describe some things wrong since I worked with k8s and CNI
> >>>>>> plugins last time ~1.5 years ago, but the basic schema will look
> >>>>>> something like this:
> >>>>>>
> >>>>>> 1. user decides to start a new pod and requests k8s to do that
> >>>>>> via cmdline tools or some API calls.
> >>>>>>
> >>>>>> 2. k8s scheduler looks for available resources asking resource
> >>>>>> manager plugins, finds an appropriate physical host and asks
> >>>>>> local to that node kubelet daemon to launch a new pod there.
> >>>>>>
> >>>>
> >>>> When the CNI is called, the pod has already been created, i.e: a PodID
> >>> exists
> >>>> and so does an associated network namespace. Therefore, everything that
> >>> has to
> >>>> do with the runtime spec such as mountpoints or devices cannot be
> >>> modified by
> >>>> this time.
> >>>>
> >>>> That's why the Device Plugin API is used to modify the Pod's spec before
> >>> the CNI
> >>>> chain is called.
> >>>>
> >>>>>> 3. kubelet asks local CNI plugin to allocate network resources
> >>>>>> and annotate the pod with required mount points, devices that
> >>>>>> needs to be passed in and environment variables.
> >>>>>> (this is, IIRC, a gRPC connection. It might be a multus-cni
> >>>>>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is
> >>>>>> usually deployed as a system DaemonSet, so it runs in a
> >>>>>> separate pod.
> >>>>>>
> >>>>>> 4. Assuming that vhost-user connection requested in server mode.
> >>>>>> CNI plugin will:
> >>>>>> 4.1 create a directory for a vhost-user socket.
> >>>>>> 4.2 add this directory to pod annotations as a mount point.
> >>>>
> >>>> I believe this is not possible, it would have to inspect the pod's spec
> >>> or
> >>>> otherwise determine an existing mount point where the socket should be
> >>> created.
> >>>
> >>> Uff. Yes, you're right. Thanks for your clarification.
> >>> I mixed up CNI and Device Plugin here.
> >>>
> >>> CNI itself is not able to annotate new resources to the pod, i.e.
> >>> create new mounts or something like this. And I don't recall any
> >>> vhost-user device plugins. Is there any? There is an SR-IOV device
> >>> plugin, but its purpose is to allocate and pass PCI devices, not create
> >>> mounts for vhost-user.
> >>>
> >>> So, IIUC, right now user must create the directory and specify
> >>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> >>> or something like this, right?
> >>>
> >>> Looking at userspace-cni-network-plugin, it actually just parses
> >>> annotations to find the shared directory and fails if there is
> >>> no any:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >>>
> >>> And examples suggests to specify a directory to mount:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >>>
> >>> Looks like this is done by user's hands.
> >>>
> >>> Yes, I am one of the primary authors of Userspace CNI. Currently, the
> >> directory is by hand. Long term thought was to have a mutating
> >> webhook/admission controller inject a directory into the podspec. Not sure
> >> if it has changed, but I think when I was originally doing this work, OvS
> >> only lets you choose the directory at install time, so it has to be
> >> something like /var/run/openvswitch/. You can choose the socketfile name
> >> and maybe a subdirectory off the main directory, but not the full path.
> >>
> >> One of the issues I was trying to solve was making sure ContainerA couldn't
> >> see ContainerB's socketfiles. That's where the admission controller could
> >> create a unique subdirectory for each container under
> >> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> >> always took precedence so that work never completed.
> >
> > If the CNI plugin has access to the container's network namespace, could
> > it create an abstract AF_UNIX listen socket?
> >
> > That way the application inside the container could connect to an
> > AF_UNIX socket and there is no need to manage container volumes.
> >
> > I'm not familiar with the Open VSwitch, so I'm not sure if there is a
> > sane way of passing the listen socket fd into ovswitchd from the CNI
> > plugin?
> >
> > The steps:
> > 1. CNI plugin enters container's network namespace and opens an abstract
> > AF_UNIX listen socket.
> > 2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
> > add-port step. Instead of using type=dpdkvhostuserclient
> > options:vhost-server-path=/tmp/dpdkvhostclient0 it instead create a
> > dpdkvhostuser server with the listen fd.
>
> For this step you will need a side channel, i.e. a separate unix socket
> created by ovs-vswitchd (most likely, created by rte_vhost on
> rte_vhost_driver_register() call).
>
> The problem is that ovs-vsctl talks with ovsdb-server and adds the new
> port -- just a new row in the 'interface' table of the database.
> ovs-vswitchd receives update from the database and creates the actual
> port. All the communications done through JSONRPC, so passing fds is
> not an option.
>
> > 3. When the container starts, it connects to the abstract AF_UNIX
> > socket. The abstract socket name is provided to the container at
> > startup time in an environment variable. The name is unique, at least
> > to the pod, so that multiple containers in the pod can run vhost-user
> > applications.
>
> Few more problems with this solution:
>
> - We still want to run application inside the container in a server mode,
> because virtio-user PMD in client mode doesn't support re-connection.
>
> - How to get this fd again after the OVS restart? CNI will not be invoked
> at this point to pass a new fd.
>
> - If application will close the connection for any reason (restart, some
> reconfiguration internal to the application) and OVS will be re-started
> at the same time, abstract socket will be gone. Need a persistent daemon
> to hold it.
Okay, if there is no component that has a lifetime suitable for holding
the abstract listen socket, then using pathname AF_UNIX sockets seems
like a better approach.
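For reference, the "unlink before bind" handling of stale socket files
discussed earlier in the thread boils down to roughly the following
sketch (illustrative only; a real implementation would also want some
protection against hijacking a live socket, e.g. a pidfile or a
restricted directory):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Listen on a pathname AF_UNIX socket, removing a stale socket file
 * first so that bind() does not fail with "Address already in use". */
static int
pathname_listen_socket(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    /* Remove a possibly stale file left behind by a previous instance. */
    unlink(path);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}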
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-19 16:45 ` Ilya Maximets
@ 2021-03-24 20:56 ` Maxime Coquelin
2021-03-24 21:39 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Maxime Coquelin @ 2021-03-24 20:56 UTC (permalink / raw)
To: Ilya Maximets, Stefan Hajnoczi
Cc: Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
Hi Ilya,
On 3/19/21 5:45 PM, Ilya Maximets wrote:
> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>> Hi Ilya,
>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>> is restarted then existing vhost-user connections drop with an error but
>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>> OVS instance will set up.
>>>
>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>
>> Well, AFAIK, qemu reconnects client connections only:
>>
>> ``reconnect`` sets the timeout for reconnecting on non-server
>> sockets when the remote end goes away. qemu will delay this many
>> seconds and then attempt to reconnect. Zero disables reconnecting,
>> and is the default.
>>
>> I'm not sure about exact reason. It was historically this way.
>> For me it doesn't make much sense. I mean, your right that it's
>> just a socket, so it should not matter who listens and who connects.
>> If reconnection is possible in one direction, it should be possible
>> in the opposite direction too.
>
> Sorry, my thought slipped. :) Yes, QEMU supports re-connection
> for client sockets. So, in theory, dpdkvhostuser ports should work
> after re-connection. And that would be nice. I don't remember
> right now why this doesn't work... Maybe vhost-user parts in QEMU
> doesn't handle this case. Need to dig some more into that and refresh
> my memory. It was so long ago...
>
> Maxime, do you remember?
Sorry for the delay. I didn't remember, so I wanted to give it a try.
I can confirm reconnect works with QEMU as client and with Vhost PMD as
server with:
<interface type='vhostuser'>
  <mac address='56:48:4f:53:54:01'/>
  <source type='unix' path='/tmp/vhost-user1' mode='client'>
    <reconnect enabled='yes' timeout='1'/>
  </source>
  <model type='virtio'/>
  <driver name='vhost' rx_queue_size='256'/>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</interface>
>
>>
>> dpdkvhostuser was deprecated just to scare users and force them to
>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>
>> "OVS service restarted and network is lost now".
>>
>> BTW, virtio-user ports in DPDK doesn't support re-connection in client
>> mode too.
>
> This is still true, though. virtio-user in client mode doesn't reconnect.
That could be added, though it is maybe not as important for containers
as it is for VMs, given the ephemeral nature of containers?
Regards,
Maxime
>
>>
>> BTW2, with SocketPair Broker it might be cheaper to implement server
>> reconnection in QEMU because all connections in these case are client
>> connections, i.e. both ends will connect() to a broker.
>>
>> Bets regards, Ilya Maximets.
>>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-24 20:56 ` Maxime Coquelin
@ 2021-03-24 21:39 ` Ilya Maximets
2021-03-24 21:51 ` Maxime Coquelin
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-24 21:39 UTC (permalink / raw)
To: Maxime Coquelin, Ilya Maximets, Stefan Hajnoczi
Cc: Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
On 3/24/21 9:56 PM, Maxime Coquelin wrote:
> Hi Ilya,
>
> On 3/19/21 5:45 PM, Ilya Maximets wrote:
>> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>>> Hi Ilya,
>>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>>> is restarted then existing vhost-user connections drop with an error but
>>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>>> OVS instance will set up.
>>>>
>>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>>
>>> Well, AFAIK, qemu reconnects client connections only:
>>>
>>> ``reconnect`` sets the timeout for reconnecting on non-server
>>> sockets when the remote end goes away. qemu will delay this many
>>> seconds and then attempt to reconnect. Zero disables reconnecting,
>>> and is the default.
>>>
>>> I'm not sure about exact reason. It was historically this way.
>>> For me it doesn't make much sense. I mean, your right that it's
>>> just a socket, so it should not matter who listens and who connects.
>>> If reconnection is possible in one direction, it should be possible
>>> in the opposite direction too.
>>
>> Sorry, my thought slipped. :) Yes, QEMU supports re-connection
>> for client sockets. So, in theory, dpdkvhostuser ports should work
>> after re-connection. And that would be nice. I don't remember
>> right now why this doesn't work... Maybe vhost-user parts in QEMU
>> doesn't handle this case. Need to dig some more into that and refresh
>> my memory. It was so long ago...
>>
>> Maxime, do you remember?
>
> Sorry for the delay. I didn't remember, so I wanted to have a try.
>
> I can confirm reconnect works with QEMU as client and with Vhost PMD as
> server with:
>
>
> <interface type='vhostuser'>
> <mac address='56:48:4f:53:54:01'/>
> <source type='unix' path='/tmp/vhost-user1' mode='client'>
> <reconnect enabled='yes' timeout='1'/>
> </source>
> <model type='virtio'/>
> <driver name='vhost' rx_queue_size='256'/>
> <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
> function='0x0'/>
> </interface>
Cool, thanks for checking. :)
If it works with vhost PMD, it should probably work with OVS too.
There are still a couple of problems:
1. OpenStack Nova doesn't support configuration of a 'reconnect'
in the libvirt XML (it only adds the 'source'):
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L1834
2. The 'reconnect' configuration is only supported starting from libvirt 4.1.0.
That was released in 2018, but some systems still use older versions,
e.g. Ubuntu 18.04, which will be supported until 2023, uses libvirt 4.0.0.
>
>>
>>>
>>> dpdkvhostuser was deprecated just to scare users and force them to
>>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>>
>>> "OVS service restarted and network is lost now".
>>>
>>> BTW, virtio-user ports in DPDK doesn't support re-connection in client
>>> mode too.
>>
>> This is still true, though. virtio-user in client mode doesn't reconnect.
>
> That could be added, and it is maybe not as important for containers as
> it is for VM to support it, given the ephemeral nature of containers?
Well, a restart of OVS should not require restarting all the containers
on the host, even though they are "stateless".
BTW, some infrastructure changes that I made in this series might be
reused to implement client-side reconnection for virtio-user.
>
> Regards,
> Maxime
>
>>
>>>
>>> BTW2, with SocketPair Broker it might be cheaper to implement server
>>> reconnection in QEMU because all connections in these case are client
>>> connections, i.e. both ends will connect() to a broker.
>>>
>>> Bets regards, Ilya Maximets.
>>>
>>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-24 21:39 ` Ilya Maximets
@ 2021-03-24 21:51 ` Maxime Coquelin
2021-03-24 22:17 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Maxime Coquelin @ 2021-03-24 21:51 UTC (permalink / raw)
To: Ilya Maximets, Stefan Hajnoczi
Cc: Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
On 3/24/21 10:39 PM, Ilya Maximets wrote:
> On 3/24/21 9:56 PM, Maxime Coquelin wrote:
>> Hi Ilya,
>>
>> On 3/19/21 5:45 PM, Ilya Maximets wrote:
>>> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>>>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>>>> Hi Ilya,
>>>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>>>> is restarted then existing vhost-user connections drop with an error but
>>>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>>>> OVS instance will set up.
>>>>>
>>>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>>>
>>>> Well, AFAIK, qemu reconnects client connections only:
>>>>
>>>> ``reconnect`` sets the timeout for reconnecting on non-server
>>>> sockets when the remote end goes away. qemu will delay this many
>>>> seconds and then attempt to reconnect. Zero disables reconnecting,
>>>> and is the default.
>>>>
>>>> I'm not sure about exact reason. It was historically this way.
>>>> For me it doesn't make much sense. I mean, your right that it's
>>>> just a socket, so it should not matter who listens and who connects.
>>>> If reconnection is possible in one direction, it should be possible
>>>> in the opposite direction too.
>>>
>>> Sorry, my thought slipped. :) Yes, QEMU supports re-connection
>>> for client sockets. So, in theory, dpdkvhostuser ports should work
>>> after re-connection. And that would be nice. I don't remember
>>> right now why this doesn't work... Maybe vhost-user parts in QEMU
>>> doesn't handle this case. Need to dig some more into that and refresh
>>> my memory. It was so long ago...
>>>
>>> Maxime, do you remember?
>>
>> Sorry for the delay. I didn't remember, so I wanted to have a try.
>>
>> I can confirm reconnect works with QEMU as client and with Vhost PMD as
>> server with:
>>
>>
>> <interface type='vhostuser'>
>> <mac address='56:48:4f:53:54:01'/>
>> <source type='unix' path='/tmp/vhost-user1' mode='client'>
>> <reconnect enabled='yes' timeout='1'/>
>> </source>
>> <model type='virtio'/>
>> <driver name='vhost' rx_queue_size='256'/>
>> <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
>> function='0x0'/>
>> </interface>
>
> Cool, thanks for checking. :)
> If it works with vhost PMD, it should probably work with OVS too.
>
> There is still a couple of problems:
>
> 1. OpenStack Nova doesn't support configuration of a 'reconnect'
> in libvirt xml (it only adds the 'source'):
> https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L1834
>
> 2. 'reconnect' configuration supported only starting from libvirt 4.1.0.
> It's released in 2018, but still some systems are using older versions.
> e.g. Ubuntu 18.04 which will be supported until 2023 uses libvirt 4.0.0.
Is it really a problem? We can keep using OVS as client for OSP, and use
OVS as server for containers.
>>
>>>
>>>>
>>>> dpdkvhostuser was deprecated just to scare users and force them to
>>>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>>>
>>>> "OVS service restarted and network is lost now".
>>>>
>>>> BTW, virtio-user ports in DPDK doesn't support re-connection in client
>>>> mode too.
>>>
>>> This is still true, though. virtio-user in client mode doesn't reconnect.
>>
>> That could be added, and it is maybe not as important for containers as
>> it is for VM to support it, given the ephemeral nature of containers?
>
> Well, restart of OVS should not require restarting of all the containers
> on the host even though they are "stateless".
>
> BTW, some infrastructure changes that I made in this series might be
> reused to implement client-side reconnection for virtio-user.
Great, I'll look at the series when we work on adding reconnect for
clients.
Thanks,
Maxime
>>
>> Regards,
>> Maxime
>>
>>>
>>>>
>>>> BTW2, with SocketPair Broker it might be cheaper to implement server
>>>> reconnection in QEMU because all connections in these case are client
>>>> connections, i.e. both ends will connect() to a broker.
>>>>
>>>> Bets regards, Ilya Maximets.
>>>>
>>>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-24 21:51 ` Maxime Coquelin
@ 2021-03-24 22:17 ` Ilya Maximets
0 siblings, 0 replies; 38+ messages in thread
From: Ilya Maximets @ 2021-03-24 22:17 UTC (permalink / raw)
To: Maxime Coquelin, Ilya Maximets, Stefan Hajnoczi
Cc: Chenbo Xia, dev, Adrian Moreno, Julia Suvorova
On 3/24/21 10:51 PM, Maxime Coquelin wrote:
>
>
> On 3/24/21 10:39 PM, Ilya Maximets wrote:
>> On 3/24/21 9:56 PM, Maxime Coquelin wrote:
>>> Hi Ilya,
>>>
>>> On 3/19/21 5:45 PM, Ilya Maximets wrote:
>>>> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>>>>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>>>>> Hi Ilya,
>>>>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>>>>> is restarted then existing vhost-user connections drop with an error but
>>>>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>>>>> OVS instance will set up.
>>>>>>
>>>>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>>>>
>>>>> Well, AFAIK, qemu reconnects client connections only:
>>>>>
>>>>> ``reconnect`` sets the timeout for reconnecting on non-server
>>>>> sockets when the remote end goes away. qemu will delay this many
>>>>> seconds and then attempt to reconnect. Zero disables reconnecting,
>>>>> and is the default.
>>>>>
>>>>> I'm not sure about exact reason. It was historically this way.
>>>>> For me it doesn't make much sense. I mean, your right that it's
>>>>> just a socket, so it should not matter who listens and who connects.
>>>>> If reconnection is possible in one direction, it should be possible
>>>>> in the opposite direction too.
>>>>
>>>> Sorry, my thought slipped. :) Yes, QEMU supports re-connection
>>>> for client sockets. So, in theory, dpdkvhostuser ports should work
>>>> after re-connection. And that would be nice. I don't remember
>>>> right now why this doesn't work... Maybe vhost-user parts in QEMU
>>>> doesn't handle this case. Need to dig some more into that and refresh
>>>> my memory. It was so long ago...
>>>>
>>>> Maxime, do you remember?
>>>
>>> Sorry for the delay. I didn't remember, so I wanted to have a try.
>>>
>>> I can confirm reconnect works with QEMU as client and with Vhost PMD as
>>> server with:
>>>
>>>
>>> <interface type='vhostuser'>
>>> <mac address='56:48:4f:53:54:01'/>
>>> <source type='unix' path='/tmp/vhost-user1' mode='client'>
>>> <reconnect enabled='yes' timeout='1'/>
>>> </source>
>>> <model type='virtio'/>
>>> <driver name='vhost' rx_queue_size='256'/>
>>> <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
>>> function='0x0'/>
>>> </interface>
>>
>> Cool, thanks for checking. :)
>> If it works with vhost PMD, it should probably work with OVS too.
>>
>> There is still a couple of problems:
>>
>> 1. OpenStack Nova doesn't support configuration of a 'reconnect'
>> in libvirt xml (it only adds the 'source'):
>> https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L1834
>>
>> 2. 'reconnect' configuration supported only starting from libvirt 4.1.0.
>> It's released in 2018, but still some systems are using older versions.
>> e.g. Ubuntu 18.04 which will be supported until 2023 uses libvirt 4.0.0.
>
> Is it really a problem? We can keep using OVS as client for OSP,
Sure, no problem here. I think we scared users enough that they
mostly switched to this configuration, and that's a good thing.
> and use OVS as server for containers.
There is a problem here: OVS emits a deprecation warning each
time you try to use dpdkvhostuser ports. There are no functional
problems, though, except that deprecation also implies that
we're consciously not adding new features to these ports.
Another thing is that the current state of the art of k8s CNI requires
mounting the whole /var/run/openvswitch or managing directories
by hand. So, IMO, the Broker idea is still valid in some form in terms
of hassle reduction for users.
>
>>>
>>>>
>>>>>
>>>>> dpdkvhostuser was deprecated just to scare users and force them to
>>>>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>>>>
>>>>> "OVS service restarted and network is lost now".
>>>>>
>>>>> BTW, virtio-user ports in DPDK doesn't support re-connection in client
>>>>> mode too.
>>>>
>>>> This is still true, though. virtio-user in client mode doesn't reconnect.
>>>
>>> That could be added, and it is maybe not as important for containers as
>>> it is for VM to support it, given the ephemeral nature of containers?
>>
>> Well, restart of OVS should not require restarting of all the containers
>> on the host even though they are "stateless".
>>
>> BTW, some infrastructure changes that I made in this series might be
>> reused to implement client-side reconnection for virtio-user.
>
> Great, I'll look at the series when we'll work on adding reconnect for
> clients.
Please take a look at patch #1 in this set sooner.
It's a bug fix. :)
>
> Thanks,
> Maxime
>
>>>
>>> Regards,
>>> Maxime
>>>
>>>>
>>>>>
>>>>> BTW2, with SocketPair Broker it might be cheaper to implement server
>>>>> reconnection in QEMU because all connections in these case are client
>>>>> connections, i.e. both ends will connect() to a broker.
>>>>>
>>>>> Bets regards, Ilya Maximets.
>>>>>
>>>>
>>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
@ 2021-03-25 8:32 ` Maxime Coquelin
2021-04-07 7:21 ` Xia, Chenbo
0 siblings, 1 reply; 38+ messages in thread
From: Maxime Coquelin @ 2021-03-25 8:32 UTC (permalink / raw)
To: Ilya Maximets
Cc: Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova,
stable, Zhiyong Yang
On 3/17/21 9:25 PM, Ilya Maximets wrote:
> virtio_user_dev_server_reconnect() is typically called from the
> interrupt context while checking the link state:
>
> vhost_user_update_link_state()
> --> virtio_user_dev_server_reconnect()
>
> Under these conditions, callback unregistering always fails. This means
> that listenfd is never unregistered and continues to trigger interrupts.
> For example, if a second client tries to connect to the same socket,
> the server will receive interrupts indefinitely because it will not
> accept the connection while the listen fd is readable and generates epoll events.
>
> Fix that by moving reconfiguration of interrupts out of the
> interrupt context to alarm handler.
>
> 'virtio_user_dev_delayed_handler' renamed to
> 'virtio_user_dev_delayed_disconnect_handler' to better reflect its
> purpose.
>
> Additionally improved error logging around interrupt management.
>
> Fixes: bd8f50a45d0f ("net/virtio-user: support server mode")
> Cc: stable@dpdk.org
>
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>
> CC: Zhiyong Yang <zhiyong.yang@intel.com>
>
> drivers/net/virtio/virtio_user/vhost_user.c | 4 +-
> .../net/virtio/virtio_user/virtio_user_dev.c | 70 ++++++++++++++-----
> .../net/virtio/virtio_user/virtio_user_dev.h | 2 +-
> 3 files changed, 57 insertions(+), 19 deletions(-)
>
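The deferral pattern described in the commit message looks roughly like
the sketch below; it is only an illustration with made-up function names,
not the actual driver code:

#include <rte_alarm.h>
#include <rte_interrupts.h>

static void example_intr_cb(void *param);

/* Runs from the EAL alarm thread, i.e. outside the interrupt context,
 * so unregistering the callback can succeed here.  On transient failure
 * (e.g. -EAGAIN) simply re-arm the alarm and try again later. */
static void
example_delayed_reconfig(void *param)
{
        struct rte_intr_handle *intr_handle = param;

        if (rte_intr_callback_unregister(intr_handle,
                                         example_intr_cb, param) < 0)
                rte_eal_alarm_set(1000, example_delayed_reconfig, param);
}

/* Runs in the interrupt context; it cannot unregister itself directly,
 * so it only schedules the real work via an alarm. */
static void
example_intr_cb(void *param)
{
        rte_eal_alarm_set(1, example_delayed_reconfig, param);
}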
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Thanks,
Maxime
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-24 13:11 ` Ilya Maximets
2021-03-24 15:07 ` Stefan Hajnoczi
@ 2021-03-25 9:35 ` Stefan Hajnoczi
2021-03-25 11:00 ` Ilya Maximets
1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-25 9:35 UTC (permalink / raw)
To: Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> - How to get this fd again after the OVS restart? CNI will not be invoked
> at this point to pass a new fd.
>
> - If application will close the connection for any reason (restart, some
> reconfiguration internal to the application) and OVS will be re-started
> at the same time, abstract socket will be gone. Need a persistent daemon
> to hold it.
I remembered that these two points can be solved by sd_notify(3)
FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
this is the case (at least in the CNI use case)?
https://www.freedesktop.org/software/systemd/man/sd_notify.html
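A minimal sketch of that flow, assuming libsystemd and a service with
FileDescriptorStoreMax= configured (the fd name and error handling are
illustrative):

#include <string.h>
#include <systemd/sd-daemon.h>

/* Hand a listening socket to systemd's fd store for safekeeping. */
static int
store_socket_fd(int fd)
{
        return sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=vhu1",
                                      &fd, 1);
}

/* After the service restart, look the stored socket up again by name. */
static int
recover_socket_fd(void)
{
        char **names = NULL;
        int i, n = sd_listen_fds_with_names(0, &names);

        for (i = 0; i < n; i++)
                if (names != NULL && strcmp(names[i], "vhu1") == 0)
                        return SD_LISTEN_FDS_START + i;

        return -1; /* 'names' and unused fds should be cleaned up; omitted. */
}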
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-25 9:35 ` Stefan Hajnoczi
@ 2021-03-25 11:00 ` Ilya Maximets
2021-03-25 16:43 ` Stefan Hajnoczi
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-25 11:00 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>> - How to get this fd again after the OVS restart? CNI will not be invoked
>> at this point to pass a new fd.
>>
>> - If application will close the connection for any reason (restart, some
>> reconfiguration internal to the application) and OVS will be re-started
>> at the same time, abstract socket will be gone. Need a persistent daemon
>> to hold it.
>
> I remembered that these two points can be solved by sd_notify(3)
> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> this is the case (at least in the CNI use case)?
>
> https://www.freedesktop.org/software/systemd/man/sd_notify.html
IIUC, these file descriptors are only passed on a restart of the service,
so the port-del + port-add scenario is not covered (and this is a very
common use case: users implement some configuration changes this
way, and it is also a possible internal scenario, e.g. this sequence
will be triggered internally to change the OpenFlow port number).
port-del will release all the resources, including the listening socket.
Keeping the fd around for later use is not an option, because OVS will not
know whether this port will be added back, and fds are a limited resource.
It's also unclear how to map these file descriptors to the particular ports
they belong to after a restart.
OVS could run as a system pod or as a systemd service. It differs from
one setup to another, so it might not be controlled by systemd.
Also, it behaves as an old-style daemon, so it closes all the file
descriptors, forks and so on. This might be adjusted, though, with
some rework of the daemonization procedure.
On a side note, it may be interesting to allow the user application to
create a socket and pass a pollable file descriptor directly to
rte_vhost_driver_register() instead of a socket path. This way
the user application may choose to use an abstract socket, a file
socket or any other future type of socket connection. This would
also allow the user application to store these sockets somewhere, or
receive them from systemd/init/other management software.
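Purely as a sketch of that idea: rte_vhost_driver_register_fd() below does
not exist today, it only shows what such an API could look like, while the
abstract-socket setup itself uses standard calls.

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical API: register a vhost-user server on an already created,
 * pollable listening fd instead of a filesystem path. */
int rte_vhost_driver_register_fd(int fd, uint64_t flags);

int
register_vhost_on_abstract_socket(const char *name, uint64_t flags)
{
        struct sockaddr_un un = { .sun_family = AF_UNIX };
        socklen_t len;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;

        /* A leading '\0' makes this a Linux abstract socket: no file on
         * disk and no stale socket files to clean up after a crash. */
        un.sun_path[0] = '\0';
        strncpy(un.sun_path + 1, name, sizeof(un.sun_path) - 2);
        len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);

        if (bind(fd, (struct sockaddr *)&un, len) < 0 || listen(fd, 1) < 0) {
                close(fd);
                return -1;
        }

        return rte_vhost_driver_register_fd(fd, flags);
}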
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-25 11:00 ` Ilya Maximets
@ 2021-03-25 16:43 ` Stefan Hajnoczi
2021-03-25 17:58 ` Ilya Maximets
0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-25 16:43 UTC (permalink / raw)
To: Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> > On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> >> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> >>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> - How to get this fd again after the OVS restart? CNI will not be invoked
> >> at this point to pass a new fd.
> >>
> >> - If application will close the connection for any reason (restart, some
> >> reconfiguration internal to the application) and OVS will be re-started
> >> at the same time, abstract socket will be gone. Need a persistent daemon
> >> to hold it.
> >
> > I remembered that these two points can be solved by sd_notify(3)
> > FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> > this is the case (at least in the CNI use case)?
> >
> > https://www.freedesktop.org/software/systemd/man/sd_notify.html
>
> IIUC, these file descriptors only passed on the restart of the service,
> so port-del + port-add scenario is not covered (and this is a very
> common usecase, users are implementing some configuration changes this
> way and also this is internally possible scenario, e.g. this sequence
> will be triggered internally to change the OpenFlow port number).
> port-del will release all the resources including the listening socket.
> Keeping the fd for later use is not an option, because OVS will not know
> if this port will be added back or not and fds is a limited resource.
If users of the CNI plugin are reasonably expected to do this then it
sounds like a blocker for the sd_notify(3) approach. Maybe it could be
fixed by introducing an atomic port-rename (?) operation, but this is
starting to sound too invasive.
> It's also unclear how to map these file descriptors to particular ports
> they belong to after restart.
The first fd would be a memfd containing a description of the remaining
fds plus any other crash recovery state that OVS wants.
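Something along these lines, assuming the fd store is used at all (the
manifest format here is made up):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <systemd/sd-daemon.h>

/* Store a small text manifest describing the other stored fds,
 * e.g. "vhu1=dpdkvhostuser\nvhu2=dpdkvhostuser\n". */
static int
store_manifest(const char *text)
{
        int fd = memfd_create("fd-manifest", MFD_CLOEXEC);

        if (fd < 0)
                return -1;
        dprintf(fd, "%s", text);
        return sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=manifest",
                                      &fd, 1);
}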
> OVS could run as a system pod or as a systemd service. It differs from
> one setup to another. So it might not be controlled by systemd.
Does the CNI plugin allow both configurations?
It's impossible to come up with one approach that works for everyone in
the general case (beyond the CNI plugin, beyond Kubernetes). I think we
need to enumerate use cases and decide which ones are currently not
addressed satisfactorily.
> Also, it behaves as an old-style daemon, so it closes all the file
> descriptors, forkes and so on. This might be adjusted, though, with
> some rework of the deamonization procedure.
Doesn't sound like fun but may be doable.
> On the side note, it maybe interesting to allow user application to
> create a socket and pass a pollable file descriptor directly to
> rte_vhost_driver_register() instead of a socket path. This way
> the user application may choose to use an abstract socket or a file
> socket or any other future type of socket connections. This will
> also allow user application to store these sockets somewhere, or
> receive them from systemd/init/other management software.
Yes, sounds useful.
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-25 16:43 ` Stefan Hajnoczi
@ 2021-03-25 17:58 ` Ilya Maximets
2021-03-30 15:01 ` Stefan Hajnoczi
0 siblings, 1 reply; 38+ messages in thread
From: Ilya Maximets @ 2021-03-25 17:58 UTC (permalink / raw)
To: Stefan Hajnoczi, Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On 3/25/21 5:43 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
>> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
>>> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
>>>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>>>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> - How to get this fd again after the OVS restart? CNI will not be invoked
>>>> at this point to pass a new fd.
>>>>
>>>> - If application will close the connection for any reason (restart, some
>>>> reconfiguration internal to the application) and OVS will be re-started
>>>> at the same time, abstract socket will be gone. Need a persistent daemon
>>>> to hold it.
>>>
>>> I remembered that these two points can be solved by sd_notify(3)
>>> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
>>> this is the case (at least in the CNI use case)?
>>>
>>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
>>
>> IIUC, these file descriptors only passed on the restart of the service,
>> so port-del + port-add scenario is not covered (and this is a very
>> common usecase, users are implementing some configuration changes this
>> way and also this is internally possible scenario, e.g. this sequence
>> will be triggered internally to change the OpenFlow port number).
>> port-del will release all the resources including the listening socket.
>> Keeping the fd for later use is not an option, because OVS will not know
>> if this port will be added back or not and fds is a limited resource.
>
> If users of the CNI plugin are reasonably expected to do this then it
> sounds like a blocker for the sd_notify(3) approach. Maybe it could be
> fixed by introducing an atomic port-rename (?) operation, but this is
> starting to sound too invasive.
It's hard to implement, actually. Things like 'port-rename' would
be internally implemented as del+add in most cases. Otherwise, it
would require a significant rework of OVS internals.
There are things that could be adjusted on the fly, but some
fundamental parts, like the OF port number that every other part depends
on, are not easy to change.
>
>> It's also unclear how to map these file descriptors to particular ports
>> they belong to after restart.
>
> The first fd would be a memfd containing a description of the remaining
> fds plus any other crash recovery state that OVS wants.
Yeah, I saw that it's possible to assign names to fds, so from this
perspective it's not a big problem.
>
>> OVS could run as a system pod or as a systemd service. It differs from
>> one setup to another. So it might not be controlled by systemd.
>
> Does the CNI plugin allow both configurations?
The CNI runs as a DaemonSet (a pod on each node) by itself, and it doesn't
matter if OVS is running on the host or in a different pod. They share
a part of the filesystem (/var/run/openvswitch/ and some others).
For example, the OVN-K8s CNI provides an OVS DaemonSet:
https://github.com/ovn-org/ovn-kubernetes/blob/master/dist/templates/ovs-node.yaml.j2
Users can use it, but it's not required, and the choice makes no difference
from the CNI point of view.
Everything is a pod in k8s, but you can run some parts on the host if
you wish.
In general, the CNI plugin only needs a network connection to the ovsdb-server
process. In practice, most CNI plugins connect via the control
socket in /var/run/openvswitch.
>
> It's impossible to come up with one approach that works for everyone in
> the general case (beyond the CNI plugin, beyond Kubernetes).
If we're looking for a solution to store abstract sockets somehow
for OVS, then it's hard to come up with something generic. It will
have a dependency on a specific init system anyway.
OTOH, the Broker solution will work for all cases. :) One may think
of a broker as a service that supplies abstract sockets to processes
from different namespaces. These sockets are already connected, for
convenience.
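In code, the core of such a service is just socketpair() plus SCM_RIGHTS
fd passing, roughly like this (error handling trimmed; a sketch, not the
actual One Socket implementation):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one fd to a connected client over its broker connection. */
static int
send_fd(int conn, int fd)
{
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(conn, &msg, 0) < 0 ? -1 : 0;
}

/* Two clients presented the same key: hand each of them one end of a
 * freshly created, already connected pair. */
static int
pair_clients(int client_a, int client_b)
{
        int sp[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) < 0)
                return -1;
        if (send_fd(client_a, sp[0]) < 0 || send_fd(client_b, sp[1]) < 0)
                return -1;
        return 0;
}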
> I think we
> need to enumerate use cases and decide which ones are currently not
> addressed satisfactorily.
>
>> Also, it behaves as an old-style daemon, so it closes all the file
>> descriptors, forkes and so on. This might be adjusted, though, with
>> some rework of the deamonization procedure.
>
> Doesn't sound like fun but may be doable.
It really doesn't sound like fun, so I'd like to avoid doing that unless
we have a solid use case.
>
>> On the side note, it maybe interesting to allow user application to
>> create a socket and pass a pollable file descriptor directly to
>> rte_vhost_driver_register() instead of a socket path. This way
>> the user application may choose to use an abstract socket or a file
>> socket or any other future type of socket connections. This will
>> also allow user application to store these sockets somewhere, or
>> receive them from systemd/init/other management software.
>
> Yes, sounds useful.
>
> Stefan
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-25 17:58 ` Ilya Maximets
@ 2021-03-30 15:01 ` Stefan Hajnoczi
0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2021-03-30 15:01 UTC (permalink / raw)
To: Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev,
Julia Suvorova, Marc-André Lureau, Daniel Berrange
On Thu, Mar 25, 2021 at 06:58:56PM +0100, Ilya Maximets wrote:
> On 3/25/21 5:43 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
> >> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> >>>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> >>>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >>>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> - How to get this fd again after the OVS restart? CNI will not be invoked
> >>>> at this point to pass a new fd.
> >>>>
> >>>> - If application will close the connection for any reason (restart, some
> >>>> reconfiguration internal to the application) and OVS will be re-started
> >>>> at the same time, abstract socket will be gone. Need a persistent daemon
> >>>> to hold it.
> >>>
> >>> I remembered that these two points can be solved by sd_notify(3)
> >>> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> >>> this is the case (at least in the CNI use case)?
> >>>
> >>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
> >>
> >> IIUC, these file descriptors only passed on the restart of the service,
> >> so port-del + port-add scenario is not covered (and this is a very
> >> common usecase, users are implementing some configuration changes this
> >> way and also this is internally possible scenario, e.g. this sequence
> >> will be triggered internally to change the OpenFlow port number).
> >> port-del will release all the resources including the listening socket.
> >> Keeping the fd for later use is not an option, because OVS will not know
> >> if this port will be added back or not and fds is a limited resource.
> >
> > If users of the CNI plugin are reasonably expected to do this then it
> > sounds like a blocker for the sd_notify(3) approach. Maybe it could be
> > fixed by introducing an atomic port-rename (?) operation, but this is
> > starting to sound too invasive.
>
> It's hard to implement, actually. Things like 'port-rename' will
> be internally implemented as del+add in most cases. Otherwise, it
> will require a significant rework of OVS internals.
> There are things that could be adjusted on the fly, but some
> fundamental parts like OF port number that every other part depends
> on are not easy to change.
I see. In that case the sd_notify(3) approach won't work.
> >> OVS could run as a system pod or as a systemd service. It differs from
> >> one setup to another. So it might not be controlled by systemd.
> >
> > Does the CNI plugin allow both configurations?
>
> CNI runs as a DaemonSet (pod on each node) by itself, and it doesn't
> matter if OVS is running on the host or in a different pod.
Okay.
> >
> > It's impossible to come up with one approach that works for everyone in
> > the general case (beyond the CNI plugin, beyond Kubernetes).
>
> If we're looking for a solution to store abstract sockets somehow
> for OVS then it's hard to came up with something generic. It will
> have dependency on specific init system anyway.
>
> OTOH, Broker solution will work for all cases. :) One may think
> of a broker as a service that supplies abstract sockets for processes
> from different namespaces. These sockets are already connected, for
> convenience.
I'm not sure what we're trying to come up with :). I haven't figured out
how much of what has been discussed is cosmetic and nice-to-have stuff
versus what is a real problem that needs a solution.
From the vhost-user point of view I would prefer to stick to the
existing UNIX domain socket approach. Any additional mechanism adds
extra complexity, won't be supported by all software, requires educating
users and developers, requires building new vhost-user application
container images, etc. IMO it's only worth doing if there is a real
problem with UNIX domain sockets that cannot be solved without
introducing a new connection mechanism.
Stefan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket
2021-03-25 8:32 ` Maxime Coquelin
@ 2021-04-07 7:21 ` Xia, Chenbo
0 siblings, 0 replies; 38+ messages in thread
From: Xia, Chenbo @ 2021-04-07 7:21 UTC (permalink / raw)
To: Maxime Coquelin, Ilya Maximets
Cc: dev, Adrian Moreno, Stefan Hajnoczi, Julia Suvorova, stable,
Yang, Zhiyong
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, March 25, 2021 4:32 PM
> To: Ilya Maximets <i.maximets@ovn.org>
> Cc: Xia, Chenbo <chenbo.xia@intel.com>; dev@dpdk.org; Adrian Moreno
> <amorenoz@redhat.com>; Stefan Hajnoczi <stefanha@redhat.com>; Julia Suvorova
> <jusual@redhat.com>; stable@dpdk.org; Yang, Zhiyong <zhiyong.yang@intel.com>
> Subject: Re: [PATCH 1/4] net/virtio: fix interrupt unregistering for listening
> socket
>
>
>
> On 3/17/21 9:25 PM, Ilya Maximets wrote:
> > virtio_user_dev_server_reconnect() is typically called from the
> > interrupt context while checking the link state:
> >
> > vhost_user_update_link_state()
> > --> virtio_user_dev_server_reconnect()
> >
> > Under these conditions, callback unregistering always fails. This means
> > that listenfd is never unregistered and continues to trigger interrupts.
> > For example, if a second client tries to connect to the same socket,
> > the server will receive interrupts indefinitely because it will not
> > accept the connection while the listen fd is readable and generates epoll events.
> >
> > Fix that by moving reconfiguration of interrupts out of the
> > interrupt context to alarm handler.
> >
> > 'virtio_user_dev_delayed_handler' renamed to
> > 'virtio_user_dev_delayed_disconnect_handler' to better reflect its
> > purpose.
> >
> > Additionally improved error logging around interrupt management.
> >
> > Fixes: bd8f50a45d0f ("net/virtio-user: support server mode")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > ---
> >
> > CC: Zhiyong Yang <zhiyong.yang@intel.com>
> >
> > drivers/net/virtio/virtio_user/vhost_user.c | 4 +-
> > .../net/virtio/virtio_user/virtio_user_dev.c | 70 ++++++++++++++-----
> > .../net/virtio/virtio_user/virtio_user_dev.h | 2 +-
> > 3 files changed, 57 insertions(+), 19 deletions(-)
> >
>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>
> Thanks,
> Maxime
This patch is applied to next-virtio/main with the below fix:
In virtio_user_dev_delayed_intr_reconfig_handler, change 'dev->port_id' to 'dev->hw.port_id'.
Thanks
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
` (5 preceding siblings ...)
2021-03-19 14:39 ` Stefan Hajnoczi
@ 2023-06-30 3:45 ` Stephen Hemminger
6 siblings, 0 replies; 38+ messages in thread
From: Stephen Hemminger @ 2023-06-30 3:45 UTC (permalink / raw)
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev, Adrian Moreno, Stefan Hajnoczi,
Julia Suvorova
On Wed, 17 Mar 2021 21:25:26 +0100
Ilya Maximets <i.maximets@ovn.org> wrote:
> TL;DR;
> Managing socket files is too much fun. :) And here is how this
> could be improved:
> https://github.com/igsilya/one-socket
> https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker.rst
> In particular for vhost-user case.
>
> In modern virtualization setups there are tens or hundreds of different
> socket files for different purposes. Sockets to manage various
> daemons, vhost-user sockets for various virtual devices, memif sockets
> for memif network interfaces and so on.
>
> In order to make things work in containerized environments software
> systems has to share these sockets with containers. In most cases
> this sharing is implemented as a shared directory mounted inside the
> container, because socket files could be re-created in runtime or even
> not be available at the container startup. For example, if they are
> created by the application inside the container.
>
> Even more configuration tricks required in order to share some sockets
> between different containers and not only with the host, e.g. to
> create service chains.
> And some housekeeping usually required for applications in case the
> socket server terminated abnormally and socket files left on a file
> system:
> "failed to bind to vhu: Address already in use; remove it and try again"
>
> Additionally, all applications (system and user's!) should follow
> naming conventions and place socket files in particular location on a
> file system to make things work.
>
> In particular, this applies to vhost-user sockets.
>
> This patch-set aims to eliminate most of the inconveniences by
> leveraging an infrastructure service provided by a SocketPair Broker.
>
> *SocketPair Broker* is a daemon that mediates establishment of direct
> socket-based connections between clients.
>
> *One Socket* is a reference implementation of a SocketPair Broker
> Daemon, SocketPair Broker Protocol and a helper library for client
> applications (libspbroker):
>
> https://github.com/igsilya/one-socket
>
> It's fully functional, but not completely ready for production use
> for now. See 'todo' section in README.rst in one-socket repository.
>
> Basically, it's a daemon that listens on a single unix socket
> (broker socket) and accepts clients. Client connects and provides a
> 'key'. If two clients provided the same 'key', One Socket daemon
> creates a pair of connected sockets with socketpair() and sends
> sides of this pair to these two clients. At this point two clients
> have a direct communication channel between them. They will disconnect
> from the broker and continue to operate and communicate normally.
>
> Workflow overview with pictures available here:
>
> https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker.rst
>
> Communication with a broker based on a SocketPair Broker Protocol:
>
> https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker-proto-spec.rst
>
>
> This patch-set extends vhost library, vhost pmd and virtio-user pmd to
> support SocketPair Broker as one of the connection methods.
> Usage example:
>
> # Starting a One Socket daemon with socket './one.socket':
> $ ONE_SOCKET_PATH=./one.socket ./one-socket
>
> # Starting testpmd #1 with virtio-user device in server mode:
> $ dpdk-testpmd --no-pci --in-memory --single-file-segments \
> --vdev="net_virtio_user,path=./one.socket,broker-key=MY-KEY,server=1"
>
> # Starting testpmd #2 with vhost pmd in client mode:
> $ dpdk-testpmd --no-pci --in-memory --single-file-segments \
> --vdev="eth_vhost0,iface=./one.socket,broker-key=MY-KEY,client=1"
>
> Details how to build and install One Socket are in README.rst in
> one-socket repository.
>
> DPDK side is the first step of implementation. Once available in DPDK,
> support could be easily added to Open vSwith or VPP or any DPDK-based
> application. Same support could be added to QEMU (found a volunteer
> for this part).
>
> Since SocketPair Broker is completely independent from the purposes
> connection will be used for, it has a potential to unify and replace
> all one-to-one unix socket connections on a host. This one persistent
> broker socket could be passed to any containers and can be used by
> any application greatly simplifying system management.
>
> Any feedback or suggestions on any component of this solution including
> this patch-set, One Socket Daemon, SocketPair Broker Protocol or
> libspbroker library are very welcome.
>
> *Note* about the patch set:
>
> First patch in a series is a *bug* fix, so it should be considered even
> outside of this series. It basically fixes unregistering of a
> listening socket that never happens in current code.
>
> The virtio-user part of the series heavily depends on this bug fix
> since broker connection unlike listening socket will not persist and
> will generate lots of interrupts if not unregistered.
>
> Ilya Maximets (4):
> net/virtio: fix interrupt unregistering for listening socket
> vhost: add support for SocketPair Broker
> net/vhost: add support for SocketPair Broker
> net/virtio: add support for SocketPair Broker
>
> doc/guides/nics/vhost.rst | 5 +
> doc/guides/nics/virtio.rst | 5 +
> doc/guides/prog_guide/vhost_lib.rst | 10 +
> drivers/net/vhost/rte_eth_vhost.c | 42 ++-
> drivers/net/virtio/meson.build | 6 +
> drivers/net/virtio/virtio_user/vhost_user.c | 122 ++++++++-
> .../net/virtio/virtio_user/virtio_user_dev.c | 142 +++++++---
> .../net/virtio/virtio_user/virtio_user_dev.h | 6 +-
> drivers/net/virtio/virtio_user_ethdev.c | 30 ++-
> lib/librte_vhost/meson.build | 7 +
> lib/librte_vhost/rte_vhost.h | 1 +
> lib/librte_vhost/socket.c | 245 ++++++++++++++++--
> 12 files changed, 550 insertions(+), 71 deletions(-)
>
Adding yet another dependency in DPDK for something that really should
be part of the OS (systemd?) seems like something that is not going
to be widely adopted or appreciated.
If this is a real problem, this should be solved outside of DPDK
with a common infrastructure available in almost all distros.
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2023-06-30 3:45 UTC | newest]
Thread overview: 38+ messages
2021-03-17 20:25 [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [PATCH 1/4] net/virtio: fix interrupt unregistering for listening socket Ilya Maximets
2021-03-25 8:32 ` Maxime Coquelin
2021-04-07 7:21 ` Xia, Chenbo
2021-03-17 20:25 ` [dpdk-dev] [RFC 2/4] vhost: add support for SocketPair Broker Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 3/4] net/vhost: " Ilya Maximets
2021-03-17 20:25 ` [dpdk-dev] [RFC 4/4] net/virtio: " Ilya Maximets
2021-03-18 17:52 ` [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user Stefan Hajnoczi
2021-03-18 19:47 ` Ilya Maximets
2021-03-18 20:14 ` Ilya Maximets
2021-03-19 14:16 ` Stefan Hajnoczi
2021-03-19 15:37 ` Ilya Maximets
2021-03-19 16:01 ` Stefan Hajnoczi
2021-03-19 16:02 ` Marc-André Lureau
2021-03-19 8:51 ` Marc-André Lureau
2021-03-19 11:25 ` Ilya Maximets
2021-03-19 14:05 ` Stefan Hajnoczi
2021-03-19 15:29 ` Ilya Maximets
2021-03-19 17:21 ` Stefan Hajnoczi
2021-03-23 17:57 ` Adrian Moreno
2021-03-23 18:27 ` Ilya Maximets
2021-03-23 20:54 ` Billy McFall
2021-03-24 12:05 ` Stefan Hajnoczi
2021-03-24 13:11 ` Ilya Maximets
2021-03-24 15:07 ` Stefan Hajnoczi
2021-03-25 9:35 ` Stefan Hajnoczi
2021-03-25 11:00 ` Ilya Maximets
2021-03-25 16:43 ` Stefan Hajnoczi
2021-03-25 17:58 ` Ilya Maximets
2021-03-30 15:01 ` Stefan Hajnoczi
2021-03-19 14:39 ` Stefan Hajnoczi
2021-03-19 16:11 ` Ilya Maximets
2021-03-19 16:45 ` Ilya Maximets
2021-03-24 20:56 ` Maxime Coquelin
2021-03-24 21:39 ` Ilya Maximets
2021-03-24 21:51 ` Maxime Coquelin
2021-03-24 22:17 ` Ilya Maximets
2023-06-30 3:45 ` Stephen Hemminger