From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: "Billy McFall" <bmcfall@redhat.com>,
"Adrian Moreno" <amorenoz@redhat.com>,
"Maxime Coquelin" <maxime.coquelin@redhat.com>,
"Chenbo Xia" <chenbo.xia@intel.com>,
dev@dpdk.org, "Julia Suvorova" <jusual@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Daniel Berrange" <berrange@redhat.com>
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
Date: Wed, 24 Mar 2021 15:07:27 +0000
Message-ID: <YFtVr0uM3Dpynqyv@stefanha-x1.localdomain>
In-Reply-To: <597d1ec7-d271-dc0d-522d-b900c9cb00ea@ovn.org>
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping is usually required for applications in case
> >>>>>>>>>> the socket server terminated abnormally and socket files were left
> >>>>>>>>>> on the file system:
> >>>>>>>>>> "failed to bind to vhu: Address already in use; remove it and try
> >>>>>>>>>> again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> >>>>>>>>> users might accidentally hijack an existing listen socket, but that
> >>>>>>>>> can be solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly could this be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at
> >>>>>>> the same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator,
> >>>>>>> systemd, etc., because they refuse to run the same service twice.
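To make the pattern concrete, here is a minimal sketch using plain POSIX
APIs. The file paths, names and error handling are illustrative, not taken
from QEMU's actual code:

    /* Hold an exclusive pidfile lock; once we hold it, any leftover
     * socket file can only be stale, so it is safe to unlink it before
     * binding. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int listen_unix(const char *pidfile, const char *sock_path)
    {
        int pidfd = open(pidfile, O_RDWR | O_CREAT, 0644);
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

        if (pidfd < 0 || fcntl(pidfd, F_SETLK, &fl) < 0)
            return -1;                /* another instance is running */
        dprintf(pidfd, "%d\n", getpid());

        unlink(sock_path);            /* remove the stale socket, if any */

        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 1) < 0)
            return -1;
        return fd;                    /* the listen fd */
    }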
> >>>>>>
> >>>>>> Sure. I understand that. My point was that these could be 2 different
> >>>>>> applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>
> >>>>>>>> And what if this is
> >>>>>>>> a different application that tries to create a socket on the same path?
> >>>>>>>> E.g. QEMU creates a socket (started in server mode) and the user
> >>>>>>>> accidentally creates a dpdkvhostuser port in Open vSwitch instead of
> >>>>>>>> dpdkvhostuserclient. This way the rte_vhost library will try to bind
> >>>>>>>> to an existing socket file and will fail. Subsequently, port creation
> >>>>>>>> in OVS will fail. We can't allow OVS to unlink files because this
> >>>>>>>> way OVS users would have the ability to unlink random sockets that
> >>>>>>>> OVS has access to, and we also have no idea whether it was QEMU that
> >>>>>>>> created the file, a virtio-user application, or someone else.
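For context, the rte_vhost setup behind a dpdkvhostuser/dpdkvhostuserclient
port boils down to roughly the following simplified sketch (not the actual
OVS code; error handling omitted):

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_vhost.h>

    /* With flags == 0 rte_vhost acts as the socket server on 'path'
     * (this is where a stale or foreign file causes the failure described
     * above); with RTE_VHOST_USER_CLIENT it connects to a socket created
     * by someone else, e.g. QEMU. */
    static int register_vhost_port(const char *path, bool client)
    {
        uint64_t flags = client ? RTE_VHOST_USER_CLIENT : 0;

        if (rte_vhost_driver_register(path, flags) != 0)
            return -1;

        return rte_vhost_driver_start(path);
    }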
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that
> >>>>>>> networking doesn't work. They can either hot unplug the QEMU
> >>>>>>> vhost-user-net device or restart QEMU, depending on whether they need
> >>>>>>> to keep the guest running or not. This is a misconfiguration that is
> >>>>>>> recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but at a high cost. Restarting a VM is rarely
> >>>>>> desirable. And the application inside the guest might not cope well
> >>>>>> with a hot re-plug of a device that it actively used. I'd expect
> >>>>>> a DPDK application that runs inside a guest on some virtio-net device
> >>>>>> to crash after this kind of manipulation, especially if it uses an
> >>>>>> older version of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about.
> >>>>> There are many ways for users to misconfigure things when working with
> >>>>> system tools. If it's possible to catch misconfigurations, that is
> >>>>> preferable. In this case it's just the way pathname AF_UNIX domain
> >>>>> sockets work and IMO it's better not to have problems starting the
> >>>>> service due to stale files than to insist on preventing
> >>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >>>>> successful, so ¯\_(ツ)_/¯.
> >>>>>
> >>>>>>>
> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>>>>> creates a security issue. I don't know the security model of OVS.
> >>>>>>
> >>>>>> In general, the privileges of the ovs-vswitchd daemon might be
> >>>>>> completely different from the privileges required to invoke control
> >>>>>> utilities or to access the configuration database. So, yes, we should
> >>>>>> not allow that.
> >>>>>
> >>>>> That can be locked down by restricting the socket path to a file beneath
> >>>>> /var/run/ovs/vhost-user/.
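Something along these lines, sketched with plain libc calls (the allowed
directory is just an example):

    #include <libgen.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #define VHOST_SOCK_DIR "/var/run/ovs/vhost-user"

    /* Accept a vhost-user socket path only if its parent directory
     * resolves (symlinks and ".." included) to the allowed directory.
     * The socket file itself may not exist yet, so only the directory
     * part is resolved. */
    static bool vhost_path_allowed(const char *path)
    {
        char copy[PATH_MAX], resolved[PATH_MAX];

        if (strlen(path) >= sizeof(copy))
            return false;
        strcpy(copy, path);

        if (realpath(dirname(copy), resolved) == NULL)
            return false;

        return strcmp(resolved, VHOST_SOCK_DIR) == 0;
    }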
> >>>>>
> >>>>>>>
> >>>>>>>> There are, probably, ways to detect if there is any live process
> >>>>>>>> that has this socket open, but that sounds like too much for this
> >>>>>>>> purpose, and I'm not sure if it's even possible if the actual user
> >>>>>>>> is in a different container.
> >>>>>>>> So I don't see a good, reliable way to detect these conditions. This
> >>>>>>>> falls on the shoulders of higher-level management software or the
> >>>>>>>> user to clean these socket files up before adding ports.
> >>>>>>>
> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
> >>>>>>> Abstract AF_UNIX sockets don't have a filesystem path and the socket
> >>>>>>> address disappears when there is no process listening anymore.
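For reference, an abstract socket is bound by putting a leading NUL byte in
sun_path; a minimal sketch (the name passed in is arbitrary):

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Bind an abstract AF_UNIX listen socket.  The kernel removes the
     * address automatically when the last socket bound to it is closed,
     * so nothing ever appears on disk and nothing can go stale. */
    static int listen_abstract(const char *name)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        socklen_t len;

        if (fd < 0)
            return -1;
        addr.sun_path[0] = '\0';                    /* abstract namespace */
        strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);

        /* The address length must cover only the leading NUL plus the
         * name, otherwise trailing NULs become part of the address. */
        len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
        if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 1) < 0)
            return -1;
        return fd;
    }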
> >>>>>>
> >>>>>> OVS is usually started right on the host in the main network namespace.
> >>>>>> In case it's started in a pod, it will run in a separate container but
> >>>>>> configured with host networking. Applications almost exclusively run
> >>>>>> in separate pods.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
> >>>>>>>>>> Broker.
> >>>>>>>>>
> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>>>>> closely managed by one piece of software:
> >>>>>>>>>
> >>>>>>>>> 1. Unlink the socket path.
> >>>>>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>>>> RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>>>>> 4. In the meantime the VMM can open the socket path and call
> >>>>>>>>>    connect(2). As soon as the vhost-user device backend calls
> >>>>>>>>>    accept(2) the connection will proceed (there is no need for
> >>>>>>>>>    sleeping).
> >>>>>>>>>
> >>>>>>>>> This approach works across containers without a broker.
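Step 3 relies on ordinary SCM_RIGHTS file descriptor passing. A minimal
sketch of handing the listen fd to the backend over an existing AF_UNIX
connection (names are illustrative):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an already-listening socket fd to the device backend over an
     * established AF_UNIX connection using SCM_RIGHTS. */
    static int send_listen_fd(int chan_fd, int listen_fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

        return sendmsg(chan_fd, &msg, 0) < 0 ? -1 : 0;
    }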
> >>>>>>>>
> >>>>>>>> Not sure if I fully understood the question here, but anyway.
> >>>>>>>>
> >>>>>>>> This approach works fine if you know what application to run.
> >>>>>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>>>>> with virtio-user ports running inside a container that wants to
> >>>>>>>> have a network connection. Also, this application needs to run
> >>>>>>>> virtio-user in server mode, otherwise a restart of OVS will
> >>>>>>>> require a restart of the application. So, you basically need to
> >>>>>>>> rely on a third-party application to create a socket with the right
> >>>>>>>> name and in the correct location that is shared with the host, so
> >>>>>>>> OVS can find it and connect.
> >>>>>>>>
> >>>>>>>> In a VM world everything is much simpler, since you have
> >>>>>>>> libvirt and QEMU that will take care of all of this stuff
> >>>>>>>> and which are also under full control of the management software
> >>>>>>>> and the system administrator.
> >>>>>>>> In case of a container with a "random" DPDK application inside,
> >>>>>>>> there is no such entity that can help. Of course, some solution
> >>>>>>>> might be implemented in the docker/podman daemon to create and manage
> >>>>>>>> outside-looking sockets for an application inside the container,
> >>>>>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>>>>> ever will be.
> >>>>>>>
> >>>>>>> Wait, when you say there is no entity like management software or a
> >>>>>>> system administrator, then how does OVS know to instantiate the new
> >>>>>>> port? I guess something still needs to invoke ovs-vsctl add-port?
> >>>>>>
> >>>>>> I didn't mean that there is no application at all that configures
> >>>>>> everything. Of course, there is. I mean that there is no such
> >>>>>> entity that abstracts all that socket machinery from the user's
> >>>>>> application that runs inside the container. QEMU hides all the
> >>>>>> details of the connection to the vhost backend and presents the device
> >>>>>> as a PCI device that the guest kernel wraps with a network interface.
> >>>>>> So, the application inside the VM shouldn't care that there is actually
> >>>>>> a socket connected to OVS that implements the backend and forwards
> >>>>>> traffic somewhere. For the application it's just an ordinary
> >>>>>> network interface.
> >>>>>> But in the container world, the application has to handle all of
> >>>>>> that itself by creating a virtio-user device that will connect to
> >>>>>> some socket that has OVS on the other side.
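For illustration, that typically means the application brings up a
virtio-user vdev in server mode pointing at the shared socket, along these
lines (a sketch; the socket path is made up and the exact vdev parameter
spelling should be checked against the DPDK release in use):

    #include <rte_eal.h>

    int main(void)
    {
        /* A containerized DPDK app creating a virtio-user port in server
         * mode on a socket path shared with the host; OVS connects from
         * the other side.  Path and parameters are illustrative. */
        char *argv[] = {
            "app",
            "--no-pci",
            "--vdev=virtio_user0,path=/var/run/app/vhost-user/sock0,server=1",
        };

        if (rte_eal_init(3, argv) < 0)
            return 1;

        /* ... the rest of the application then uses port 0 as a normal
         * ethdev. */
        return 0;
    }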
> >>>>>>
> >>>>>>>
> >>>>>>> Can you describe the steps used today (without the broker) for
> >>>>>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>>>>> Although my interest is in the vhost-user protocol I think it's
> >>>>>>> necessary to understand the OVS requirements here and I know little
> >>>>>>> about them.
> >>>>>>>
> >>>>>> I might describe some things wrong since I last worked with k8s and
> >>>>>> CNI plugins ~1.5 years ago, but the basic schema will look
> >>>>>> something like this:
> >>>>>>
> >>>>>> 1. The user decides to start a new pod and requests k8s to do that
> >>>>>>    via cmdline tools or some API calls.
> >>>>>>
> >>>>>> 2. The k8s scheduler looks for available resources by asking resource
> >>>>>>    manager plugins, finds an appropriate physical host and asks the
> >>>>>>    kubelet daemon local to that node to launch a new pod there.
> >>>>>>
> >>>>
> >>>> When the CNI is called, the pod has already been created, i.e. a PodID
> >>>> exists and so does an associated network namespace. Therefore, everything
> >>>> that has to do with the runtime spec, such as mountpoints or devices,
> >>>> cannot be modified by this time.
> >>>>
> >>>> That's why the Device Plugin API is used to modify the Pod's spec before
> >>>> the CNI chain is called.
> >>>>
> >>>>>> 3. kubelet asks the local CNI plugin to allocate network resources
> >>>>>>    and annotate the pod with required mount points, devices that
> >>>>>>    need to be passed in, and environment variables.
> >>>>>>    (This is, IIRC, a gRPC connection. It might be multus-cni
> >>>>>>    or kuryr-kubernetes or any other CNI plugin. The CNI plugin is
> >>>>>>    usually deployed as a system DaemonSet, so it runs in a
> >>>>>>    separate pod.)
> >>>>>>
> >>>>>> 4. Assuming that the vhost-user connection is requested in server
> >>>>>>    mode, the CNI plugin will:
> >>>>>>    4.1 create a directory for the vhost-user socket.
> >>>>>>    4.2 add this directory to the pod annotations as a mount point.
> >>>>
> >>>> I believe this is not possible; it would have to inspect the pod's spec
> >>>> or otherwise determine an existing mount point where the socket should
> >>>> be created.
> >>>
> >>> Uff. Yes, you're right. Thanks for your clarification.
> >>> I mixed up CNI and Device Plugin here.
> >>>
> >>> CNI itself is not able to annotate new resources to the pod, i.e.
> >>> create new mounts or something like this. And I don't recall any
> >>> vhost-user device plugins. Are there any? There is an SR-IOV device
> >>> plugin, but its purpose is to allocate and pass PCI devices, not create
> >>> mounts for vhost-user.
> >>>
> >>> So, IIUC, right now the user must create the directory and specify
> >>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> >>> or something like this, right?
> >>>
> >>> Looking at userspace-cni-network-plugin, it actually just parses
> >>> annotations to find the shared directory and fails if there is
> >>> none:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >>>
> >>> And the examples suggest specifying a directory to mount:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >>>
> >>> Looks like this is done by hand by the user.
> >>>
> >> Yes, I am one of the primary authors of Userspace CNI. Currently, the
> >> directory is created by hand. The long-term thought was to have a mutating
> >> webhook/admission controller inject a directory into the podspec. Not sure
> >> if it has changed, but I think when I was originally doing this work, OvS
> >> only let you choose the directory at install time, so it has to be
> >> something like /var/run/openvswitch/. You can choose the socketfile name
> >> and maybe a subdirectory off the main directory, but not the full path.
> >>
> >> One of the issues I was trying to solve was making sure ContainerA couldn't
> >> see ContainerB's socketfiles. That's where the admission controller could
> >> create a unique subdirectory for each container under
> >> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> >> always took precedence so that work never completed.
> >
> > If the CNI plugin has access to the container's network namespace, could
> > it create an abstract AF_UNIX listen socket?
> >
> > That way the application inside the container could connect to an
> > AF_UNIX socket and there is no need to manage container volumes.
> >
> > I'm not familiar with Open vSwitch, so I'm not sure if there is a
> > sane way of passing the listen socket fd into ovs-vswitchd from the CNI
> > plugin?
> >
> > The steps:
> > 1. The CNI plugin enters the container's network namespace and opens an
> >    abstract AF_UNIX listen socket.
> > 2. The CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
> >    add-port step. Instead of using type=dpdkvhostuserclient
> >    options:vhost-server-path=/tmp/dpdkvhostclient0, it creates a
> >    dpdkvhostuser server with the listen fd.
>
> For this step you will need a side channel, i.e. a separate unix socket
> created by ovs-vswitchd (most likely, created by rte_vhost on the
> rte_vhost_driver_register() call).
>
> The problem is that ovs-vsctl talks with ovsdb-server and adds the new
> port -- just a new row in the 'interface' table of the database.
> ovs-vswitchd receives the update from the database and creates the actual
> port. All the communication is done through JSONRPC, so passing fds is
> not an option.
>
> > 3. When the container starts, it connects to the abstract AF_UNIX
> > socket. The abstract socket name is provided to the container at
> > startup time in an environment variable. The name is unique, at least
> > to the pod, so that multiple containers in the pod can run vhost-user
> > applications.
>
> A few more problems with this solution:
>
> - We still want to run the application inside the container in server mode,
>   because the virtio-user PMD in client mode doesn't support re-connection.
>
> - How to get this fd again after an OVS restart? CNI will not be invoked
>   at this point to pass a new fd.
>
> - If the application closes the connection for any reason (restart, some
>   reconfiguration internal to the application) and OVS is re-started
>   at the same time, the abstract socket will be gone. A persistent daemon
>   is needed to hold it.
Okay, if there is no component that has a lifetime suitable for holding
the abstract listen socket, then using pathname AF_UNIX sockets seems
like a better approach.
Stefan