From: Ilya Maximets <i.maximets@ovn.org>
To: Adrian Moreno, Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev@dpdk.org, Julia Suvorova,
 Marc-André Lureau, Daniel Berrange, Billy McFall
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
Date: Tue, 23 Mar 2021 19:27:42 +0100
Message-ID: <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>
References: <20210317202530.4145673-1-i.maximets@ovn.org>

On 3/23/21 6:57 PM, Adrian Moreno wrote:
> 
> 
> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>> And some housekeeping is usually required for applications in case
>>>>>>> the socket server terminated abnormally and socket files were left
>>>>>>> on the file system:
>>>>>>>   "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>>
>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
>>>>>> users might accidentally hijack an existing listen socket, but that
>>>>>> can be solved with a pidfile.
>>>>>
>>>>> How exactly could this be solved with a pidfile?
>>>>
>>>> A pidfile prevents two instances of the same service from running at
>>>> the same time.
>>>>
>>>> The same effect can be achieved by the container orchestrator, systemd,
>>>> etc. too, because it refuses to run the same service twice.
>>>
>>> Sure. I understand that. My point was that these could be 2 different
>>> applications and they might not know which process to look for.
>>>
>>>>
>>>>> And what if this is a different application that tries to create a
>>>>> socket on the same path? E.g. QEMU creates a socket (started in
>>>>> server mode) and the user accidentally created a dpdkvhostuser port
>>>>> in Open vSwitch instead of dpdkvhostuserclient. This way the
>>>>> rte_vhost library will try to bind to an existing socket file and
>>>>> will fail. Subsequently, port creation in OVS will fail. We can't
>>>>> allow OVS to unlink files, because this way OVS users would have the
>>>>> ability to unlink random sockets that OVS has access to, and we also
>>>>> have no idea whether it was QEMU that created the file, a virtio-user
>>>>> application or someone else.
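
For readers who haven't seen it, the "unlink before bind" pattern Stefan
mentions above looks roughly like the sketch below. This is only a minimal
illustration (not QEMU's actual code), with error handling kept to a minimum:

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* "Unlink before bind" for a pathname AF_UNIX listen socket: removing a
 * stale socket file left behind by a previous run avoids "Address already
 * in use", at the cost of silently stealing the path if another live
 * process is still listening on it. */
int listen_unix(const char *path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);

    unlink(path);   /* Ignore errors; the file may simply not exist. */

    if (bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}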
>>>>
>>>> If rte_vhost unlinks the socket, then the user will find that networking
>>>> doesn't work. They can either hot-unplug the QEMU vhost-user-net device
>>>> or restart QEMU, depending on whether they need to keep the guest
>>>> running or not. This is a misconfiguration that is recoverable.
>>>
>>> True, it's recoverable, but at a high cost. Restarting a VM is rarely
>>> desirable. And the application inside the guest might not behave well
>>> after a hot re-plug of a device that it was actively using. I'd expect
>>> a DPDK application that runs inside a guest on some virtio-net device
>>> to crash after this kind of manipulation, especially if it uses an
>>> older version of DPDK.
>>
>> This unlink issue is probably something we think differently about.
>> There are many ways for users to misconfigure things when working with
>> system tools. If it's possible to catch misconfigurations, that is
>> preferable. In this case it's just the way pathname AF_UNIX domain
>> sockets work, and IMO it's better not to have problems starting the
>> service due to stale files than to insist on preventing
>> misconfigurations. QEMU and DPDK do this differently and both seem to
>> be successful, so ¯\_(ツ)_/¯.
>>
>>>>
>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>> creates a security issue. I don't know the security model of OVS.
>>>
>>> In general, the privileges of the ovs-vswitchd daemon might be
>>> completely different from the privileges required to invoke control
>>> utilities or to access the configuration database. So, yes, we should
>>> not allow that.
>>
>> That can be locked down by restricting the socket path to a file beneath
>> /var/run/ovs/vhost-user/.
>>
>>>>
>>>>> There are probably ways to detect whether there is any live process
>>>>> that has this socket open, but that sounds like too much for this
>>>>> purpose, and I'm also not sure it's possible if the actual user is in
>>>>> a different container. So I don't see a good, reliable way to detect
>>>>> these conditions. It falls on the shoulders of higher-level
>>>>> management software or the user to clean these socket files up before
>>>>> adding ports.
>>>>
>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>> application? If yes, then abstract AF_UNIX sockets can be used.
>>>> Abstract AF_UNIX sockets don't have a filesystem path and the socket
>>>> address disappears when there is no process listening anymore.
>>>
>>> OVS is usually started right on the host in the main network namespace.
>>> In case it's started in a pod, it will run in a separate container but
>>> configured with host networking. Applications almost exclusively run
>>> in separate pods.
>>
>> Okay.
>>
>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>>
>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>> closely managed by one piece of software:
>>>>>>
>>>>>> 1. Unlink the socket path.
>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>    RPC, spawn a process, etc.) and pass in the listen fd.
>>>>>> 4. In the meantime the VMM can open the socket path and call
>>>>>>    connect(2). As soon as the vhost-user device backend calls
>>>>>>    accept(2) the connection will proceed (there is no need for
>>>>>>    sleeping).
>>>>>>
>>>>>> This approach works across containers without a broker.
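
As an aside, step 3 above (handing a pre-opened listen fd to the backend) is
ordinary SCM_RIGHTS file-descriptor passing over a UNIX socket. A minimal
sketch of the sending side, assuming an already-established control
connection 'ctrl_sock' to the backend (an illustration only, not code from
any of the projects discussed here):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass an already-bound-and-listening fd to another process over an
 * existing AF_UNIX control connection.  The receiver gets the fd back
 * with recvmsg() and CMSG_DATA() on its side. */
int send_listen_fd(int ctrl_sock, int listen_fd)
{
    char byte = 0;                      /* sendmsg() needs at least 1 byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof ctrl,
    };
    struct cmsghdr *cmsg;

    memset(ctrl, 0, sizeof ctrl);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

    return sendmsg(ctrl_sock, &msg, 0) < 0 ? -1 : 0;
}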
>>>>>
>>>>> Not sure if I fully understood the question here, but anyway.
>>>>>
>>>>> This approach works fine if you know what application to run. In the
>>>>> case of a k8s cluster, it might be a random DPDK application with
>>>>> virtio-user ports running inside a container that wants to have a
>>>>> network connection. Also, this application needs to run virtio-user
>>>>> in server mode, otherwise a restart of OVS will require a restart of
>>>>> the application. So, you basically need to rely on a third-party
>>>>> application to create a socket with the right name and in the correct
>>>>> location that is shared with the host, so that OVS can find it and
>>>>> connect.
>>>>>
>>>>> In the VM world everything is much simpler, since you have libvirt
>>>>> and QEMU that take care of all of this and which are also under full
>>>>> control of the management software and the system administrator.
>>>>> In the case of a container with a "random" DPDK application inside,
>>>>> there is no such entity that can help. Of course, some solution might
>>>>> be implemented in the docker/podman daemon to create and manage
>>>>> outward-facing sockets for an application inside the container, but
>>>>> that is not available today AFAIK, and I'm not sure it ever will be.
>>>>
>>>> Wait, when you say there is no entity like management software or a
>>>> system administrator, then how does OVS know to instantiate the new
>>>> port? I guess something still needs to invoke ovs-vsctl add-port?
>>>
>>> I didn't mean that there is no application that configures everything.
>>> Of course there is. I mean that there is no such entity that abstracts
>>> all that socket machinery from the user's application that runs inside
>>> the container. QEMU hides all the details of the connection to the
>>> vhost backend and presents the device to the guest as a PCI device that
>>> the guest kernel wraps with a network interface. So, the application
>>> inside the VM doesn't need to care that there is actually a socket
>>> connected to OVS that implements the backend and forwards traffic
>>> somewhere. For the application it's just a usual network interface.
>>> But in the container world, the application has to handle all of that
>>> itself, by creating a virtio-user device that connects to some socket
>>> with OVS on the other side.
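
To make that concrete, a containerized DPDK application would typically
bring up such a port by passing a virtio-user vdev in server mode to the
EAL, roughly as sketched below. The socket path is a made-up placeholder
and has to match whatever path the OVS side was configured with:

#include <rte_eal.h>

/* A containerized application creating a virtio-user port in server mode,
 * so that OVS (with a dpdkvhostuserclient port on the other side) can keep
 * re-connecting to it.  The socket path below is only a placeholder. */
int main(void)
{
    char *eal_argv[] = {
        "app", "--in-memory", "--no-pci",
        "--vdev=virtio_user0,path=/var/run/app/vhu0.sock,server=1,queues=1",
    };

    if (rte_eal_init(4, eal_argv) < 0)
        return 1;

    /* ...the usual rte_eth_dev configure/start calls and rx/tx loop... */
    return 0;
}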
>>>
>>>> Can you describe the steps used today (without the broker) for
>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>> Although my interest is in the vhost-user protocol, I think it's
>>>> necessary to understand the OVS requirements here, and I know little
>>>> about them.
>>>
>>> I might describe some things wrong since the last time I worked with
>>> k8s and CNI plugins was ~1.5 years ago, but the basic scheme looks
>>> something like this:
>>>
>>> 1. The user decides to start a new pod and requests k8s to do that
>>>    via cmdline tools or some API calls.
>>>
>>> 2. The k8s scheduler looks for available resources by asking resource
>>>    manager plugins, finds an appropriate physical host and asks the
>>>    kubelet daemon local to that node to launch a new pod there.
>>>
> 
> When the CNI is called, the pod has already been created, i.e. a PodID
> exists and so does an associated network namespace. Therefore, everything
> that has to do with the runtime spec, such as mount points or devices,
> cannot be modified at this point.
> 
> That's why the Device Plugin API is used to modify the pod's spec before
> the CNI chain is called.
> 
>>> 3. kubelet asks the local CNI plugin to allocate network resources
>>>    and annotate the pod with the required mount points, devices that
>>>    need to be passed in, and environment variables.
>>>    (This is, IIRC, a gRPC connection. It might be multus-cni or
>>>    kuryr-kubernetes or any other CNI plugin. The CNI plugin is usually
>>>    deployed as a system DaemonSet, so it runs in a separate pod.)
>>>
>>> 4. Assuming that the vhost-user connection is requested in server mode,
>>>    the CNI plugin will:
>>>    4.1 create a directory for the vhost-user socket.
>>>    4.2 add this directory to the pod annotations as a mount point.
> 
> I believe this is not possible; it would have to inspect the pod's spec or
> otherwise determine an existing mount point where the socket should be
> created.

Uff. Yes, you're right. Thanks for the clarification. I mixed up CNI and
Device Plugin here. CNI itself is not able to annotate new resources to the
pod, i.e. create new mounts or something like that.

And I don't recall any vhost-user device plugins. Are there any? There is
an SR-IOV device plugin, but its purpose is to allocate and pass PCI
devices, not to create mounts for vhost-user.

So, IIUC, right now the user must create the directory and specify a mount
point in the pod spec file, or pass the whole /var/run/openvswitch or
something like that, right?

Looking at userspace-cni-network-plugin, it actually just parses annotations
to find the shared directory and fails if there isn't one:
https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122

And the examples suggest specifying a directory to mount:
https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41

Looks like this is done by hand by the user.

> 
> +Billy might give more insights on this.
> 
>>>    4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or by
>>>        connecting to ovsdb-server via JSON-RPC directly.
>>>        It will set the port type to dpdkvhostuserclient and specify
>>>        the socket path as a path inside the directory it created.
>>>        (OVS will create the port and rte_vhost will enter the
>>>        re-connection loop, since the socket does not exist yet.)
>>>    4.4 Set the socket file location as an environment variable in the
>>>        pod annotations.
>>>    4.5 report success to kubelet.
>>>
> 
> Since the CNI cannot modify the pod's mounts, it has to rely on a Device
> Plugin or other external entity that can inject the mount point before the
> pod is created.
> 
> However, there is another use case that might be relevant: dynamic
> attachment of network interfaces. In this case the CNI cannot work in
> collaboration with a Device Plugin or "mount-point injector", and an
> existing mount point has to be used. Also, some form of notification
> mechanism has to exist to tell the workload a new socket is ready.
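
For reference, the dpdkvhostuserclient behaviour described in 4.3 maps to
rte_vhost's client mode with its built-in reconnection; inside OVS it boils
down to roughly the following (a simplified sketch that omits the device
callbacks a real backend registers):

#include <rte_vhost.h>

/* What a dpdkvhostuserclient port roughly boils down to: register the
 * socket path in client mode and let rte_vhost's built-in re-connection
 * loop wait until the application in the container actually creates the
 * socket. */
int add_vhost_client_port(const char *sock_path)
{
    /* RTE_VHOST_USER_CLIENT: connect() to the socket instead of creating
     * it; reconnection stays enabled unless RTE_VHOST_USER_NO_RECONNECT
     * is also passed. */
    if (rte_vhost_driver_register(sock_path, RTE_VHOST_USER_CLIENT) != 0)
        return -1;

    /* A real backend also registers its device callbacks
     * (new_device/destroy_device, etc.) before starting the driver. */
    return rte_vhost_driver_start(sock_path);
}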
>>> 5. kubelet will finish all other preparations and resource
>>>    allocations and will ask docker/podman to start a container
>>>    with all the mount points, devices and environment variables from
>>>    the pod annotation.
>>>
>>> 6. docker/podman starts the container.
>>>    It's worth mentioning here that in many cases the initial process of
>>>    a container is not the actual application that will use the
>>>    vhost-user connection, but likely a shell that will invoke the
>>>    actual application.
>>>
>>> 7. The application starts inside the container, checks the environment
>>>    variables (actually, checking the environment variables usually
>>>    happens in a shell script that invokes the application with the
>>>    correct arguments) and creates a net_virtio_user port in server
>>>    mode. At this point the socket file will be created.
>>>    (Since we're running a third-party application inside the container,
>>>    we can only assume that it will do what is written here; it's the
>>>    responsibility of the application developer to do the right thing.)
>>>
>>> 8. OVS successfully re-connects to the newly created socket in the
>>>    shared directory and the vhost-user protocol establishes network
>>>    connectivity.
>>>
>>> As you can see, there are way too many entities and communication
>>> methods involved. So, passing a pre-opened file descriptor from the
>>> CNI all the way down to the application is not as easy as it is in the
>>> case of QEMU + libvirt.
>>
>> File descriptor passing isn't necessary if OVS owns the listen socket
>> and the application container is the one who connects. That's why I
>> asked why dpdkvhostuser was deprecated in another email. The benefit of
>> doing this would be that the application container can instantly connect
>> to OVS without a sleep loop.
>>
>> I still don't get the attraction of the broker idea. The pros:
>> + Overcomes the issue with stale UNIX domain socket files
>> + Eliminates the re-connect sleep loop
>>
>> Neutral:
>> * vhost-user UNIX domain socket directory container volume is replaced
>>   by broker UNIX domain socket bind mount
>> * UNIX domain socket naming conflicts become broker key naming conflicts
>>
>> The cons:
>> - Requires running a new service on the host with potential security
>>   issues
>> - Requires support in third-party applications, QEMU, and DPDK/OVS
>> - The old code must be kept for compatibility with non-broker
>>   configurations, especially since third-party applications may not
>>   support the broker. Developers and users will have to learn about both
>>   options and decide which one to use.
>>
>> This seems like a modest improvement for the complexity and effort
>> involved. The same pros can be achieved by:
>> * Adding unlink(2) to rte_vhost (or applications can add rm -f
>>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
>>   that it doesn't catch a misconfiguration where the user launches two
>>   processes with the same socket path.
>> * Reversing the direction of the client/server relationship to
>>   eliminate the re-connect sleep loop at startup. I'm unsure whether
>>   this is possible.
>>
>> That said, the broker idea doesn't affect the vhost-user protocol itself
>> and is more of an OVS/DPDK topic. I may just not be familiar enough with
>> OVS/DPDK to understand the benefits of the approach.
>>
>> Stefan
>>
> 
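
To illustrate the unlink(2) alternative mentioned above: an application (or
its entrypoint script) can already do the equivalent of 'rm -f' itself
before registering a server-mode vhost-user socket. A rough sketch, with
the same caveat that it cannot distinguish a stale file from a socket that
another live process still owns:

#include <unistd.h>
#include <rte_vhost.h>

/* The "just remove the stale file first" workaround for a server-mode
 * vhost-user socket.  It avoids "Address already in use" after an unclean
 * shutdown, but it cannot tell a stale file apart from a socket that
 * another live process still owns. */
int register_vhost_server(const char *sock_path)
{
    unlink(sock_path);          /* ignore errors such as ENOENT */

    if (rte_vhost_driver_register(sock_path, 0) != 0)
        return -1;

    return rte_vhost_driver_start(sock_path);
}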