From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id EC320A0A02; Wed, 24 Mar 2021 14:11:35 +0100 (CET) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id A4A4E4067B; Wed, 24 Mar 2021 14:11:35 +0100 (CET) Received: from relay3-d.mail.gandi.net (relay3-d.mail.gandi.net [217.70.183.195]) by mails.dpdk.org (Postfix) with ESMTP id A853F4014F for ; Wed, 24 Mar 2021 14:11:34 +0100 (CET) X-Originating-IP: 78.45.89.65 Received: from [192.168.1.23] (ip-78-45-89-65.net.upcbroadband.cz [78.45.89.65]) (Authenticated sender: i.maximets@ovn.org) by relay3-d.mail.gandi.net (Postfix) with ESMTPSA id 37BE76000E; Wed, 24 Mar 2021 13:11:31 +0000 (UTC) To: Stefan Hajnoczi , Billy McFall Cc: Ilya Maximets , Adrian Moreno , Maxime Coquelin , Chenbo Xia , dev@dpdk.org, Julia Suvorova , =?UTF-8?Q?Marc-Andr=c3=a9_Lureau?= , Daniel Berrange References: <20210317202530.4145673-1-i.maximets@ovn.org> <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org> From: Ilya Maximets Message-ID: <597d1ec7-d271-dc0d-522d-b900c9cb00ea@ovn.org> Date: Wed, 24 Mar 2021 14:11:31 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user. X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On 3/24/21 1:05 PM, Stefan Hajnoczi wrote: > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote: >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets wrote: >> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote: >>>> >>>> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote: >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote: >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote: >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote: >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote: >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote: >>>>>>>>>> And some housekeeping usually required for applications in case the >>>>>>>>>> socket server terminated abnormally and socket files left on a file >>>>>>>>>> system: >>>>>>>>>> "failed to bind to vhu: Address already in use; remove it and try >>> again" >>>>>>>>> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that >>> users >>>>>>>>> might accidentally hijack an existing listen socket, but that can be >>>>>>>>> solved with a pidfile. >>>>>>>> >>>>>>>> How exactly this could be solved with a pidfile? >>>>>>> >>>>>>> A pidfile prevents two instances of the same service from running at >>> the >>>>>>> same time. >>>>>>> >>>>>>> The same effect can be achieved by the container orchestrator, >>> systemd, >>>>>>> etc too because it refuses to run the same service twice. >>>>>> >>>>>> Sure. I understand that. My point was that these could be 2 different >>>>>> applications and they might not know which process to look for. >>>>>> >>>>>>> >>>>>>>> And what if this is >>>>>>>> a different application that tries to create a socket on a same path? >>>>>>>> e.g. 
QEMU creates a socket (started in a server mode) and user >>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of >>>>>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind >>>>>>>> to an existing socket file and will fail. Subsequently port creation >>>>>>>> in OVS will fail. We can't allow OVS to unlink files because this >>>>>>>> way OVS users will have ability to unlink random sockets that OVS has >>>>>>>> access to and we also has no idea if it's a QEMU that created a file >>>>>>>> or it was a virtio-user application or someone else. >>>>>>> >>>>>>> If rte_vhost unlinks the socket then the user will find that >>> networking >>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net >>> device >>>>>>> or restart QEMU, depending on whether they need to keep the guest >>>>>>> running or not. This is a misconfiguration that is recoverable. >>>>>> >>>>>> True, it's recoverable, but with a high cost. Restart of a VM is >>> rarely >>>>>> desirable. And the application inside the guest might not feel itself >>>>>> well after hot re-plug of a device that it actively used. I'd expect >>>>>> a DPDK application that runs inside a guest on some virtio-net device >>>>>> to crash after this kind of manipulations. Especially, if it uses some >>>>>> older versions of DPDK. >>>>> >>>>> This unlink issue is probably something we think differently about. >>>>> There are many ways for users to misconfigure things when working with >>>>> system tools. If it's possible to catch misconfigurations that is >>>>> preferrable. In this case it's just the way pathname AF_UNIX domain >>>>> sockets work and IMO it's better not to have problems starting the >>>>> service due to stale files than to insist on preventing >>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be >>>>> successful, so ¯\_(ツ)_/¯. >>>>> >>>>>>> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this >>>>>>> create a security issue. I don't know the security model of OVS. >>>>>> >>>>>> In general privileges of a ovs-vswitchd daemon might be completely >>>>>> different from privileges required to invoke control utilities or >>>>>> to access the configuration database. SO, yes, we should not allow >>>>>> that. >>>>> >>>>> That can be locked down by restricting the socket path to a file beneath >>>>> /var/run/ovs/vhost-user/. >>>>> >>>>>>> >>>>>>>> There are, probably, ways to detect if there is any alive process >>> that >>>>>>>> has this socket open, but that sounds like too much for this purpose, >>>>>>>> also I'm not sure if it's possible if actual user is in a different >>>>>>>> container. >>>>>>>> So I don't see a good reliable way to detect these conditions. This >>>>>>>> falls on shoulders of a higher level management software or a user to >>>>>>>> clean these socket files up before adding ports. >>>>>>> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK >>>>>>> application? If yes, then abstract AF_UNIX sockets can be used. >>> Abstract >>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address >>>>>>> disappears when there is no process listening anymore. >>>>>> >>>>>> OVS is usually started right on the host in a main network namespace. >>>>>> In case it's started in a pod, it will run in a separate container but >>>>>> configured with a host network. Applications almost exclusively runs >>>>>> in separate pods. >>>>> >>>>> Okay. 
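A side note for anyone following the thread who hasn't used them: abstract
AF_UNIX sockets are a Linux-specific variant where the name lives in the
kernel's abstract namespace instead of on the filesystem (sun_path starts
with a NUL byte), so there is no file to unlink and the address disappears
as soon as the last listener closes it.  They are, however, scoped to a
network namespace, which is why the question above matters.  A minimal
sketch, the function name is only for illustration:

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Create an abstract-namespace AF_UNIX listen socket named 'name'. */
    static int listen_abstract(const char *name)
    {
        struct sockaddr_un addr;
        size_t len = strlen(name);
        int fd;

        if (len + 1 > sizeof addr.sun_path)
            return -1;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        /* Leading NUL byte selects the abstract namespace. */
        memcpy(addr.sun_path + 1, name, len);

        if (bind(fd, (struct sockaddr *) &addr,
                 offsetof(struct sockaddr_un, sun_path) + 1 + len) < 0
            || listen(fd, 16) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

A pathname socket, which is what rte_vhost and QEMU use today, differs only
in that sun_path holds a filesystem path, and that file is what the
unlink-before-bind and stale-socket discussion above is about.
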
>>>>> >>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by >>>>>>>>>> leveraging an infrastructure service provided by a SocketPair >>> Broker. >>>>>>>>> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the >>>>>>>>> creation of the vhost-user device backend and its use by a VMM are >>>>>>>>> closely managed by one piece of software: >>>>>>>>> >>>>>>>>> 1. Unlink the socket path. >>>>>>>>> 2. Create, bind, and listen on the socket path. >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK >>>>>>>>> RPC, spawn a process, etc) and pass in the listen fd. >>>>>>>>> 4. In the meantime the VMM can open the socket path and call >>> connect(2). >>>>>>>>> As soon as the vhost-user device backend calls accept(2) the >>>>>>>>> connection will proceed (there is no need for sleeping). >>>>>>>>> >>>>>>>>> This approach works across containers without a broker. >>>>>>>> >>>>>>>> Not sure if I fully understood a question here, but anyway. >>>>>>>> >>>>>>>> This approach works fine if you know what application to run. >>>>>>>> In case of a k8s cluster, it might be a random DPDK application >>>>>>>> with virtio-user ports running inside a container and want to >>>>>>>> have a network connection. Also, this application needs to run >>>>>>>> virtio-user in server mode, otherwise restart of the OVS will >>>>>>>> require restart of the application. So, you basically need to >>>>>>>> rely on a third-party application to create a socket with a right >>>>>>>> name and in a correct location that is shared with a host, so >>>>>>>> OVS can find it and connect. >>>>>>>> >>>>>>>> In a VM world everything is much more simple, since you have >>>>>>>> a libvirt and QEMU that will take care of all of these stuff >>>>>>>> and which are also under full control of management software >>>>>>>> and a system administrator. >>>>>>>> In case of a container with a "random" DPDK application inside >>>>>>>> there is no such entity that can help. Of course, some solution >>>>>>>> might be implemented in docker/podman daemon to create and manage >>>>>>>> outside-looking sockets for an application inside the container, >>>>>>>> but that is not available today AFAIK and I'm not sure if it >>>>>>>> ever will. >>>>>>> >>>>>>> Wait, when you say there is no entity like management software or a >>>>>>> system administrator, then how does OVS know to instantiate the new >>>>>>> port? I guess something still needs to invoke ovs-ctl add-port? >>>>>> >>>>>> I didn't mean that there is no any application that configures >>>>>> everything. Of course, there is. I mean that there is no such >>>>>> entity that abstracts all that socket machinery from the user's >>>>>> application that runs inside the container. QEMU hides all the >>>>>> details of the connection to vhost backend and presents the device >>>>>> as a PCI device with a network interface wrapping from the guest >>>>>> kernel. So, the application inside VM shouldn't care what actually >>>>>> there is a socket connected to OVS that implements backend and >>>>>> forward traffic somewhere. For the application it's just a usual >>>>>> network interface. >>>>>> But in case of a container world, application should handle all >>>>>> that by creating a virtio-user device that will connect to some >>>>>> socket, that has an OVS on the other side. >>>>>> >>>>>>> >>>>>>> Can you describe the steps used today (without the broker) for >>>>>>> instantiating a new DPDK app container and connecting it to OVS? 
>>>>>>> Although my interest is in the vhost-user protocol I think it's >>>>>>> necessary to understand the OVS requirements here and I know little >>>>>>> about them. >>>>>>>> I might describe some things wrong since I worked with k8s and CNI >>>>>> plugins last time ~1.5 years ago, but the basic schema will look >>>>>> something like this: >>>>>> >>>>>> 1. user decides to start a new pod and requests k8s to do that >>>>>> via cmdline tools or some API calls. >>>>>> >>>>>> 2. k8s scheduler looks for available resources asking resource >>>>>> manager plugins, finds an appropriate physical host and asks >>>>>> local to that node kubelet daemon to launch a new pod there. >>>>>> >>>> >>>> When the CNI is called, the pod has already been created, i.e: a PodID >>> exists >>>> and so does an associated network namespace. Therefore, everything that >>> has to >>>> do with the runtime spec such as mountpoints or devices cannot be >>> modified by >>>> this time. >>>> >>>> That's why the Device Plugin API is used to modify the Pod's spec before >>> the CNI >>>> chain is called. >>>> >>>>>> 3. kubelet asks local CNI plugin to allocate network resources >>>>>> and annotate the pod with required mount points, devices that >>>>>> needs to be passed in and environment variables. >>>>>> (this is, IIRC, a gRPC connection. It might be a multus-cni >>>>>> or kuryr-kubernetes or any other CNI plugin. CNI plugin is >>>>>> usually deployed as a system DaemonSet, so it runs in a >>>>>> separate pod. >>>>>> >>>>>> 4. Assuming that vhost-user connection requested in server mode. >>>>>> CNI plugin will: >>>>>> 4.1 create a directory for a vhost-user socket. >>>>>> 4.2 add this directory to pod annotations as a mount point. >>>> >>>> I believe this is not possible, it would have to inspect the pod's spec >>> or >>>> otherwise determine an existing mount point where the socket should be >>> created. >>> >>> Uff. Yes, you're right. Thanks for your clarification. >>> I mixed up CNI and Device Plugin here. >>> >>> CNI itself is not able to annotate new resources to the pod, i.e. >>> create new mounts or something like this. And I don't recall any >>> vhost-user device plugins. Is there any? There is an SR-IOV device >>> plugin, but its purpose is to allocate and pass PCI devices, not create >>> mounts for vhost-user. >>> >>> So, IIUC, right now user must create the directory and specify >>> a mount point in a pod spec file or pass the whole /var/run/openvswitch >>> or something like this, right? >>> >>> Looking at userspace-cni-network-plugin, it actually just parses >>> annotations to find the shared directory and fails if there is >>> no any: >>> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122 >>> >>> And examples suggests to specify a directory to mount: >>> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41 >>> >>> Looks like this is done by user's hands. >>> >>> Yes, I am one of the primary authors of Userspace CNI. Currently, the >> directory is by hand. Long term thought was to have a mutating >> webhook/admission controller inject a directory into the podspec. Not sure >> if it has changed, but I think when I was originally doing this work, OvS >> only lets you choose the directory at install time, so it has to be >> something like /var/run/openvswitch/. You can choose the socketfile name >> and maybe a subdirectory off the main directory, but not the full path. 
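For context on why the directory is constrained: with dpdkvhostuser ports
OVS itself creates and listens on the socket, under its configured socket
directory, while dpdkvhostuserclient connects to a path that somebody else
created.  Roughly, and this is only a simplified sketch of the rte_vhost
calls involved (not actual OVS code, helper name is illustrative):

    #include <stdbool.h>
    #include <rte_vhost.h>

    /* dpdkvhostuser       -> OVS acts as the vhost-user server: rte_vhost
     *                        binds and listens on the socket file itself.
     * dpdkvhostuserclient -> OVS acts as the client: it connects to a
     *                        socket created by the other side (QEMU or
     *                        virtio-user) at an arbitrary
     *                        vhost-server-path and reconnects if that
     *                        peer goes away. */
    static int register_vhost_port(const char *path, bool client_mode)
    {
        uint64_t flags = client_mode ? RTE_VHOST_USER_CLIENT : 0;

        if (rte_vhost_driver_register(path, flags) != 0)
            return -1;
        return rte_vhost_driver_start(path);
    }
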
>>
>> One of the issues I was trying to solve was making sure ContainerA couldn't
>> see ContainerB's socketfiles. That's where the admission controller could
>> create a unique subdirectory for each container under
>> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
>> always took precedence, so that work never completed.
>
> If the CNI plugin has access to the container's network namespace, could
> it create an abstract AF_UNIX listen socket?
>
> That way the application inside the container could connect to an
> AF_UNIX socket and there is no need to manage container volumes.
>
> I'm not familiar with Open vSwitch, so I'm not sure if there is a
> sane way of passing the listen socket fd into ovs-vswitchd from the CNI
> plugin?
>
> The steps:
> 1. The CNI plugin enters the container's network namespace and opens an
>    abstract AF_UNIX listen socket.
> 2. The CNI plugin passes the listen socket fd to OVS. This is the
>    ovs-vsctl add-port step. Instead of using type=dpdkvhostuserclient
>    options:vhost-server-path=/tmp/dpdkvhostclient0, it instead creates a
>    dpdkvhostuser server with the listen fd.

For this step you will need a side channel, i.e. a separate unix socket
created by ovs-vswitchd (most likely by rte_vhost in the
rte_vhost_driver_register() call).

The problem is that ovs-vsctl talks to ovsdb-server and adds the new
port as just a new row in the 'Interface' table of the database.
ovs-vswitchd then receives the update from the database and creates the
actual port.  All of this communication goes over JSON-RPC, so passing
fds is not an option.

> 3. When the container starts, it connects to the abstract AF_UNIX
>    socket. The abstract socket name is provided to the container at
>    startup time in an environment variable. The name is unique, at least
>    within the pod, so that multiple containers in the pod can run
>    vhost-user applications.

A few more problems with this solution:

- We still want to run the application inside the container in server
  mode, because the virtio-user PMD in client mode doesn't support
  re-connection.

- How do we get this fd again after an OVS restart?  The CNI plugin will
  not be invoked at that point to pass a new fd.

- If the application closes the connection for any reason (restart, some
  reconfiguration internal to the application) and OVS is restarted at
  the same time, the abstract socket will be gone.  A persistent daemon
  would be needed to hold it.

Best regards, Ilya Maximets.
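
P.S. For completeness, the side channel mentioned above would boil down to
handing the already-bound listen fd over a separate unix socket with
SCM_RIGHTS, roughly like the sketch below.  This is purely illustrative;
nothing like it exists in OVS or rte_vhost today, and the function name is
made up:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send 'listen_fd' as ancillary data over the (hypothetical) side
     * channel socket 'channel_fd'. */
    static int send_listen_fd(int channel_fd, int listen_fd)
    {
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof u);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof u.buf;

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* pass a file descriptor */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

        return sendmsg(channel_fd, &msg, 0) < 0 ? -1 : 0;
    }

The receiver would do the mirror-image recvmsg() and end up owning a
duplicate of the descriptor; as noted above, the hard part is wiring such a
channel through the ovsdb-server/JSON-RPC path, not the syscalls themselves.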