On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping is usually required for applications in case the socket server terminated abnormally and socket files were left on the file system: "failed to bind to vhu: Address already in use; remove it and try again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that users might accidentally hijack an existing listen socket, but that can be solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly could this be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at the same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator, systemd, etc. too, because it refuses to run the same service twice.
> >>>>>>
> >>>>>> Sure. I understand that. My point was that these could be 2 different applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>> And what if this is a different application that tries to create a socket on the same path? E.g. QEMU creates a socket (started in server mode) and the user accidentally created a dpdkvhostuser port in Open vSwitch instead of dpdkvhostuserclient.
> >>>>>>>> This way the rte_vhost library will try to bind to an existing socket file and will fail. Subsequently, port creation in OVS will fail. We can't allow OVS to unlink files, because that way OVS users would have the ability to unlink random sockets that OVS has access to, and we also have no idea whether it was QEMU that created the file, a virtio-user application, or someone else.
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that networking doesn't work. They can either hot-unplug the QEMU vhost-user-net device or restart QEMU, depending on whether they need to keep the guest running or not. This is a misconfiguration that is recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but at a high cost. A restart of a VM is rarely desirable. And the application inside the guest might not feel well after a hot re-plug of a device that it was actively using. I'd expect a DPDK application that runs inside a guest on some virtio-net device to crash after this kind of manipulation, especially if it uses an older version of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about. There are many ways for users to misconfigure things when working with system tools. If it's possible to catch misconfigurations, that is preferable. In this case it's just the way pathname AF_UNIX domain sockets work, and IMO it's better not to have problems starting the service due to stale files than to insist on preventing misconfigurations. QEMU and DPDK do this differently and both seem to be successful, so ¯\_(ツ)_/¯.
> >>>>>>>
> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this creates a security issue. I don't know the security model of OVS.
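As an aside, the unlink-before-bind behavior described above can be sketched in a few lines (a minimal Python illustration, not QEMU's actual code; the socket path below is made up):

```python
import os
import socket
import tempfile

def listen_unix(path):
    # QEMU-style: remove a stale socket file before binding, so a
    # previous abnormal termination doesn't cause EADDRINUSE.
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock

path = os.path.join(tempfile.mkdtemp(), "vhu.sock")  # made-up path
server = listen_unix(path)
server.close()                # simulate abnormal exit: the file stays behind
stale = os.path.exists(path)  # True: close() does not unlink the file
server = listen_unix(path)    # restart succeeds thanks to the unlink()
server.close()
```

Without the unlink() call, the second listen_unix() would fail with "Address already in use", which is exactly the housekeeping problem quoted at the top of the thread.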
> >>>>>>
> >>>>>> In general, the privileges of an ovs-vswitchd daemon might be completely different from the privileges required to invoke control utilities or to access the configuration database. So, yes, we should not allow that.
> >>>>>
> >>>>> That can be locked down by restricting the socket path to a file beneath /var/run/ovs/vhost-user/.
> >>>>>
> >>>>>>>> There are probably ways to detect whether there is any live process that has this socket open, but that sounds like too much for this purpose, and I'm not sure it's even possible if the actual user is in a different container. So I don't see a good, reliable way to detect these conditions. It falls on the shoulders of higher-level management software or the user to clean these socket files up before adding ports.
> >>>>>>>
> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK application? If yes, then abstract AF_UNIX sockets can be used. Abstract AF_UNIX sockets don't have a filesystem path, and the socket address disappears when there is no process listening anymore.
> >>>>>>
> >>>>>> OVS is usually started right on the host in the main network namespace. In case it's started in a pod, it will run in a separate container but configured with host networking. Applications almost exclusively run in separate pods.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>>>>>>> This patch set aims to eliminate most of the inconveniences by leveraging an infrastructure service provided by a SocketPair Broker.
> >>>>>>>>>
> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the creation of the vhost-user device backend and its use by a VMM are closely managed by one piece of software:
> >>>>>>>>>
> >>>>>>>>> 1. Unlink the socket path.
> >>>>>>>>> 2.
> >>>>>>>>> Create, bind, and listen on the socket path.
> >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to the DPDK/SPDK RPC, spawn a process, etc.) and pass in the listen fd.
> >>>>>>>>> 4. In the meantime the VMM can open the socket path and call connect(2). As soon as the vhost-user device backend calls accept(2) the connection will proceed (there is no need for sleeping).
> >>>>>>>>>
> >>>>>>>>> This approach works across containers without a broker.
> >>>>>>>>
> >>>>>>>> Not sure if I fully understood the question here, but anyway.
> >>>>>>>>
> >>>>>>>> This approach works fine if you know what application to run. In the case of a k8s cluster, it might be a random DPDK application with virtio-user ports running inside a container that wants to have a network connection. Also, this application needs to run virtio-user in server mode, otherwise a restart of OVS will require a restart of the application. So you basically need to rely on a third-party application to create a socket with the right name, in the correct location that is shared with the host, so OVS can find it and connect.
> >>>>>>>>
> >>>>>>>> In the VM world everything is much simpler, since you have libvirt and QEMU, which take care of all of this and are also under the full control of management software and a system administrator. In the case of a container with a "random" DPDK application inside, there is no such entity that can help. Of course, some solution might be implemented in the docker/podman daemon to create and manage outward-facing sockets for an application inside the container, but that is not available today AFAIK, and I'm not sure it ever will be.
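Steps 1-4 above can be sketched as follows (a minimal Python illustration; the socket path is made up). The point is that connect(2) succeeds as soon as the listen socket exists, and data flows once the backend calls accept(2), with no sleep/retry loop:

```python
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vhost-user.sock")  # made-up path

# Steps 1-2: unlink a stale path, then create, bind, and listen.
try:
    os.unlink(path)
except FileNotFoundError:
    pass
backend = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
backend.bind(path)
backend.listen(1)
# Step 3 would hand backend.fileno() to the real device backend here.

# Step 4: the VMM can connect immediately; the kernel queues the
# connection until the backend calls accept(2).
vmm = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
vmm.connect(path)            # succeeds before accept() is called
conn, _ = backend.accept()   # the queued connection is already there
vmm.sendall(b"ping")
reply = conn.recv(4)
```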
> >>>>>>>
> >>>>>>> Wait, when you say there is no entity like management software or a system administrator, then how does OVS know to instantiate the new port? I guess something still needs to invoke ovs-vsctl add-port?
> >>>>>>
> >>>>>> I didn't mean that there is no application at all that configures everything. Of course there is. I mean that there is no entity that abstracts all that socket machinery from the user's application running inside the container. QEMU hides all the details of the connection to the vhost backend and presents the device as a PCI device with a network-interface wrapping from the guest kernel. So the application inside the VM doesn't need to care that there is actually a socket connected to OVS, which implements the backend and forwards traffic somewhere. For the application it's just an ordinary network interface. But in the container world, the application has to handle all of that itself by creating a virtio-user device that connects to some socket that has OVS on the other side.
> >>>>>>
> >>>>>>> Can you describe the steps used today (without the broker) for instantiating a new DPDK app container and connecting it to OVS? Although my interest is in the vhost-user protocol, I think it's necessary to understand the OVS requirements here, and I know little about them.
> >>>>>>
> >>>>>> I might describe some things wrong, since I last worked with k8s and CNI plugins ~1.5 years ago, but the basic scheme will look something like this:
> >>>>>>
> >>>>>> 1. The user decides to start a new pod and requests k8s to do that via cmdline tools or some API calls.
> >>>>>>
> >>>>>> 2.
> >>>>>> The k8s scheduler looks for available resources by asking resource-manager plugins, finds an appropriate physical host, and asks the kubelet daemon local to that node to launch a new pod there.
> >>>>
> >>>> When the CNI is called, the pod has already been created, i.e. a PodID exists and so does an associated network namespace. Therefore, everything that has to do with the runtime spec, such as mount points or devices, cannot be modified by this time.
> >>>>
> >>>> That's why the Device Plugin API is used to modify the Pod's spec before the CNI chain is called.
> >>>>
> >>>>>> 3. kubelet asks the local CNI plugin to allocate network resources and annotate the pod with the required mount points, devices that need to be passed in, and environment variables. (This is, IIRC, a gRPC connection. It might be multus-cni or kuryr-kubernetes or any other CNI plugin. The CNI plugin is usually deployed as a system DaemonSet, so it runs in a separate pod.)
> >>>>>>
> >>>>>> 4. Assuming the vhost-user connection is requested in server mode, the CNI plugin will:
> >>>>>> 4.1 create a directory for the vhost-user socket.
> >>>>>> 4.2 add this directory to the pod annotations as a mount point.
> >>>>
> >>>> I believe this is not possible. It would have to inspect the pod's spec or otherwise determine an existing mount point where the socket should be created.
> >>>
> >>> Uff. Yes, you're right. Thanks for the clarification. I mixed up CNI and the Device Plugin here.
> >>>
> >>> CNI itself is not able to annotate new resources on the pod, i.e. create new mounts or anything like that. And I don't recall any vhost-user device plugins. Is there one? There is an SR-IOV device plugin, but its purpose is to allocate and pass through PCI devices, not to create mounts for vhost-user.
> >>>
> >>> So, IIUC, right now the user must create the directory and specify a mount point in the pod spec file, or pass the whole /var/run/openvswitch or something like this, right?
> >>>
> >>> Looking at userspace-cni-network-plugin, it actually just parses annotations to find the shared directory and fails if there isn't one:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >>>
> >>> And the examples suggest specifying a directory to mount:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >>>
> >>> Looks like this is done by hand by the user.
> >>
> >> Yes, I am one of the primary authors of Userspace CNI. Currently, the directory is set by hand. The long-term thought was to have a mutating webhook/admission controller inject a directory into the podspec. Not sure if it has changed, but I think when I was originally doing this work, OvS only lets you choose the directory at install time, so it has to be something like /var/run/openvswitch/. You can choose the socketfile name and maybe a subdirectory off the main directory, but not the full path.
> >>
> >> One of the issues I was trying to solve was making sure ContainerA couldn't see ContainerB's socketfiles. That's where the admission controller could create a unique subdirectory for each container under /var/run/openvswitch/. But this was more of a PoC CNI, and other work items always took precedence, so that work never completed.
> >
> > If the CNI plugin has access to the container's network namespace, could it create an abstract AF_UNIX listen socket?
> >
> > That way the application inside the container could connect to an AF_UNIX socket and there is no need to manage container volumes.
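The abstract-socket behavior suggested here can be demonstrated in a few lines (Linux-only; the socket name below is made up). The address lives in an abstract namespace, marked by a leading NUL byte, has no filesystem entry, and vanishes once no process holds the socket:

```python
import socket

NAME = b"\0vhost-user-demo"  # leading NUL byte = abstract namespace

listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(NAME)          # no file is created on the filesystem
listener.listen(1)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(NAME)
conn, _ = listener.accept()

# Once every socket is closed the address disappears automatically --
# there is no stale file to clean up after a crash.
for s in (listener, conn, client):
    s.close()
try:
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.connect(NAME)
    address_gone = False
except ConnectionRefusedError:
    address_gone = True
```

The flip side, as discussed below, is exactly that the address is tied to the lifetime of the process holding it.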
> >
> > I'm not familiar with Open vSwitch, so I'm not sure whether there is a sane way of passing the listen socket fd into ovs-vswitchd from the CNI plugin?
> >
> > The steps:
> > 1. The CNI plugin enters the container's network namespace and opens an abstract AF_UNIX listen socket.
> > 2. The CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl add-port step. Instead of using type=dpdkvhostuserclient options:vhost-server-path=/tmp/dpdkvhostclient0, it instead creates a dpdkvhostuser server with the listen fd.
>
> For this step you will need a side channel, i.e. a separate unix socket created by ovs-vswitchd (most likely created by rte_vhost on the rte_vhost_driver_register() call).
>
> The problem is that ovs-vsctl talks with ovsdb-server and adds the new port -- just a new row in the 'interface' table of the database. ovs-vswitchd receives the update from the database and creates the actual port. All the communication is done through JSON-RPC, so passing fds is not an option.
>
> > 3. When the container starts, it connects to the abstract AF_UNIX socket. The abstract socket name is provided to the container at startup time in an environment variable. The name is unique, at least to the pod, so that multiple containers in the pod can run vhost-user applications.
>
> A few more problems with this solution:
>
> - We still want to run the application inside the container in server mode, because the virtio-user PMD in client mode doesn't support re-connection.
>
> - How do we get this fd again after an OVS restart? The CNI will not be invoked at this point to pass a new fd.
>
> - If the application closes the connection for any reason (a restart, some reconfiguration internal to the application) while OVS is restarted at the same time, the abstract socket will be gone. A persistent daemon is needed to hold it.
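For reference, the side channel mentioned above would have to move the descriptor as SCM_RIGHTS ancillary data over a UNIX socket, roughly like this (a generic sketch of fd passing, not OVS or rte_vhost code):

```python
import array
import os
import socket

def send_fd(sock, fd):
    # One byte of normal data plus the fd as SCM_RIGHTS ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(4))
    level, ctype, data = ancdata[0]
    fds = array.array("i")
    fds.frombytes(data[:4])
    return fds[0]

# Simulate the CNI plugin handing a descriptor to the switch daemon.
plugin_side, daemon_side = socket.socketpair(socket.AF_UNIX,
                                             socket.SOCK_STREAM)
read_end, write_end = os.pipe()
send_fd(plugin_side, write_end)
received_fd = recv_fd(daemon_side)

# The received descriptor refers to the same open file description.
os.write(received_fd, b"hello")
payload = os.read(read_end, 5)
```

The JSON-RPC channel between ovs-vsctl, ovsdb-server, and ovs-vswitchd carries only serialized text, which is why ancillary data like this cannot flow through it and a dedicated UNIX-socket side channel would be required.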
Okay, if there is no component that has a lifetime suitable for holding the abstract listen socket, then using pathname AF_UNIX sockets seems like a better approach.

Stefan