Date: Wed, 24 Mar 2021 12:05:25 +0000
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Billy McFall
Cc: Ilya Maximets, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev@dpdk.org, Julia Suvorova, Marc-André Lureau, Daniel Berrange
References: <20210317202530.4145673-1-i.maximets@ovn.org> <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets wrote:
>
> > On 3/23/21 6:57 PM, Adrian Moreno wrote:
> > > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> > >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> > >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> > >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > >>>>>>> And some housekeeping is usually required for applications in case the socket server terminated abnormally and socket files were left on the file system:
> > >>>>>>> "failed to bind to vhu: Address already in use; remove it and try again"
> > >>>>>>
> > >>>>>> QEMU avoids this by unlinking before binding. The drawback is that users might accidentally hijack an existing listen socket, but that can be solved with a pidfile.
> > >>>>>
> > >>>>> How exactly could this be solved with a pidfile?
> > >>>>
> > >>>> A pidfile prevents two instances of the same service from running at the same time.
> > >>>>
> > >>>> The same effect can be achieved by the container orchestrator, systemd, etc. too, because it refuses to run the same service twice.
> > >>>
> > >>> Sure. I understand that. My point was that these could be two different applications and they might not know which process to look for.
> > >>>
> > >>>>> And what if this is a different application that tries to create a socket on the same path? E.g. QEMU creates a socket (started in server mode) and the user accidentally creates a dpdkvhostuser port in Open vSwitch instead of dpdkvhostuserclient. This way the rte_vhost library will try to bind to an existing socket file and will fail. Subsequently, port creation in OVS will fail. We can't allow OVS to unlink files, because this way OVS users would have the ability to unlink random sockets that OVS has access to, and we also have no idea whether it was QEMU that created the file, a virtio-user application, or someone else.
> > >>>>
> > >>>> If rte_vhost unlinks the socket then the user will find that networking doesn't work. They can either hot unplug the QEMU vhost-user-net device or restart QEMU, depending on whether they need to keep the guest running or not. This is a misconfiguration that is recoverable.
> > >>>
> > >>> True, it's recoverable, but at a high cost. Restarting a VM is rarely desirable. And the application inside the guest might not cope well with a hot re-plug of a device that it was actively using. I'd expect a DPDK application that runs inside a guest on some virtio-net device to crash after this kind of manipulation, especially if it uses an older version of DPDK.
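For reference, the unlink-before-bind pattern under discussion boils down to roughly the following minimal C sketch (not QEMU's or rte_vhost's actual code; the path is only an example and the pidfile check is left out):

  /* Listen on a pathname AF_UNIX socket, removing any stale file first.
   * A real server would also hold a pidfile or lock to avoid hijacking
   * a socket that is still in use by a live process. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int listen_unix(const char *path)           /* e.g. "/tmp/vhu0.sock" */
  {
      struct sockaddr_un addr;
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);

      if (fd < 0)
          return -1;
      memset(&addr, 0, sizeof(addr));
      addr.sun_family = AF_UNIX;
      strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

      unlink(path);                           /* remove a stale socket file, if any */
      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
          listen(fd, 1) < 0) {
          close(fd);
          return -1;
      }
      return fd;                              /* caller accept(2)s connections */
  }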
> > >> This unlink issue is probably something we think differently about. There are many ways for users to misconfigure things when working with system tools. If it's possible to catch misconfigurations, that is preferable. In this case it's just the way pathname AF_UNIX domain sockets work, and IMO it's better not to have problems starting the service due to stale files than to insist on preventing misconfigurations. QEMU and DPDK do this differently and both seem to be successful, so ¯\_(ツ)_/¯.
> > >>
> > >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this creates a security issue. I don't know the security model of OVS.
> > >>>
> > >>> In general the privileges of the ovs-vswitchd daemon might be completely different from the privileges required to invoke control utilities or to access the configuration database. So, yes, we should not allow that.
> > >>
> > >> That can be locked down by restricting the socket path to a file beneath /var/run/ovs/vhost-user/.
> > >>
> > >>>>> There are probably ways to detect whether there is any live process that has this socket open, but that sounds like too much for this purpose, and I'm also not sure it's possible if the actual user is in a different container. So I don't see a good, reliable way to detect these conditions. It falls on the shoulders of higher-level management software or the user to clean these socket files up before adding ports.
> > >>>>
> > >>>> Does OVS always run in the same net namespace (pod) as the DPDK application? If yes, then abstract AF_UNIX sockets can be used. Abstract AF_UNIX sockets don't have a filesystem path and the socket address disappears when there is no process listening anymore.
> > >>>
> > >>> OVS is usually started right on the host in the main network namespace. In case it's started in a pod, it will run in a separate container but configured with the host network. Applications almost exclusively run in separate pods.
> > >>
> > >> Okay.
> > >>
> > >>>>>>> This patch-set aims to eliminate most of the inconveniences by leveraging an infrastructure service provided by a SocketPair Broker.
> > >>>>>>
> > >>>>>> I don't understand yet why this is useful for vhost-user, where the creation of the vhost-user device backend and its use by a VMM are closely managed by one piece of software:
> > >>>>>>
> > >>>>>> 1. Unlink the socket path.
> > >>>>>> 2. Create, bind, and listen on the socket path.
> > >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK RPC, spawn a process, etc.) and pass in the listen fd.
> > >>>>>> 4. In the meantime the VMM can open the socket path and call connect(2). As soon as the vhost-user device backend calls accept(2) the connection will proceed (there is no need for sleeping).
> > >>>>>>
> > >>>>>> This approach works across containers without a broker.
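To make the abstract-socket idea concrete, here is a minimal, Linux-only sketch (the name and error handling are illustrative; this is not code from QEMU, DPDK, or OVS). Because the address lives in the abstract namespace rather than on the filesystem, it disappears as soon as the last listener closes, so there is never a stale file to unlink:

  /* Listen on an abstract AF_UNIX socket (Linux-only). The leading NUL
   * byte in sun_path puts the name in the abstract namespace, so nothing
   * ever appears on the filesystem. */
  #include <stddef.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int listen_abstract(const char *name)       /* e.g. "vhost-user-net0" */
  {
      struct sockaddr_un addr;
      socklen_t len;
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);

      if (fd < 0)
          return -1;
      memset(&addr, 0, sizeof(addr));          /* sun_path[0] stays '\0' */
      addr.sun_family = AF_UNIX;
      strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
      len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);

      if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 1) < 0) {
          close(fd);
          return -1;
      }
      return fd;                               /* address vanishes once fd is closed */
  }

A client in the same network namespace connects with the same NUL-prefixed address.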
> > >>>>> Not sure if I fully understood the question here, but anyway.
> > >>>>>
> > >>>>> This approach works fine if you know what application to run. In the case of a k8s cluster, it might be a random DPDK application with virtio-user ports running inside a container that wants to have a network connection. Also, this application needs to run virtio-user in server mode, otherwise a restart of OVS will require a restart of the application. So you basically need to rely on a third-party application to create a socket with the right name and in the correct location that is shared with the host, so OVS can find it and connect.
> > >>>>>
> > >>>>> In the VM world everything is much simpler, since you have libvirt and QEMU that take care of all of this and that are also under the full control of the management software and the system administrator. In the case of a container with a "random" DPDK application inside, there is no such entity that can help. Of course, some solution might be implemented in the docker/podman daemon to create and manage outside-looking sockets for an application inside the container, but that is not available today AFAIK and I'm not sure it ever will be.
> > >>>>
> > >>>> Wait, when you say there is no entity like management software or a system administrator, then how does OVS know to instantiate the new port? I guess something still needs to invoke ovs-ctl add-port?
> > >>>
> > >>> I didn't mean that there is no application that configures everything. Of course there is. I mean that there is no entity that abstracts all that socket machinery from the user's application running inside the container. QEMU hides all the details of the connection to the vhost backend and presents the device as a PCI device with a network-interface wrapping from the guest kernel. So the application inside the VM shouldn't care that there is actually a socket connected to OVS that implements the backend and forwards traffic somewhere. For the application it's just a usual network interface.
> > >>> But in the container world, the application has to handle all of that itself by creating a virtio-user device that connects to some socket that has OVS on the other side.
> > >>>
> > >>>> Can you describe the steps used today (without the broker) for instantiating a new DPDK app container and connecting it to OVS? Although my interest is in the vhost-user protocol, I think it's necessary to understand the OVS requirements here and I know little about them.
> > >>>
> > >>> I might describe some things wrong since I last worked with k8s and CNI plugins ~1.5 years ago, but the basic schema will look something like this:
> > >>>
> > >>> 1. The user decides to start a new pod and requests k8s to do that via cmdline tools or some API calls.
> > >>>
> > >>> 2. The k8s scheduler looks for available resources by asking resource manager plugins, finds an appropriate physical host and asks the kubelet daemon local to that node to launch a new pod there.
> > >
> > > When the CNI is called, the pod has already been created, i.e. a PodID exists and so does an associated network namespace. Therefore, everything that has to do with the runtime spec, such as mountpoints or devices, cannot be modified by this time.
> > >
> > > That's why the Device Plugin API is used to modify the Pod's spec before the CNI chain is called.
> > >>> 3. The kubelet asks the local CNI plugin to allocate network resources and annotate the pod with required mount points, devices that need to be passed in, and environment variables. (This is, IIRC, a gRPC connection. It might be multus-cni or kuryr-kubernetes or any other CNI plugin. The CNI plugin is usually deployed as a system DaemonSet, so it runs in a separate pod.)
> > >>>
> > >>> 4. Assuming that the vhost-user connection is requested in server mode, the CNI plugin will:
> > >>> 4.1 create a directory for the vhost-user socket.
> > >>> 4.2 add this directory to the pod annotations as a mount point.
> > >
> > > I believe this is not possible; it would have to inspect the pod's spec or otherwise determine an existing mount point where the socket should be created.
> >
> > Uff. Yes, you're right. Thanks for the clarification. I mixed up CNI and Device Plugin here.
> >
> > CNI itself is not able to annotate new resources to the pod, i.e. create new mounts or something like that. And I don't recall any vhost-user device plugins. Is there any? There is an SR-IOV device plugin, but its purpose is to allocate and pass PCI devices, not to create mounts for vhost-user.
> >
> > So, IIUC, right now the user must create the directory and specify a mount point in the pod spec file, or pass the whole /var/run/openvswitch or something like this, right?
> >
> > Looking at userspace-cni-network-plugin, it actually just parses annotations to find the shared directory and fails if there isn't one:
> >
> > https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >
> > And the examples suggest specifying a directory to mount:
> >
> > https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >
> > Looks like this is done by hand by the user.
>
> Yes, I am one of the primary authors of Userspace CNI. Currently, the directory is set up by hand. The long-term thought was to have a mutating webhook/admission controller inject a directory into the podspec. Not sure if it has changed, but I think when I was originally doing this work, OvS only lets you choose the directory at install time, so it has to be something like /var/run/openvswitch/. You can choose the socket file name and maybe a subdirectory off the main directory, but not the full path.
>
> One of the issues I was trying to solve was making sure ContainerA couldn't see ContainerB's socket files. That's where the admission controller could create a unique subdirectory for each container under /var/run/openvswitch/. But this was more of a PoC CNI and other work items always took precedence, so that work never completed.

If the CNI plugin has access to the container's network namespace, could it create an abstract AF_UNIX listen socket? That way the application inside the container could connect to an AF_UNIX socket and there is no need to manage container volumes.

I'm not familiar with Open vSwitch, so I'm not sure whether there is a sane way of passing the listen socket fd into ovs-vswitchd from the CNI plugin?

The steps:

1. The CNI plugin enters the container's network namespace and opens an abstract AF_UNIX listen socket.

2. The CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl add-port step. Instead of using type=dpdkvhostuserclient options:vhost-server-path=/tmp/dpdkvhostclient0, it would instead create a dpdkvhostuser server with the listen fd (a rough sketch follows below).
3. When the container starts, it connects to the abstract AF_UNIX socket. The abstract socket name is provided to the container at startup time in an environment variable. The name is unique, at least within the pod, so that multiple containers in the pod can run vhost-user applications.

Stefan
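For illustration only, a rough sketch of how steps 1 and 2 might look in C. It assumes Linux, a known netns path for the pod, and a hypothetical control socket on which ovs-vswitchd would accept a listen fd; whether OVS can actually take an fd this way is exactly the open question above, so netns_path and ctl_sock are illustrative, not real OVS or CNI interfaces:

  /* Step 1: enter the container's network namespace and open an abstract
   * AF_UNIX listen socket there (abstract names are scoped per netns).
   * A real plugin would do this in a scratch thread or restore its
   * original namespace afterwards. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stddef.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <sys/un.h>
  #include <unistd.h>

  static int listen_abstract_in_netns(const char *netns_path, const char *name)
  {
      struct sockaddr_un addr;
      socklen_t len;
      int nsfd, fd;

      nsfd = open(netns_path, O_RDONLY);       /* e.g. a /proc/<pid>/ns/net path */
      if (nsfd < 0 || setns(nsfd, CLONE_NEWNET) < 0)
          return -1;
      close(nsfd);

      fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0)
          return -1;
      memset(&addr, 0, sizeof(addr));          /* sun_path[0] stays '\0' (abstract) */
      addr.sun_family = AF_UNIX;
      strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
      len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
      if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 16) < 0)
          return -1;
      return fd;
  }

  /* Step 2: hand the listen fd to OVS over a hypothetical control socket
   * using SCM_RIGHTS; such an interface does not exist in OVS today. */
  static int send_listen_fd(int ctl_sock, int listen_fd)
  {
      char byte = 0;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));
      return sendmsg(ctl_sock, &msg, 0) == 1 ? 0 : -1;
  }

For step 3, the application inside the container would fill in a sockaddr_un the same way (NUL-prefixed name taken from the environment variable) and call connect(2) on it.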