From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ilya Maximets
Cc: Billy McFall, Adrian Moreno, Maxime Coquelin, Chenbo Xia, dev@dpdk.org,
 Julia Suvorova, Marc-André Lureau, Daniel Berrange
Date: Wed, 24 Mar 2021 15:07:27 +0000
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
In-Reply-To: <597d1ec7-d271-dc0d-522d-b900c9cb00ea@ovn.org>
References: <53dd4b66-9e44-01c3-9f9a-b37dcadb14b7@ovn.org>
 <597d1ec7-d271-dc0d-522d-b900c9cb00ea@ovn.org>
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>>>>> system:
> >>>>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
> >>>>>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>>>>> solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly this could be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at the
> >>>>>>> same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator, systemd,
> >>>>>>> etc too because it refuses to run the same service twice.
> >>>>>>
> >>>>>> Sure. I understand that. My point was that these could be 2 different
> >>>>>> applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>
> >>>>>>>> And what if this is
> >>>>>>>> a different application that tries to create a socket on a same path?
> >>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>>>>> dpdkvhostuserclient. This way rte_vhost library will try to bind
> >>>>>>>> to an existing socket file and will fail. Subsequently port creation
> >>>>>>>> in OVS will fail. We can't allow OVS to unlink files because this
> >>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
> >>>>>>>> access to and we also has no idea if it's a QEMU that created a file
> >>>>>>>> or it was a virtio-user application or someone else.
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that networking
> >>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
> >>>>>>> or restart QEMU, depending on whether they need to keep the guest
> >>>>>>> running or not. This is a misconfiguration that is recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but with a high cost. Restart of a VM is rarely
> >>>>>> desirable. And the application inside the guest might not feel itself
> >>>>>> well after hot re-plug of a device that it actively used. I'd expect
> >>>>>> a DPDK application that runs inside a guest on some virtio-net device
> >>>>>> to crash after this kind of manipulations.
> >>>>>> Especially, if it uses some older versions of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about.
> >>>>> There are many ways for users to misconfigure things when working with
> >>>>> system tools. If it's possible to catch misconfigurations that is
> >>>>> preferrable. In this case it's just the way pathname AF_UNIX domain
> >>>>> sockets work and IMO it's better not to have problems starting the
> >>>>> service due to stale files than to insist on preventing
> >>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >>>>> successful, so ¯\_(ツ)_/¯.
> >>>>>
> >>>>>>>
> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>>>>> create a security issue. I don't know the security model of OVS.
> >>>>>>
> >>>>>> In general privileges of a ovs-vswitchd daemon might be completely
> >>>>>> different from privileges required to invoke control utilities or
> >>>>>> to access the configuration database. SO, yes, we should not allow
> >>>>>> that.
> >>>>>
> >>>>> That can be locked down by restricting the socket path to a file beneath
> >>>>> /var/run/ovs/vhost-user/.
> >>>>>
> >>>>>>>
> >>>>>>>> There are, probably, ways to detect if there is any alive process that
> >>>>>>>> has this socket open, but that sounds like too much for this purpose,
> >>>>>>>> also I'm not sure if it's possible if actual user is in a different
> >>>>>>>> container.
> >>>>>>>> So I don't see a good reliable way to detect these conditions. This
> >>>>>>>> falls on shoulders of a higher level management software or a user to
> >>>>>>>> clean these socket files up before adding ports.
> >>>>>>>
> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>>>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
> >>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address
> >>>>>>> disappears when there is no process listening anymore.
> >>>>>>
> >>>>>> OVS is usually started right on the host in a main network namespace.
> >>>>>> In case it's started in a pod, it will run in a separate container but
> >>>>>> configured with a host network. Applications almost exclusively runs
> >>>>>> in separate pods.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
> >>>>>>>>>
> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>>>>> closely managed by one piece of software:
> >>>>>>>>>
> >>>>>>>>> 1. Unlink the socket path.
> >>>>>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
> >>>>>>>>>    As soon as the vhost-user device backend calls accept(2) the
> >>>>>>>>>    connection will proceed (there is no need for sleeping).
> >>>>>>>>>
> >>>>>>>>> This approach works across containers without a broker.
> >>>>>>>>
> >>>>>>>> Not sure if I fully understood a question here, but anyway.
> >>>>>>>>
> >>>>>>>> This approach works fine if you know what application to run.
> >>>>>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>>>>> with virtio-user ports running inside a container and want to
> >>>>>>>> have a network connection. Also, this application needs to run
> >>>>>>>> virtio-user in server mode, otherwise restart of the OVS will
> >>>>>>>> require restart of the application. So, you basically need to
> >>>>>>>> rely on a third-party application to create a socket with a right
> >>>>>>>> name and in a correct location that is shared with a host, so
> >>>>>>>> OVS can find it and connect.
> >>>>>>>>
> >>>>>>>> In a VM world everything is much more simple, since you have
> >>>>>>>> a libvirt and QEMU that will take care of all of these stuff
> >>>>>>>> and which are also under full control of management software
> >>>>>>>> and a system administrator.
> >>>>>>>> In case of a container with a "random" DPDK application inside
> >>>>>>>> there is no such entity that can help. Of course, some solution
> >>>>>>>> might be implemented in docker/podman daemon to create and manage
> >>>>>>>> outside-looking sockets for an application inside the container,
> >>>>>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>>>>> ever will.
> >>>>>>>
> >>>>>>> Wait, when you say there is no entity like management software or a
> >>>>>>> system administrator, then how does OVS know to instantiate the new
> >>>>>>> port? I guess something still needs to invoke ovs-ctl add-port?
> >>>>>>
> >>>>>> I didn't mean that there is no any application that configures
> >>>>>> everything. Of course, there is. I mean that there is no such
> >>>>>> entity that abstracts all that socket machinery from the user's
> >>>>>> application that runs inside the container. QEMU hides all the
> >>>>>> details of the connection to vhost backend and presents the device
> >>>>>> as a PCI device with a network interface wrapping from the guest
> >>>>>> kernel. So, the application inside VM shouldn't care what actually
> >>>>>> there is a socket connected to OVS that implements backend and
> >>>>>> forward traffic somewhere. For the application it's just a usual
> >>>>>> network interface.
> >>>>>> But in case of a container world, application should handle all
> >>>>>> that by creating a virtio-user device that will connect to some
> >>>>>> socket, that has an OVS on the other side.
> >>>>>>
> >>>>>>>
> >>>>>>> Can you describe the steps used today (without the broker) for
> >>>>>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>>>>> Although my interest is in the vhost-user protocol I think it's
> >>>>>>> necessary to understand the OVS requirements here and I know little
> >>>>>>> about them.
> >>>>>>
> >>>>>> I might describe some things wrong since I worked with k8s and CNI
> >>>>>> plugins last time ~1.5 years ago, but the basic schema will look
> >>>>>> something like this:
> >>>>>>
> >>>>>> 1. user decides to start a new pod and requests k8s to do that
> >>>>>>    via cmdline tools or some API calls.
> >>>>>>
> >>>>>> 2. k8s scheduler looks for available resources asking resource
> >>>>>>    manager plugins, finds an appropriate physical host and asks
> >>>>>>    local to that node kubelet daemon to launch a new pod there.
> >>>>>>
> >>>>
> >>>> When the CNI is called, the pod has already been created, i.e: a PodID exists
> >>>> and so does an associated network namespace. Therefore, everything that has to
> >>>> do with the runtime spec such as mountpoints or devices cannot be modified by
> >>>> this time.
> >>>>
> >>>> That's why the Device Plugin API is used to modify the Pod's spec before the CNI
> >>>> chain is called.
> >>>>
> >>>>>> 3. kubelet asks local CNI plugin to allocate network resources
> >>>>>>    and annotate the pod with required mount points, devices that
> >>>>>>    needs to be passed in and environment variables.
> >>>>>>    (this is, IIRC, a gRPC connection. It might be a multus-cni
> >>>>>>    or kuryr-kubernetes or any other CNI plugin. CNI plugin is
> >>>>>>    usually deployed as a system DaemonSet, so it runs in a
> >>>>>>    separate pod.
> >>>>>>
> >>>>>> 4. Assuming that vhost-user connection requested in server mode.
> >>>>>>    CNI plugin will:
> >>>>>>    4.1 create a directory for a vhost-user socket.
> >>>>>>    4.2 add this directory to pod annotations as a mount point.
> >>>>
> >>>> I believe this is not possible, it would have to inspect the pod's spec or
> >>>> otherwise determine an existing mount point where the socket should be created.
> >>>
> >>> Uff. Yes, you're right. Thanks for your clarification.
> >>> I mixed up CNI and Device Plugin here.
> >>>
> >>> CNI itself is not able to annotate new resources to the pod, i.e.
> >>> create new mounts or something like this. And I don't recall any
> >>> vhost-user device plugins. Is there any? There is an SR-IOV device
> >>> plugin, but its purpose is to allocate and pass PCI devices, not create
> >>> mounts for vhost-user.
> >>>
> >>> So, IIUC, right now user must create the directory and specify
> >>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> >>> or something like this, right?
> >>>
> >>> Looking at userspace-cni-network-plugin, it actually just parses
> >>> annotations to find the shared directory and fails if there is
> >>> no any:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >>>
> >>> And examples suggests to specify a directory to mount:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >>>
> >>> Looks like this is done by user's hands.
> >>>
> >> Yes, I am one of the primary authors of Userspace CNI. Currently, the
> >> directory is by hand. Long term thought was to have a mutating
> >> webhook/admission controller inject a directory into the podspec. Not sure
> >> if it has changed, but I think when I was originally doing this work, OvS
> >> only lets you choose the directory at install time, so it has to be
> >> something like /var/run/openvswitch/. You can choose the socketfile name
> >> and maybe a subdirectory off the main directory, but not the full path.
> >>
> >> One of the issues I was trying to solve was making sure ContainerA couldn't
> >> see ContainerB's socketfiles. That's where the admission controller could
> >> create a unique subdirectory for each container under
> >> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> >> always took precedence so that work never completed.
> >
> > If the CNI plugin has access to the container's network namespace, could
> > it create an abstract AF_UNIX listen socket?
> >
> > That way the application inside the container could connect to an
> > AF_UNIX socket and there is no need to manage container volumes.
> >
> > I'm not familiar with the Open VSwitch, so I'm not sure if there is a
> > sane way of passing the listen socket fd into ovswitchd from the CNI
> > plugin?
> >
> > The steps:
> > 1. CNI plugin enters container's network namespace and opens an abstract
> >    AF_UNIX listen socket.
> > 2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
> >    add-port step. Instead of using type=dpdkvhostuserclient
> >    options:vhost-server-path=/tmp/dpdkvhostclient0 it instead create a
> >    dpdkvhostuser server with the listen fd.
>
> For this step you will need a side channel, i.e. a separate unix socket
> created by ovs-vswitchd (most likely, created by rte_vhost on
> rte_vhost_driver_register() call).
>
> The problem is that ovs-vsctl talks with ovsdb-server and adds the new
> port -- just a new row in the 'interface' table of the database.
> ovs-vswitchd receives update from the database and creates the actual
> port. All the communications done through JSONRPC, so passing fds is
> not an option.
>
> > 3. When the container starts, it connects to the abstract AF_UNIX
> >    socket. The abstract socket name is provided to the container at
> >    startup time in an environment variable. The name is unique, at least
> >    to the pod, so that multiple containers in the pod can run vhost-user
> >    applications.
>
> Few more problems with this solution:
>
> - We still want to run application inside the container in a server mode,
>   because virtio-user PMD in client mode doesn't support re-connection.
>
> - How to get this fd again after the OVS restart? CNI will not be invoked
>   at this point to pass a new fd.
>
> - If application will close the connection for any reason (restart, some
>   reconfiguration internal to the application) and OVS will be re-started
>   at the same time, abstract socket will be gone. Need a persistent daemon
>   to hold it.

Okay, if there is no component that has a lifetime suitable for holding
the abstract listen socket, then using pathname AF_UNIX sockets seems
like a better approach.

Stefan
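
P.S. For reference, the two binding styles debated above look roughly
like this in outline. This is only a minimal sketch, not code taken
from QEMU, DPDK, or OVS, and it assumes Linux semantics for the
abstract socket namespace (leading NUL byte in sun_path):

    /* Sketch only: contrast pathname vs abstract AF_UNIX listen sockets. */
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Pathname socket: unlink any stale file first (the QEMU approach),
     * so a previous abnormal exit doesn't cause "Address already in use". */
    static int listen_pathname(const char *path)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        snprintf(addr.sun_path, sizeof addr.sun_path, "%s", path);
        unlink(path);                 /* tolerate a stale socket file */
        if (bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
            listen(fd, 1) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    /* Abstract socket: no filesystem entry at all; the name disappears
     * as soon as no process holds the socket anymore. */
    static int listen_abstract(const char *name)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        socklen_t len;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        addr.sun_path[0] = '\0';      /* leading NUL selects the abstract namespace */
        snprintf(addr.sun_path + 1, sizeof addr.sun_path - 1, "%s", name);
        len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
        if (bind(fd, (struct sockaddr *) &addr, len) < 0 || listen(fd, 1) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

Neither variant helps with the fd-passing and restart problems raised
above; it only shows why the stale-file issue exists for the pathname
flavour and not for the abstract one.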