Date: Fri, 19 Mar 2021 14:05:54 +0000
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev@dpdk.org, Adrian Moreno, Julia Suvorova, Marc-André Lureau, Daniel Berrange
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
References: <20210317202530.4145673-1-i.maximets@ovn.org>
List-Id: DPDK patches and discussions <dev.dpdk.org>

On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> And some housekeeping usually required for applications in case the
> >> socket server terminated abnormally and socket files left on a file
> >> system:
> >> "failed to bind to vhu: Address already in use; remove it and try again"
> >
> > QEMU avoids this by unlinking before binding. The drawback is that users
> > might accidentally hijack an existing listen socket, but that can be
> > solved with a pidfile.
>
> How exactly this could be solved with a pidfile?

A pidfile prevents two instances of the same service from running at the
same time. A container orchestrator, systemd, etc. achieves the same
effect because it refuses to run the same service twice.

> And what if this is
> a different application that tries to create a socket on a same path?
> e.g.
> QEMU creates a socket (started in a server mode) and user
> accidentally created dpdkvhostuser port in Open vSwitch instead of
> dpdkvhostuserclient. This way rte_vhost library will try to bind
> to an existing socket file and will fail. Subsequently port creation
> in OVS will fail. We can't allow OVS to unlink files because this
> way OVS users will have ability to unlink random sockets that OVS has
> access to and we also has no idea if it's a QEMU that created a file
> or it was a virtio-user application or someone else.

If rte_vhost unlinks the socket then the user will find that networking
doesn't work. They can either hot unplug the QEMU vhost-user-net device
or restart QEMU, depending on whether they need to keep the guest
running. This is a misconfiguration that is recoverable.

Regarding letting OVS unlink files, I agree that it shouldn't if this
creates a security issue. I don't know the security model of OVS.

> There are, probably, ways to detect if there is any alive process that
> has this socket open, but that sounds like too much for this purpose,
> also I'm not sure if it's possible if actual user is in a different
> container.
> So I don't see a good reliable way to detect these conditions. This
> falls on shoulders of a higher level management software or a user to
> clean these socket files up before adding ports.

Does OVS always run in the same net namespace (pod) as the DPDK
application? If yes, then abstract AF_UNIX sockets can be used. Abstract
AF_UNIX sockets don't have a filesystem path and the socket address
disappears when there is no process listening anymore.

> >> This patch-set aims to eliminate most of the inconveniences by
> >> leveraging an infrastructure service provided by a SocketPair Broker.
> >
> > I don't understand yet why this is useful for vhost-user, where the
> > creation of the vhost-user device backend and its use by a VMM are
> > closely managed by one piece of software:
> >
> > 1. Unlink the socket path.
> > 2. Create, bind, and listen on the socket path.
> > 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >    RPC, spawn a process, etc.) and pass in the listen fd.
> > 4. In the meantime the VMM can open the socket path and call connect(2).
> >    As soon as the vhost-user device backend calls accept(2) the
> >    connection will proceed (there is no need for sleeping).
> >
> > This approach works across containers without a broker.
>
> Not sure if I fully understood a question here, but anyway.
>
> This approach works fine if you know what application to run.
> In case of a k8s cluster, it might be a random DPDK application
> with virtio-user ports running inside a container and want to
> have a network connection. Also, this application needs to run
> virtio-user in server mode, otherwise restart of the OVS will
> require restart of the application. So, you basically need to
> rely on a third-party application to create a socket with a right
> name and in a correct location that is shared with a host, so
> OVS can find it and connect.
>
> In a VM world everything is much more simple, since you have
> a libvirt and QEMU that will take care of all of these stuff
> and which are also under full control of management software
> and a system administrator.
> In case of a container with a "random" DPDK application inside
> there is no such entity that can help. Of course, some solution
> might be implemented in docker/podman daemon to create and manage
> outside-looking sockets for an application inside the container,
> but that is not available today AFAIK and I'm not sure if it
> ever will.

Wait, when you say there is no entity like management software or a
system administrator, then how does OVS know to instantiate the new
port? I guess something still needs to invoke ovs-ctl add-port?

Can you describe the steps used today (without the broker) for
instantiating a new DPDK app container and connecting it to OVS?
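For concreteness, the unlink-before-bind pattern with a pidfile guard
that was discussed above could be sketched like this. This is a minimal
illustration in Python, not rte_vhost or QEMU code; the function name
and file layout are made up for the example:

```python
import os
import socket


def create_listen_socket(path, pidfile):
    """QEMU-style recovery: unlink a stale socket file before binding.

    The pidfile makes the unlink safe: if another instance of the same
    service is still alive, we refuse to start instead of hijacking its
    listen socket.
    """
    if os.path.exists(pidfile):
        with open(pidfile) as f:
            pid = int(f.read().strip())
        try:
            os.kill(pid, 0)  # signal 0 only probes for existence
            raise RuntimeError("already running as pid %d" % pid)
        except ProcessLookupError:
            pass  # stale pidfile from an abnormal exit; safe to continue
    with open(pidfile, "w") as f:
        f.write(str(os.getpid()))

    # Any leftover socket file is now known to be stale, so remove it.
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock
```

The VMM can connect(2) as soon as bind/listen are done; accept(2) by the
backend completes the connection later, so no polling loop is needed on
either side.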
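The abstract AF_UNIX sockets mentioned earlier behave like this: a
leading NUL byte in the address selects the Linux-only abstract
namespace, the name never appears on the filesystem, and the kernel
reclaims it when the last fd is closed, so none of the stale-file
housekeeping applies. A small illustration (the address name is
hypothetical, not anything OVS or DPDK actually uses):

```python
import socket

# Leading NUL byte => abstract namespace: no filesystem path exists,
# and the address disappears automatically when all fds are closed.
ADDR = "\0demo-vhost-broker"

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(ADDR)   # creates no socket file anywhere
server.listen(1)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(ADDR)
conn, _ = server.accept()

conn.sendall(b"ping")
assert client.recv(4) == b"ping"

# No unlink needed: closing every fd releases the name to the kernel.
for s in (conn, client, server):
    s.close()
```

The caveat from the thread still applies: abstract sockets are scoped to
a network namespace, so OVS and the DPDK application must share one
(e.g. the same pod) for this to work.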
Although my interest is in the vhost-user protocol, I think it's
necessary to understand the OVS requirements here, and I know little
about them.

Stefan