Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
From: Adrian Moreno
To: Stefan Hajnoczi, Ilya Maximets
Cc: Maxime Coquelin, Chenbo Xia, dev@dpdk.org, Julia Suvorova, Marc-André Lureau, Daniel Berrange, Billy McFall
Date: Tue, 23 Mar 2021 18:57:41 +0100
References: <20210317202530.4145673-1-i.maximets@ovn.org>

On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>> And some housekeeping is usually required for applications in case the socket server terminated abnormally and socket files were left on the file system: "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>
>>>>> QEMU avoids this by unlinking before binding. The drawback is that users might accidentally hijack an existing listen socket, but that can be solved with a pidfile.
>>>>
>>>> How exactly could this be solved with a pidfile?
>>>
>>> A pidfile prevents two instances of the same service from running at the same time.
>>>
>>> The same effect can be achieved by the container orchestrator, systemd, etc. too, because it refuses to run the same service twice.
>>
>> Sure. I understand that. My point was that these could be two different applications and they might not know which process to look for.
>>
>>>> And what if this is a different application that tries to create a socket on the same path? E.g. QEMU creates a socket (started in server mode) and the user accidentally created a dpdkvhostuser port in Open vSwitch instead of dpdkvhostuserclient. This way the rte_vhost library will try to bind to an existing socket file and will fail. Subsequently, port creation in OVS will fail. We can't allow OVS to unlink files, because that would give OVS users the ability to unlink random sockets that OVS has access to, and we also have no idea whether it was QEMU that created the file, a virtio-user application, or someone else.
>>>
>>> If rte_vhost unlinks the socket then the user will find that networking doesn't work. They can either hot-unplug the QEMU vhost-user-net device or restart QEMU, depending on whether they need to keep the guest running or not. This is a misconfiguration that is recoverable.
>>
>> True, it's recoverable, but at a high cost. A restart of a VM is rarely desirable. And the application inside the guest might not cope well with a hot re-plug of a device that it actively used. I'd expect a DPDK application that runs inside a guest on some virtio-net device to crash after this kind of manipulation, especially if it uses an older version of DPDK.
>
> This unlink issue is probably something we think differently about. There are many ways for users to misconfigure things when working with system tools. If it's possible to catch misconfigurations, that is preferable. In this case it's just the way pathname AF_UNIX domain sockets work, and IMO it's better not to have problems starting the service due to stale files than to insist on preventing misconfigurations. QEMU and DPDK do this differently and both seem to be successful, so ¯\_(ツ)_/¯.
>
>>> Regarding letting OVS unlink files, I agree that it shouldn't if this creates a security issue. I don't know the security model of OVS.
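
For reference, the unlink-before-bind pattern mentioned above boils down to roughly the following. This is a minimal sketch, not QEMU's actual code: error handling is reduced, and the pidfile or orchestrator check that prevents taking over a live listener is assumed to happen before this function is called.

/* Minimal sketch of the unlink-before-bind pattern for a pathname
 * AF_UNIX listen socket. A pidfile (or the orchestrator) is expected
 * to prevent two instances from racing on the same path. */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static int listen_unix(const char *path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    /* Remove a stale socket file left behind by a previous abnormal
     * exit. This is what avoids "Address already in use", at the cost
     * of silently taking over the path if another process still uses it. */
    unlink(path);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}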
>>
>> In general, the privileges of the ovs-vswitchd daemon might be completely different from the privileges required to invoke control utilities or to access the configuration database. So, yes, we should not allow that.
>
> That can be locked down by restricting the socket path to a file beneath /var/run/ovs/vhost-user/.
>
>>>> There are probably ways to detect whether any live process has this socket open, but that sounds like too much for this purpose, and I'm not sure it's even possible if the actual user is in a different container. So I don't see a good, reliable way to detect these conditions. It falls on the shoulders of higher-level management software or the user to clean these socket files up before adding ports.
>>>
>>> Does OVS always run in the same net namespace (pod) as the DPDK application? If yes, then abstract AF_UNIX sockets can be used. Abstract AF_UNIX sockets don't have a filesystem path and the socket address disappears when there is no process listening anymore.
>>
>> OVS is usually started right on the host in the main network namespace. If it's started in a pod, it runs in a separate container but configured with host networking. Applications almost exclusively run in separate pods.
>
> Okay.
>
>>>>>> This patch-set aims to eliminate most of the inconveniences by leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>
>>>>> I don't understand yet why this is useful for vhost-user, where the creation of the vhost-user device backend and its use by a VMM are closely managed by one piece of software:
>>>>>
>>>>> 1. Unlink the socket path.
>>>>> 2. Create, bind, and listen on the socket path.
>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK RPC, spawn a process, etc.) and pass in the listen fd.
>>>>> 4. In the meantime the VMM can open the socket path and call connect(2). As soon as the vhost-user device backend calls accept(2) the connection will proceed (there is no need for sleeping).
>>>>>
>>>>> This approach works across containers without a broker.
>>>>
>>>> Not sure if I fully understood the question here, but anyway:
>>>>
>>>> This approach works fine if you know what application to run. In the case of a k8s cluster, it might be a random DPDK application with virtio-user ports running inside a container that wants to have a network connection. Also, this application needs to run virtio-user in server mode, otherwise a restart of OVS will require a restart of the application. So you basically need to rely on a third-party application to create a socket with the right name and in the correct location that is shared with the host, so OVS can find it and connect.
>>>>
>>>> In the VM world everything is much simpler, since you have libvirt and QEMU, which take care of all of this and which are also under the full control of the management software and the system administrator. In the case of a container with a "random" DPDK application inside, there is no such entity that can help. Of course, some solution might be implemented in the docker/podman daemon to create and manage externally visible sockets for an application inside the container, but that is not available today AFAIK and I'm not sure it ever will be.
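
As a side note on the abstract sockets mentioned above: the address lives in a kernel-managed namespace (sun_path starts with a NUL byte), nothing is created on any filesystem, and the name is released automatically when the last socket bound to it is closed. They are Linux-specific and only reachable from within the same network namespace. A rough sketch, with the name assumed to fit into sun_path:

/* Sketch: bind and listen on a Linux abstract-namespace AF_UNIX socket.
 * No filesystem entry is created and the address disappears once no
 * process holds a socket bound to it anymore. */
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static int listen_abstract(const char *name)
{
    struct sockaddr_un addr;
    socklen_t len;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* A leading NUL byte selects the abstract namespace;
     * the actual name follows after it. */
    addr.sun_path[0] = '\0';
    strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
    len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);

    if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}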
>>>
>>> Wait, when you say there is no entity like management software or a system administrator, then how does OVS know to instantiate the new port? I guess something still needs to invoke ovs-ctl add-port?
>>
>> I didn't mean that there is no application that configures everything. Of course there is. I mean that there is no entity that abstracts all that socket machinery from the user's application running inside the container. QEMU hides all the details of the connection to the vhost backend and presents the device as a PCI device, with a network interface wrapped around it by the guest kernel. So the application inside the VM doesn't need to care that there is actually a socket connected to OVS, which implements the backend and forwards the traffic somewhere. For the application it's just a regular network interface.
>> But in the container world, the application has to handle all of that itself by creating a virtio-user device that connects to some socket with OVS on the other side.
>>
>>> Can you describe the steps used today (without the broker) for instantiating a new DPDK app container and connecting it to OVS? Although my interest is in the vhost-user protocol, I think it's necessary to understand the OVS requirements here, and I know little about them.
>>
>> I might describe some things wrong since I last worked with k8s and CNI plugins ~1.5 years ago, but the basic scheme looks something like this:
>>
>> 1. The user decides to start a new pod and requests k8s to do that via command-line tools or some API calls.
>>
>> 2. The k8s scheduler looks for available resources by asking resource manager plugins, finds an appropriate physical host, and asks the kubelet daemon local to that node to launch a new pod there.

When the CNI is called, the pod has already been created, i.e. a PodID exists and so does an associated network namespace. Therefore, everything that has to do with the runtime spec, such as mount points or devices, cannot be modified at this point. That's why the Device Plugin API is used to modify the pod's spec before the CNI chain is called.

>> 3. kubelet asks the local CNI plugin to allocate network resources and annotate the pod with the required mount points, devices that need to be passed in, and environment variables. (This is, IIRC, a gRPC connection. It might be multus-cni, kuryr-kubernetes, or any other CNI plugin. The CNI plugin is usually deployed as a system DaemonSet, so it runs in a separate pod.)
>>
>> 4. Assuming the vhost-user connection is requested in server mode, the CNI plugin will:
>> 4.1 create a directory for the vhost-user socket.
>> 4.2 add this directory to the pod annotations as a mount point.

I believe this is not possible: the CNI plugin would have to inspect the pod's spec or otherwise determine an existing mount point where the socket should be created. +Billy might give more insights on this.

>> 4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or by connecting to ovsdb-server via JSON-RPC directly. It will set the port type to dpdkvhostuserclient and specify socket-path as a path inside the directory it created. (OVS will create the port and rte_vhost will enter the re-connection loop since the socket does not exist yet.)
>> 4.4 set up the socket file location as an environment variable in the pod annotations.
>> 4.5 report success to kubelet.
>>
Since the CNI cannot modify the pod's mounts, it has to rely on a Device Plugin or another external entity that can inject the mount point before the pod is created. However, there is another use case that might be relevant: dynamic attachment of network interfaces. In that case the CNI cannot work in collaboration with a Device Plugin or "mount-point injector", and an existing mount point has to be used. Also, some form of notification mechanism has to exist to tell the workload that a new socket is ready.

>> 5. kubelet will finish all other preparations and resource allocations and will ask docker/podman to start a container with all the mount points, devices and environment variables from the pod annotation.
>>
>> 6. docker/podman starts a container. It's worth mentioning here that in many cases the initial process of a container is not the actual application that will use the vhost-user connection, but more likely a shell that will invoke the actual application.
>>
>> 7. The application starts inside the container, checks the environment variables (actually, checking the environment variables usually happens in a shell script that invokes the application with the correct arguments) and creates a net_virtio_user port in server mode. At this point the socket file will be created. (Since we're running a third-party application inside the container, we can only assume that it will do what is written here; it's the responsibility of the application developer to do the right thing.)
>>
>> 8. OVS successfully re-connects to the newly created socket in the shared directory and the vhost-user protocol establishes network connectivity.
>>
>> As you can see, there are way too many entities and communication methods involved. So, passing a pre-opened file descriptor from the CNI all the way down to the application is not as easy as it is in the case of QEMU + libvirt.
>
> File descriptor passing isn't necessary if OVS owns the listen socket and the application container is the one that connects. That's why I asked why dpdkvhostuser was deprecated in another email. The benefit of doing this would be that the application container can instantly connect to OVS without a sleep loop.
>
> I still don't get the attraction of the broker idea. The pros:
> + Overcomes the issue with stale UNIX domain socket files
> + Eliminates the re-connect sleep loop
>
> Neutral:
> * vhost-user UNIX domain socket directory container volume is replaced by a broker UNIX domain socket bind mount
> * UNIX domain socket naming conflicts become broker key naming conflicts
>
> The cons:
> - Requires running a new service on the host with potential security issues
> - Requires support in third-party applications, QEMU, and DPDK/OVS
> - The old code must be kept for compatibility with non-broker configurations, especially since third-party applications may not support the broker. Developers and users will have to learn about both options and decide which one to use.
>
> This seems like a modest improvement for the complexity and effort involved. The same pros can be achieved by:
> * Adding unlink(2) to rte_vhost (or applications can add rm -f $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is that it doesn't catch a misconfiguration where the user launches two processes with the same socket path.
> * Reversing the direction of the client/server relationship to eliminate the re-connect sleep loop at startup. I'm unsure whether this is possible.
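
To make step 7 of the walkthrough above a bit more concrete, here is a minimal sketch of what the application inside the container might do. It assumes the CNI plugin exposes the socket location through a hypothetical VHOST_SOCK environment variable and that a recent DPDK with the net_virtio_user vdev is used (parameter names may vary between releases). It also hints at the last point: dropping server=1 would reverse the client/server direction, so the application would connect to a socket owned by OVS instead of creating one itself.

/* Sketch of step 7: a containerized DPDK app creating a virtio-user
 * port in server mode on the socket path handed over by the CNI plugin.
 * VHOST_SOCK is a made-up variable name; the real contract between the
 * CNI plugin and the application may differ. */
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>

int main(int argc, char **argv)
{
    const char *sock = getenv("VHOST_SOCK");  /* set via pod annotations */
    char vdev[256];
    char *eal_argv[3];

    (void)argc;
    if (sock == NULL) {
        fprintf(stderr, "VHOST_SOCK not set\n");
        return 1;
    }

    /* server=1: the socket file is created by this side, and the
     * dpdkvhostuserclient port in OVS re-connects to it (step 8).
     * Without server=1 the roles would be reversed and this side
     * would connect to a socket that OVS listens on. */
    snprintf(vdev, sizeof(vdev),
             "--vdev=net_virtio_user0,path=%s,server=1,queues=1", sock);

    eal_argv[0] = argv[0];
    eal_argv[1] = vdev;
    eal_argv[2] = "--no-pci";

    if (rte_eal_init(3, eal_argv) < 0)
        return 1;

    /* ... usual ethdev configuration and the application's packet loop ... */
    return 0;
}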
>
> That said, the broker idea doesn't affect the vhost-user protocol itself and is more of an OVS/DPDK topic. I may just not be familiar enough with OVS/DPDK to understand the benefits of the approach.
>
> Stefan