From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: "Billy McFall" <bmcfall@redhat.com>,
"Adrian Moreno" <amorenoz@redhat.com>,
"Maxime Coquelin" <maxime.coquelin@redhat.com>,
"Chenbo Xia" <chenbo.xia@intel.com>,
dev@dpdk.org, "Julia Suvorova" <jusual@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Daniel Berrange" <berrange@redhat.com>
Subject: Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.
Date: Wed, 24 Mar 2021 15:07:27 +0000
Message-ID: <YFtVr0uM3Dpynqyv@stefanha-x1.localdomain>
In-Reply-To: <597d1ec7-d271-dc0d-522d-b900c9cb00ea@ovn.org>
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping is usually required for applications in case
> >>>>>>>>>> the socket server terminated abnormally and socket files were left
> >>>>>>>>>> on the file system:
> >>>>>>>>>> "failed to bind to vhu: Address already in use; remove it and try
> >>>>>>>>>> again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> >>>>>>>>> users might accidentally hijack an existing listen socket, but that
> >>>>>>>>> can be solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly could this be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at
> >>>>>>> the same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator,
> >>>>>>> systemd, etc., because they refuse to run the same service twice.
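To make the pattern concrete, here is a minimal sketch using plain POSIX
APIs. The file paths, names and error handling are illustrative, not taken
from QEMU's actual code:

    /* Hold an exclusive pidfile lock; once we hold it, any leftover
     * socket file can only be stale, so it is safe to unlink it before
     * binding. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int listen_unix(const char *pidfile, const char *sock_path)
    {
        int pidfd = open(pidfile, O_RDWR | O_CREAT, 0644);
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

        if (pidfd < 0 || fcntl(pidfd, F_SETLK, &fl) < 0)
            return -1;                /* another instance is running */
        dprintf(pidfd, "%d\n", getpid());

        unlink(sock_path);            /* remove the stale socket, if any */

        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 1) < 0)
            return -1;
        return fd;                    /* the listen fd */
    }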
> >>>>>>
> >>>>>> Sure. I understand that. My point was that these could be 2 different
> >>>>>> applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>
> >>>>>>>> And what if this is
> >>>>>>>> a different application that tries to create a socket on the same path?
> >>>>>>>> E.g. QEMU creates a socket (started in server mode) and the user
> >>>>>>>> accidentally creates a dpdkvhostuser port in Open vSwitch instead of
> >>>>>>>> dpdkvhostuserclient. This way the rte_vhost library will try to bind
> >>>>>>>> to an existing socket file and will fail. Subsequently, port creation
> >>>>>>>> in OVS will fail. We can't allow OVS to unlink files because this
> >>>>>>>> way OVS users would have the ability to unlink random sockets that
> >>>>>>>> OVS has access to, and we also have no idea whether it was QEMU that
> >>>>>>>> created the file, a virtio-user application, or someone else.
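For context, the rte_vhost setup behind a dpdkvhostuser/dpdkvhostuserclient
port boils down to roughly the following simplified sketch (not the actual
OVS code; error handling omitted):

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_vhost.h>

    /* With flags == 0 rte_vhost acts as the socket server on 'path'
     * (this is where a stale or foreign file causes the failure described
     * above); with RTE_VHOST_USER_CLIENT it connects to a socket created
     * by someone else, e.g. QEMU. */
    static int register_vhost_port(const char *path, bool client)
    {
        uint64_t flags = client ? RTE_VHOST_USER_CLIENT : 0;

        if (rte_vhost_driver_register(path, flags) != 0)
            return -1;

        return rte_vhost_driver_start(path);
    }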
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that
> >>>>>>> networking doesn't work. They can either hot unplug the QEMU
> >>>>>>> vhost-user-net device or restart QEMU, depending on whether they need
> >>>>>>> to keep the guest running or not. This is a misconfiguration that is
> >>>>>>> recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but at a high cost. Restarting a VM is rarely
> >>>>>> desirable. And the application inside the guest might not cope well
> >>>>>> with a hot re-plug of a device that it actively used. I'd expect
> >>>>>> a DPDK application that runs inside a guest on some virtio-net device
> >>>>>> to crash after this kind of manipulation, especially if it uses an
> >>>>>> older version of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about.
> >>>>> There are many ways for users to misconfigure things when working with
> >>>>> system tools. If it's possible to catch misconfigurations, that is
> >>>>> preferable. In this case it's just the way pathname AF_UNIX domain
> >>>>> sockets work and IMO it's better not to have problems starting the
> >>>>> service due to stale files than to insist on preventing
> >>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >>>>> successful, so ¯\_(ツ)_/¯.
> >>>>>
> >>>>>>>
> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>>>>> creates a security issue. I don't know the security model of OVS.
> >>>>>>
> >>>>>> In general, the privileges of the ovs-vswitchd daemon might be
> >>>>>> completely different from the privileges required to invoke control
> >>>>>> utilities or to access the configuration database. So, yes, we should
> >>>>>> not allow that.
> >>>>>
> >>>>> That can be locked down by restricting the socket path to a file beneath
> >>>>> /var/run/ovs/vhost-user/.
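Something along these lines, sketched with plain libc calls (the allowed
directory is just an example):

    #include <libgen.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #define VHOST_SOCK_DIR "/var/run/ovs/vhost-user"

    /* Accept a vhost-user socket path only if its parent directory
     * resolves (symlinks and ".." included) to the allowed directory.
     * The socket file itself may not exist yet, so only the directory
     * part is resolved. */
    static bool vhost_path_allowed(const char *path)
    {
        char copy[PATH_MAX], resolved[PATH_MAX];

        if (strlen(path) >= sizeof(copy))
            return false;
        strcpy(copy, path);

        if (realpath(dirname(copy), resolved) == NULL)
            return false;

        return strcmp(resolved, VHOST_SOCK_DIR) == 0;
    }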
> >>>>>
> >>>>>>>
> >>>>>>>> There are, probably, ways to detect if there is any live process
> >>>>>>>> that has this socket open, but that sounds like too much for this
> >>>>>>>> purpose, and I'm not sure if it's even possible if the actual user
> >>>>>>>> is in a different container.
> >>>>>>>> So I don't see a good, reliable way to detect these conditions. This
> >>>>>>>> falls on the shoulders of higher-level management software or the
> >>>>>>>> user to clean these socket files up before adding ports.
> >>>>>>>
> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
> >>>>>>> Abstract AF_UNIX sockets don't have a filesystem path and the socket
> >>>>>>> address disappears when there is no process listening anymore.
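For reference, an abstract socket is bound by putting a leading NUL byte in
sun_path; a minimal sketch (the name passed in is arbitrary):

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Bind an abstract AF_UNIX listen socket.  The kernel removes the
     * address automatically when the last socket bound to it is closed,
     * so nothing ever appears on disk and nothing can go stale. */
    static int listen_abstract(const char *name)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        socklen_t len;

        if (fd < 0)
            return -1;
        addr.sun_path[0] = '\0';                    /* abstract namespace */
        strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);

        /* The address length must cover only the leading NUL plus the
         * name, otherwise trailing NULs become part of the address. */
        len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
        if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 1) < 0)
            return -1;
        return fd;
    }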
> >>>>>>
> >>>>>> OVS is usually started right on the host in the main network namespace.
> >>>>>> In case it's started in a pod, it will run in a separate container but
> >>>>>> configured with host networking. Applications almost exclusively run
> >>>>>> in separate pods.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
> >>>>>>>>>> Broker.
> >>>>>>>>>
> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>>>>> closely managed by one piece of software:
> >>>>>>>>>
> >>>>>>>>> 1. Unlink the socket path.
> >>>>>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>>>> RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>>>>> 4. In the meantime the VMM can open the socket path and call
> >>>>>>>>>    connect(2). As soon as the vhost-user device backend calls
> >>>>>>>>>    accept(2) the connection will proceed (there is no need for
> >>>>>>>>>    sleeping).
> >>>>>>>>>
> >>>>>>>>> This approach works across containers without a broker.
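Step 3 relies on ordinary SCM_RIGHTS file descriptor passing. A minimal
sketch of handing the listen fd to the backend over an existing AF_UNIX
connection (names are illustrative):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an already-listening socket fd to the device backend over an
     * established AF_UNIX connection using SCM_RIGHTS. */
    static int send_listen_fd(int chan_fd, int listen_fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

        return sendmsg(chan_fd, &msg, 0) < 0 ? -1 : 0;
    }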
> >>>>>>>>
> >>>>>>>> Not sure if I fully understood the question here, but anyway.
> >>>>>>>>
> >>>>>>>> This approach works fine if you know what application to run.
> >>>>>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>>>>> with virtio-user ports running inside a container that wants to
> >>>>>>>> have a network connection. Also, this application needs to run
> >>>>>>>> virtio-user in server mode, otherwise a restart of OVS will
> >>>>>>>> require a restart of the application. So, you basically need to
> >>>>>>>> rely on a third-party application to create a socket with the right
> >>>>>>>> name and in the correct location that is shared with the host, so
> >>>>>>>> OVS can find it and connect.
> >>>>>>>>
> >>>>>>>> In a VM world everything is much simpler, since you have
> >>>>>>>> libvirt and QEMU that will take care of all of this stuff
> >>>>>>>> and which are also under full control of the management software
> >>>>>>>> and the system administrator.
> >>>>>>>> In case of a container with a "random" DPDK application inside,
> >>>>>>>> there is no such entity that can help. Of course, some solution
> >>>>>>>> might be implemented in the docker/podman daemon to create and manage
> >>>>>>>> outside-looking sockets for an application inside the container,
> >>>>>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>>>>> ever will be.
> >>>>>>>
> >>>>>>> Wait, when you say there is no entity like management software or a
> >>>>>>> system administrator, then how does OVS know to instantiate the new
> >>>>>>> port? I guess something still needs to invoke ovs-vsctl add-port?
> >>>>>>
> >>>>>> I didn't mean that there is no application at all that configures
> >>>>>> everything. Of course, there is. I mean that there is no such
> >>>>>> entity that abstracts all that socket machinery from the user's
> >>>>>> application that runs inside the container. QEMU hides all the
> >>>>>> details of the connection to the vhost backend and presents the device
> >>>>>> as a PCI device that the guest kernel wraps with a network interface.
> >>>>>> So, the application inside the VM shouldn't care that there is actually
> >>>>>> a socket connected to OVS that implements the backend and forwards
> >>>>>> traffic somewhere. For the application it's just an ordinary
> >>>>>> network interface.
> >>>>>> But in the container world, the application has to handle all of
> >>>>>> that itself by creating a virtio-user device that will connect to
> >>>>>> some socket that has OVS on the other side.
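For illustration, that typically means the application brings up a
virtio-user vdev in server mode pointing at the shared socket, along these
lines (a sketch; the socket path is made up and the exact vdev parameter
spelling should be checked against the DPDK release in use):

    #include <rte_eal.h>

    int main(void)
    {
        /* A containerized DPDK app creating a virtio-user port in server
         * mode on a socket path shared with the host; OVS connects from
         * the other side.  Path and parameters are illustrative. */
        char *argv[] = {
            "app",
            "--no-pci",
            "--vdev=virtio_user0,path=/var/run/app/vhost-user/sock0,server=1",
        };

        if (rte_eal_init(3, argv) < 0)
            return 1;

        /* ... the rest of the application then uses port 0 as a normal
         * ethdev. */
        return 0;
    }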
> >>>>>>
> >>>>>>>
> >>>>>>> Can you describe the steps used today (without the broker) for
> >>>>>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>>>>> Although my interest is in the vhost-user protocol I think it's
> >>>>>>> necessary to understand the OVS requirements here and I know little
> >>>>>>> about them.
> >>>>>>>
> >>>>>> I might describe some things wrong since I last worked with k8s and
> >>>>>> CNI plugins ~1.5 years ago, but the basic schema will look
> >>>>>> something like this:
> >>>>>>
> >>>>>> 1. The user decides to start a new pod and requests k8s to do that
> >>>>>>    via cmdline tools or some API calls.
> >>>>>>
> >>>>>> 2. The k8s scheduler looks for available resources by asking resource
> >>>>>>    manager plugins, finds an appropriate physical host and asks the
> >>>>>>    kubelet daemon local to that node to launch a new pod there.
> >>>>>>
> >>>>
> >>>> When the CNI is called, the pod has already been created, i.e. a PodID
> >>>> exists and so does an associated network namespace. Therefore, everything
> >>>> that has to do with the runtime spec, such as mountpoints or devices,
> >>>> cannot be modified by this time.
> >>>>
> >>>> That's why the Device Plugin API is used to modify the Pod's spec before
> >>>> the CNI chain is called.
> >>>>
> >>>>>> 3. kubelet asks the local CNI plugin to allocate network resources
> >>>>>>    and annotate the pod with required mount points, devices that
> >>>>>>    need to be passed in, and environment variables.
> >>>>>>    (This is, IIRC, a gRPC connection. It might be multus-cni
> >>>>>>    or kuryr-kubernetes or any other CNI plugin. The CNI plugin is
> >>>>>>    usually deployed as a system DaemonSet, so it runs in a
> >>>>>>    separate pod.)
> >>>>>>
> >>>>>> 4. Assuming that the vhost-user connection is requested in server
> >>>>>>    mode, the CNI plugin will:
> >>>>>>    4.1 create a directory for the vhost-user socket.
> >>>>>>    4.2 add this directory to the pod annotations as a mount point.
> >>>>
> >>>> I believe this is not possible; it would have to inspect the pod's spec
> >>>> or otherwise determine an existing mount point where the socket should
> >>>> be created.
> >>>
> >>> Uff. Yes, you're right. Thanks for your clarification.
> >>> I mixed up CNI and Device Plugin here.
> >>>
> >>> CNI itself is not able to annotate new resources to the pod, i.e.
> >>> create new mounts or something like this. And I don't recall any
> >>> vhost-user device plugins. Are there any? There is an SR-IOV device
> >>> plugin, but its purpose is to allocate and pass PCI devices, not create
> >>> mounts for vhost-user.
> >>>
> >>> So, IIUC, right now the user must create the directory and specify
> >>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> >>> or something like this, right?
> >>>
> >>> Looking at userspace-cni-network-plugin, it actually just parses
> >>> annotations to find the shared directory and fails if there is
> >>> none:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >>>
> >>> And the examples suggest specifying a directory to mount:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >>>
> >>> Looks like this is done by hand by the user.
> >>>
> >> Yes, I am one of the primary authors of Userspace CNI. Currently, the
> >> directory is created by hand. The long-term thought was to have a mutating
> >> webhook/admission controller inject a directory into the podspec. Not sure
> >> if it has changed, but I think when I was originally doing this work, OvS
> >> only let you choose the directory at install time, so it has to be
> >> something like /var/run/openvswitch/. You can choose the socketfile name
> >> and maybe a subdirectory off the main directory, but not the full path.
> >>
> >> One of the issues I was trying to solve was making sure ContainerA couldn't
> >> see ContainerB's socketfiles. That's where the admission controller could
> >> create a unique subdirectory for each container under
> >> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> >> always took precedence so that work never completed.
> >
> > If the CNI plugin has access to the container's network namespace, could
> > it create an abstract AF_UNIX listen socket?
> >
> > That way the application inside the container could connect to an
> > AF_UNIX socket and there is no need to manage container volumes.
> >
> > I'm not familiar with Open vSwitch, so I'm not sure if there is a
> > sane way of passing the listen socket fd into ovs-vswitchd from the CNI
> > plugin?
> >
> > The steps:
> > 1. The CNI plugin enters the container's network namespace and opens an
> >    abstract AF_UNIX listen socket.
> > 2. The CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
> >    add-port step. Instead of using type=dpdkvhostuserclient
> >    options:vhost-server-path=/tmp/dpdkvhostclient0, it creates a
> >    dpdkvhostuser server with the listen fd.
>
> For this step you will need a side channel, i.e. a separate unix socket
> created by ovs-vswitchd (most likely, created by rte_vhost on the
> rte_vhost_driver_register() call).
>
> The problem is that ovs-vsctl talks with ovsdb-server and adds the new
> port -- just a new row in the 'interface' table of the database.
> ovs-vswitchd receives the update from the database and creates the actual
> port. All the communication is done through JSONRPC, so passing fds is
> not an option.
>
> > 3. When the container starts, it connects to the abstract AF_UNIX
> > socket. The abstract socket name is provided to the container at
> > startup time in an environment variable. The name is unique, at least
> > to the pod, so that multiple containers in the pod can run vhost-user
> > applications.
>
> A few more problems with this solution:
>
> - We still want to run the application inside the container in server mode,
>   because the virtio-user PMD in client mode doesn't support re-connection.
>
> - How to get this fd again after an OVS restart? CNI will not be invoked
>   at this point to pass a new fd.
>
> - If the application closes the connection for any reason (restart, some
>   reconfiguration internal to the application) and OVS is re-started
>   at the same time, the abstract socket will be gone. A persistent daemon
>   is needed to hold it.
Okay, if there is no component that has a lifetime suitable for holding
the abstract listen socket, then using pathname AF_UNIX sockets seems
like a better approach.
Stefan