* [dpdk-dev] [RFC] vhost: new rte_vhost API proposal
@ 2018-05-10 13:22 Dariusz Stojaczyk
[not found] ` <20180510163643.GD9308@stefanha-x1.localdomain>
2018-05-18 13:01 ` [dpdk-dev] [RFC v2] " Dariusz Stojaczyk
0 siblings, 2 replies; 17+ messages in thread
From: Dariusz Stojaczyk @ 2018-05-10 13:22 UTC (permalink / raw)
To: dev, Maxime Coquelin, Tiwei Bie, Tetsuya Mukawa, Thomas Monjalon
Cc: yliu, Stefan Hajnoczi, Dariusz Stojaczyk
rte_vhost has been confirmed not to work with some Virtio devices
(it's not vhost-user spec compliant; see details below) and fixing
it directly would require a substantial number of changes that would
completely break backwards compatibility. This library is intended
to smooth out the transition. It exposes a low-level API for
implementing new Virtio drivers/targets. The existing rte_vhost
is about to be refactored to use the rte_virtio library underneath,
and demanding drivers will be able to use rte_virtio directly.
rte_virtio would offer both vhost and virtio driver APIs. These two
have a lot of common code for vhost-user handling or PCI access for
initiator/virtio-vhost-user (and possibly vDPA) so there's little
sense to keep target and initiator code separated between different
libs. Of course, the APIs would be separate - only some parts of
the code would be shared.
rte_virtio intends to abstract away most vhost-user/virtio-vhost-user
specifics and to allow developers to implement Virtio targets/drivers
with ease. It calls user-provided callbacks once a proper device
initialization state has been reached. That is - memory mappings
have changed, virtqueues are ready to be processed, features have
changed at runtime, etc.
Compared to rte_vhost, this lib additionally allows the following:
* ability to start/stop particular queues - that's required
by the vhost-user spec. rte_vhost has already been confirmed
not to work with some Virtio devices which do not initialize
some of their management queues.
* most callbacks are now asynchronous - this greatly simplifies
event handling for asynchronous applications and doesn't
make anything harder for synchronous ones.
* this is a low-level API. It doesn't have any vhost-net, nvme
or crypto references. These backend-specific libraries will
be refactored later to use *this* generic library underneath.
This implies that the library doesn't do any virtqueue processing;
it only delivers vring addresses to the user, so the application
can process virtqueues by itself.
* abstracting away PCI/vhost-user.
* the API constrains how public functions can be called and how
internal data can change, so only minimal work is required
to ensure thread safety. Possibly no mutexes are required at all.
* full Virtio 1.0/vhost-user specification compliance.
This patch only introduces the API. Some additional functions
for vDPA might be still required, but everything present here
so far shouldn't need changing.
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
---
lib/librte_virtio/rte_virtio.h | 245 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 245 insertions(+)
create mode 100644 lib/librte_virtio/rte_virtio.h
diff --git a/lib/librte_virtio/rte_virtio.h b/lib/librte_virtio/rte_virtio.h
new file mode 100644
index 0000000..0203d5e
--- /dev/null
+++ b/lib/librte_virtio/rte_virtio.h
@@ -0,0 +1,245 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) Intel Corporation.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stddef.h>
+#include <stdint.h>
+#include <linux/vhost.h>
+
+/** Single memory region. Both physically and virtually contiguous */
+struct rte_virtio_mem_region {
+ uint64_t guest_phys_addr;
+ uint64_t guest_user_addr;
+ uint64_t host_user_addr;
+ uint64_t size;
+ void *mmap_addr;
+ uint64_t mmap_size;
+ int fd;
+};
+
+struct rte_virtio_memory {
+ uint32_t nregions;
+ struct rte_virtio_mem_region regions[];
+};
+
+/**
+ * Vhost device created and managed by rte_virtio. Accessible via
+ * \c rte_virtio_tgt_ops callbacks. This is only a part of the real
+ * vhost device data. This struct is published just for inline vdev
+ * functions to access their data directly.
+ */
+struct rte_virtio_dev {
+ struct rte_virtio_memory *mem;
+ uint64_t features;
+};
+
+/**
+ * Virtqueue created and managed by rte_virtio. Accessible via
+ * \c rte_virtio_tgt_ops callbacks.
+ */
+struct rte_virtio_vq {
+ struct vring_desc *desc;
+ struct vring_avail *avail;
+ struct vring_used *used;
+ /* available only if F_LOG_ALL has been negotiated */
+ void *log;
+ uint16_t size;
+};
+
+/**
+ * Device/queue related callbacks, all optional. Provided callback
+ * parameters are guaranteed not to be NULL until explicitly specified.
+ */
+struct rte_virtio_tgt_ops {
+ /** New initiator connected. */
+ void (*device_create)(struct rte_virtio_dev *vdev);
+ /**
+ * Device is ready to operate. vdev->mem is now available.
+ * This callback may be called multiple times as memory mappings
+ * can change dynamically. All queues are guaranteed to be stopped
+ * by now.
+ */
+ void (*device_init)(struct rte_virtio_dev *vdev);
+ /**
+ * Features have changed in runtime. Queues might be still running
+ * at this point.
+ */
+ void (*device_features_changed)(struct rte_virtio_dev *vdev);
+ /**
+ * Start processing vq. The `vq` is guaranteed not to be modified before
+ * `queue_stop` is called.
+ */
+ void (*queue_start)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+ /**
+ * Stop processing vq. It shouldn't be accessed after this callback
+ * completes (via tgt_cb_complete). This can be called prior to shutdown
+ * or before actions that require changing vhost device/vq state.
+ */
+ void (*queue_stop)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+ /** Device disconnected. All queues are guaranteed to be stopped by now */
+ void (*device_destroy)(struct rte_virtio_dev *vdev);
+ /**
+ * Custom message handler. `vdev` and `vq` can be NULL. This is called
+ * for backend-specific actions. The `id` should be prefixed by the
+ * backend name (net/crypto/scsi) and `ctx` is message-specific data
+ * that should be available until tgt_cb_complete is called.
+ */
+ void (*custom_msg)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq,
+ char *id, void *ctx);
+
+ /**
+ * Interrupt handler, synchronous. If this callback is set to NULL,
+ * rte_virtio will hint the initiators not to send any interrupts.
+ */
+ void (*queue_kick)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+ /** Device config read, synchronous. */
+ int (*get_config)(struct rte_virtio_dev *vdev, uint8_t *config,
+ uint32_t config_len);
+ /** Device config changed by the driver, synchronous. */
+ int (*set_config)(struct rte_virtio_dev *vdev, uint8_t *config,
+ uint32_t offset, uint32_t len, uint32_t flags);
+};
+
+/**
+ * Registers a new vhost target accepting remote connections. Multiple
+ * transports are available. It is possible to create a Vhost-user
+ * Unix domain socket polling local connections or to connect to a physical
+ * Virtio device and install an interrupt handler.
+ * \param trtype type of the transport used, e.g. "PCI", "PCI-vhost-user",
+ * "PCI-vDPA", "vhost-user".
+ * \param trid identifier of the device. For PCI this would be the BDF address,
+ * for vhost-user the socket name.
+ * \param trctx additional data for the specified transport. Can be NULL.
+ * \param tgt_ops callbacks to be called upon reaching specific initialization
+ * states.
+ * \param features supported Virtio features. To be negotiated with the
+ * driver ones. rte_virtio will append a couple of generic feature bits
+ * which are required by the Virtio spec. TODO list these features here
+ * \return 0 on success, negative errno otherwise
+ */
+int rte_virtio_tgt_register(char *trtype, char *trid, void *trctx,
+ struct rte_virtio_tgt_ops *tgt_ops,
+ uint64_t features);
+
+/**
+ * Finish async device tgt ops callback. Unless a tgt op has been documented
+ * as 'synchronous' this function must be called at the end of the op handler.
+ * It can be called either before or after the op handler returns. rte_virtio
+ * won't call any further callbacks until the previous one has been finished.
+ * \param vdev vhost device
+ * \param rc 0 on success, negative errno otherwise.
+ */
+int rte_virtio_tgt_cb_complete(struct rte_virtio_dev *vdev, int rc);
+
+/**
+ * Unregisters a vhost target asynchronously.
+ * \param cb_fn callback to be called on finish
+ * \param cb_arg argument for \c cb_fn
+ */
+void rte_virtio_tgt_unregister(char *trid,
+ void (*cb_fn)(void *arg), void *cb_arg);
+
+/**
+ * Bypass F_IOMMU_PLATFORM and translate gpa directly.
+ * \param mem vhost device memory
+ * \param gpa guest physical address
+ * \param len length of the memory to translate (in bytes). If the requested
+ * memory chunk crosses a memory region boundary, *len will be set to
+ * the remaining maximum length of virtually contiguous memory. In that
+ * case the user will need to call gpa_to_vva(gpa + *len) again.
+ * \return vhost virtual address or NULL if requested `gpa` is not mapped.
+ */
+static inline void *
+rte_virtio_gpa_to_vva(struct rte_virtio_memory *mem, uint64_t gpa, uint64_t *len)
+{
+ struct rte_virtio_mem_region *r;
+ uint32_t i;
+
+ for (i = 0; i < mem->nregions; i++) {
+ r = &mem->regions[i];
+ if (gpa >= r->guest_phys_addr &&
+ gpa < r->guest_phys_addr + r->size) {
+
+ if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
+ *len = r->guest_phys_addr + r->size - gpa;
+ }
+
+ return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
+ r->host_user_addr);
+ }
+ }
+ *len = 0;
+
+ return NULL;
+}
+
+/**
+ * Translate I/O virtual address to vhost address space.
+ * If F_IOMMU_PLATFORM has been negotiated, this might potentially
+ * send a TLB miss and wait for the TLB update response.
+ * If F_IOMMU_PLATFORM has not been negotiated, `iova` is
+ * a physical address and `perm` is ignored.
+ * \param vdev vhost device
+ * \param iova I/O virtual address
+ * \param len length of the memory to translate (in bytes). If the requested
+ * memory chunk crosses a memory region boundary, *len will be set to
+ * the remaining maximum length of virtually contiguous memory. In that
+ * case the user will need to call iova_to_vva(iova + *len) again.
+ * \param perm VHOST_ACCESS_RO, VHOST_ACCESS_WO or VHOST_ACCESS_RW
+ * \return vhost virtual address or NULL if requested `iova` is not mapped
+ * or the `perm` doesn't match.
+ */
+void *__vhost_iova_to_vva(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq,
+ uint64_t iova, uint64_t *len, uint8_t perm);
+
+static inline void *
+rte_virtio_iova_to_vva(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq,
+ uint64_t iova, uint64_t *len, uint8_t perm)
+{
+ if (!(vdev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
+ return rte_virtio_gpa_to_vva(vdev->mem, iova, len);
+
+ return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
+}
+
+/**
+ * Notify the driver about a vq change. This is an eventfd_write for vhost-user
+ * or an MMIO write for PCI devices.
+ */
+void rte_virtio_dev_call(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+
+/**
+ * Notify the driver about a device config change. This will result in \c
+ * rte_virtio_tgt_ops->get_config being called. This is an eventfd_write
+ * for vhost-user or an MMIO write for PCI devices.
+ */
+void rte_virtio_dev_cfg_call(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+
--
2.7.4
[parent not found: <20180510163643.GD9308@stefanha-x1.localdomain>]
* Re: [dpdk-dev] [RFC] vhost: new rte_vhost API proposal
[not found] ` <20180510163643.GD9308@stefanha-x1.localdomain>
@ 2018-05-11 5:55 ` Stojaczyk, DariuszX
[not found] ` <20180511100531.GA19894@stefanha-x1.localdomain>
0 siblings, 1 reply; 17+ messages in thread
From: Stojaczyk, DariuszX @ 2018-05-11 5:55 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: dev, Maxime Coquelin, Bie, Tiwei, Tetsuya Mukawa,
Thomas Monjalon, yliu, Liu, Changpeng
Hi,
> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> Sent: Friday, May 11, 2018 12:37 AM
> On Thu, May 10, 2018 at 03:22:53PM +0200, Dariusz Stojaczyk wrote:
> > rte_virtio would offer both vhost and virtio driver APIs. These two
> > have a lot of common code for vhost-user handling or PCI access for
> > initiator/virtio-vhost-user (and possibly vDPA) so there's little
> > sense to keep target and initiator code separated between different
> > libs. Of course, the APIs would be separate - only some parts of
> > the code would be shared.
>
> The API below seems to be for vhost backends (aka slaves). rte_virtio_*
> is a misnomer because vhost and virtio are two different things. This
> is not for implementing virtio devices, it's specifically for vhost
> devices.
I agree it's named a bit off if we're talking about vhost. My idea was to introduce a generic library for userspace Virtio processing and that's where the name came from. Even when you use just the vhost API that's introduced here, you are only required to implement vring processing, config access, and possibly interrupt handling, all of which are typical Virtio things. The vhost logic is hidden inside.
>
> Vhost does not offer the full virtio device model - otherwise it would
> be just another transport in the VIRTIO specification. Instead vhost is
> a protocol for vhost devices, which are subsets of virtio devices.
>
> I suggest calling it rte_vhost2 since it's basically a new, incompatible
> rte_vhost API.
rte_vhost2 sounds better for what we have now, but would that name still be valid once we add true Virtio driver functionality? (I believe it's called the Virtio PMD in DPDK right now.) That driver would reuse a lot of the vhost code for PCI and vhost-user, so it makes some sense to put these two together.
I don't think rte_vhost2 is a permanent name anyway, so maybe we could call it like so for now, and rename it later once I introduce that additional Virtio functionality? Would that work?
>
> Also, the initiator/target terminology does not match the vhost-user
> specification. It uses master/client and slave/backend/server. Adding
> another pair of words makes things more confusing. Please stick to the
> words used by the spec.
Ack.
>
> > +/**
> > + * Device/queue related callbacks, all optional. Provided callback
> > + * parameters are guaranteed not to be NULL until explicitly specified.
>
> s/until/unless/ ?
Ack.
> > + /**
> > + * Stop processing vq. It shouldn't be accessed after this callback
> > + * completes (via tgt_cb_complete). This can be called prior to shutdown
>
> s/tgt_cb_complete/rte_virtio_tgt_cb_complete/
Ack.
>
> > + * or before actions that require changing vhost device/vq state.
> > + */
> > + void (*queue_stop)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
> > + /** Device disconnected. All queues are guaranteed to be stopped by now */
> > + void (*device_destroy)(struct rte_virtio_dev *vdev);
> > + /**
> > + * Custom message handler. `vdev` and `vq` can be NULL. This is called
> > + * for backend-specific actions. The `id` should be prefixed by the
>
> Since vdev can be NULL, does this mean custom_msg() may be invoked at
> any time during the lifecycle and even before/after
> device_create()/device_destroy()?
Theoretically. I was thinking of some poorly-written backends notifying they're out of internal resources, but I agree it's just poor. I'll remove the `vdev can be NULL` part.
> > + */
> > + void (*custom_msg)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq,
> > + char *id, void *ctx);
>
> What is the purpose of id and why is it char* instead of const char*?
Ack, it should be const (the same goes for every other char* in this patch).
For example vhost-crypto introduces two new vhost-user messages for initializing and destroying crypto session. The underlying vhost-crypto vhost-user backend after receiving such message could execute this callback as follows:
struct my_crypto_data *data = calloc(1, sizeof(*data));
[...]
ops->custom_msg(vdev, NULL, "crypto_sess_init", data);
>
> Is ctx the "message"? If ctx is untrusted message data from an external
> process, why is there no size argument? Who validates the message size?
Ack. Will add size parameter.
>
> > +
> > + /**
> > + * Interrupt handler, synchronous. If this callback is set to NULL,
> > + * rte_virtio will hint the initiators not to send any interrupts.
> > + */
> > + void (*queue_kick)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
>
> Devices often have multiple types of queues. Some of them may be
> suitable for polling, others may be suitable for an interrupt-driven
> model. Is there a way to enable/disable interrupts for specific queues?
Thanks, I didn't think of that. I'll need to move the responsibility of setting vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT to the user.
>
> > + /** Device config read, synchronous. */
>
> What is the meaning of the return value?
How about the following:
\return 0 if `config` has been successfully set, -1 otherwise.
An error (-1) is propagated all the way to the master so it can handle it its own way.
>
> > + int (*get_config)(struct rte_virtio_dev *vdev, uint8_t *config,
> > + uint32_t config_len);
> > + /** Device config changed by the driver, synchronous. */
>
> What is the meaning of the return value?
>
> What is the meaning of the flags?
Good call. I actually can't find any doc/usage of this API.
Changpeng (the original get/set_config author, +CC'ed), could you document this function briefly here?
>
> > + int (*set_config)(struct rte_virtio_dev *vdev, uint8_t *config,
> > + uint32_t offset, uint32_t len, uint32_t flags);
> > +};
> > +
> > +/**
> > + * Registers a new vhost target accepting remote connections. Multiple
> > + * available transports are available. It is possible to create a Vhost-user
> > + * Unix domain socket polling local connections or connect to a physical
> > + * Virtio device and install an interrupt handler .
> > + * \param trtype type of the transport used, e.g. "PCI", "PCI-vhost-user",
> > + * "PCI-vDPA", "vhost-user".
> > + * \param trid identifier of the device. For PCI this would be the BDF address,
> > + * for vhost-user the socket name.
> > + * \param trctx additional data for the specified transport. Can be NULL.
> > + * \param tgt_ops callbacks to be called upon reaching specific initialization
> > + * states.
> > + * \param features supported Virtio features. To be negotiated with the
> > + * driver ones. rte_virtio will append a couple of generic feature bits
> > + * which are required by the Virtio spec. TODO list these features here
> > + * \return 0 on success, negative errno otherwise
> > + */
> > +int rte_virtio_tgt_register(char *trtype, char *trid, void *trctx,
> > + struct rte_virtio_tgt_ops *tgt_ops,
> > + uint64_t features);
> > +
> > +/**
> > + * Finish async device tgt ops callback. Unless a tgt op has been documented
> > + * as 'synchronous' this function must be called at the end of the op handler.
> > + * It can be called either before or after the op handler returns. rte_virtio
> > + * won't call any callbacks while another one hasn't been finished yet.
> > + * \param vdev vhost device
> > + * \param rc 0 on success, negative errno otherwise.
> > + */
> > +int rte_virtio_tgt_cb_complete(struct rte_virtio_dev *vdev, int rc);
>
> How can this function fail and how is the caller supposed to handle
> failure?
If -1 is returned, the current callback will be treated as failed. So if `device_create` is completed with rc != 0, this lib will tear down the device and no subsequent `device_destroy` will be called. A similar thing goes for queues - if a queue failed to start, it won't need to be stopped. Since you pointed it out, I'll mention it somewhere in the doc. I didn't do so in the first place because it's analogous to how rte_vhost works now.
>
> Are there any thread-safety rules regarding this API? Can it be called
> from a different thread than the callback?
Yes. I should mention it.
>
> > +
> > +/**
> > + * Unregisters a vhost target asynchronously.
>
> How are existing device instances affected?
Ack. How about:
All active queues will be stopped and all devices destroyed.
This is analogous to what rte_vhost has now.
>
> > + * \param cb_fn callback to be called on finish
> > + * \param cb_arg argument for \c cb_fn
> > + */
> > +void rte_virtio_tgt_unregister(char *trid,
>
> One of the rte_vhost API limitations is that the ID namespace is shared
> between transports. The same seems to be the case here.
>
> It assumes that "PCI", "PCI-vhost-user", "PCI-vDPA", and "vhost-user"
> are never instantiated with the same trid. UNIX domain sockets can have
> arbitrary filenames (that resemble a PCI BDF). And who knows what other
> transports will be added in the future.
>
> I think namespace collisions could be a problem.
Ack, I'll add `char *trtype` param to the unregister func.
>
> > + void (*cb_fn)(void *arg), void *cb_arg);
>
> Which thread is the callback invoked from?
It'll be called from the same thread that calls rte_virtio_tgt_ops. I'll mention it in the doc, thanks.
rte_virtio_tgt_unregister should also return an error code for cases where a device with given trtype/trid couldn't be found. It'll prevent some implementations from waiting endlessly for cb_fn to be called.
Thanks,
D.
* [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal
2018-05-10 13:22 [dpdk-dev] [RFC] vhost: new rte_vhost API proposal Dariusz Stojaczyk
[not found] ` <20180510163643.GD9308@stefanha-x1.localdomain>
@ 2018-05-18 13:01 ` Dariusz Stojaczyk
2018-05-18 13:50 ` Maxime Coquelin
` (2 more replies)
1 sibling, 3 replies; 17+ messages in thread
From: Dariusz Stojaczyk @ 2018-05-18 13:01 UTC (permalink / raw)
To: dev, Maxime Coquelin, Tiwei Bie, Tetsuya Mukawa, Thomas Monjalon
Cc: yliu, Stefan Hajnoczi, James Harris, Dariusz Stojaczyk
rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
already been confirmed not to work with rte_vhost. virtio-user-scsi-pci
in QEMU 2.12 doesn't fully initialize its management queues at the SeaBIOS
stage. This is perfectly fine from the Vhost-user spec perspective, but
doesn't meet rte_vhost's expectations. rte_vhost waits for all queues
to be fully initialized before it allows the entire device to be
processed. Fixing rte_vhost directly would require a substantial number
of changes, which would completely break backwards compatibility.
This rte_vhost2 library is intended to smooth out the transition.
It exposes a low-level API for implementing new Vhost-user slaves.
The existing rte_vhost is about to be refactored to use the rte_vhost2
library underneath, and demanding backends will be able to use
rte_vhost2 directly.
We're designing a fresh library here, so this opens up a great
amount of possibilities and improvements we can make to the public
API that will pay off significantly for the sake of future
specification/library extensions.
rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
and allows developers to implement Vhost devices with ease.
It calls user-provided callbacks once a proper device initialization
state has been reached. That is - memory mappings have changed,
virtqueues are ready to be processed, features have changed at
runtime, etc.
Compared to rte_vhost, this lib additionally allows the following:
* ability to start/stop particular queues
* most callbacks are now asynchronous - this greatly simplifies
event handling for asynchronous applications and doesn't
make anything harder for synchronous ones.
* this is a low-level API. It doesn't have any vhost-net, nvme
or crypto references. These backend-specific libraries will
be refactored later to use *this* generic library underneath.
This implies that the library doesn't do any virtqueue processing;
it only delivers vring addresses to the user, so the application
can process virtqueues by itself.
* abstracting away Unix domain sockets (vhost-user) and PCI
(virtio-vhost-user).
* the API constrains how public functions can be called and how
internal data can change, so only minimal work is required
to ensure thread safety. Possibly no mutexes are required at all.
* full Virtio 1.0/vhost-user specification compliance.
This patch only introduces the API. Some additional functions
for vDPA might be still required, but everything present here
so far shouldn't need changing.
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
---
lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 331 insertions(+)
create mode 100644 lib/librte_vhost2/rte_vhost2.h
diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
new file mode 100644
index 0000000..385b093
--- /dev/null
+++ b/lib/librte_vhost2/rte_vhost2.h
@@ -0,0 +1,331 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) Intel Corporation.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_VHOST2_H_
+#define _RTE_VHOST2_H_
+
+/**
+ * @file
+ * This library abstracts away most Vhost-user/virtio-vhost-user specifics
+ * and allows developers to implement Vhost devices with ease.
+ * It calls user-provided callbacks once a proper device initialization
+ * state has been reached. That is - memory mappings have changed,
+ * virtqueues are ready to be processed, features have changed at runtime, etc.
+ */
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Not C++-aware. */
+#include <linux/vhost.h>
+
+#define RTE_VHOST2_MEMORY_MAX_NREGIONS 8
+
+#define RTE_VHOST2_CLIENT (1ULL << 0)
+#define RTE_VHOST2_NO_RECONNECT (1ULL << 1)
+
+enum rte_vhost2_set_config_type {
+ /** Config changed on request by the vhost driver. */
+ VHOST_SET_CONFIG_TYPE_MASTER = 0,
+ /** Config is being restored after a successful migration. */
+ VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
+};
+
+#define RTE_VHOST2_MSG_VERSION_MASK (0x3)
+#define RTE_VHOST2_MSG_REPLY_MASK (0x1 << 2)
+#define RTE_VHOST2_MSG_NEED_REPLY (0x1 << 3)
+
+struct rte_vhost2_msg {
+ uint32_t id;
+ uint32_t flags;
+ uint32_t size; /**< The following payload size. */
+ void *payload;
+ int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
+};
+
+/** Single memory region. Both physically and virtually contiguous */
+struct rte_vhost2_mem_region {
+ uint64_t guest_phys_addr;
+ uint64_t guest_user_addr;
+ uint64_t host_user_addr;
+ uint64_t size;
+ void *mmap_addr;
+ uint64_t mmap_size;
+ int fd;
+};
+
+struct rte_vhost2_memory {
+ uint32_t nregions;
+ struct rte_vhost2_mem_region regions[];
+};
+
+/**
+ * Vhost device created and managed by rte_vhost2. Accessible via
+ * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
+ * vhost device data. This struct is published just for inline vdev
+ * functions to access their data directly.
+ */
+struct rte_vhost2_dev {
+ struct rte_vhost2_memory *mem;
+ bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
+};
+
+/**
+ * Virtqueue created and managed by rte_vhost2. Accessible via
+ * \c rte_vhost2_tgt_ops callbacks.
+ */
+struct rte_vhost2_vq {
+ struct vring_desc *desc;
+ struct vring_avail *avail;
+ struct vring_used *used;
+ /* available only if \c VHOST_F_LOG_ALL has been negotiated */
+ void *log;
+ uint16_t size;
+};
+
+/**
+ * Device/queue related callbacks, all optional. Provided callback
+ * parameters are guaranteed not to be NULL unless explicitly specified.
+ */
+struct rte_vhost2_tgt_ops {
+ /**
+ * New driver connected. If this is completed with a non-zero status,
+ * rte_vhost2 will terminate the connection.
+ */
+ void (*device_create)(struct rte_vhost2_dev *vdev);
+ /**
+ * Device is ready to operate. vdev data is now initialized. This callback
+ * may be called multiple times as e.g. memory mappings can change
+ * dynamically. All queues are guaranteed to be stopped by now.
+ */
+ void (*device_init)(struct rte_vhost2_dev *vdev);
+ /**
+ * Features have changed in runtime. This is called at least once during
+ * initialization before `device_init`. Queues might be still running
+ * at this point.
+ */
+ void (*device_features_changed)(struct rte_vhost2_dev *vdev,
+ uint64_t features);
+ /**
+ * Start processing vq. The `vq` is guaranteed not to be modified before
+ * `queue_stop` is called.
+ */
+ void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+ /**
+ * Stop processing vq. It shouldn't be accessed after this callback
+ * completes (via \c rte_vhost2_tgt_cb_complete). This can be called
+ * prior to shutdown or before actions that require changing vhost
+ * device/vq state.
+ */
+ void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+ /** Device disconnected. All queues are guaranteed to be stopped by now */
+ void (*device_destroy)(struct rte_vhost2_dev *vdev);
+ /**
+ * Custom vhost-user message handler. This is called for
+ * backend-specific messages (net/crypto/scsi) that weren't recognized
+ * by the generic message parser. `msg` is available until
+ * \c rte_vhost2_tgt_cb_complete is called.
+ */
+ void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
+
+ /** Interrupt handler, synchronous. */
+ void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+ /**
+ * Full device config read, synchronous. Return 0 if `len` bytes of
+ * `config` have been successfully set, -1 otherwise.
+ */
+ int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
+ uint32_t len);
+ /**
+ * Device config changed by the driver, synchronous. `type` indicates
+ * the reason of change.
+ */
+ int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
+ uint32_t offset, uint32_t len,
+ enum rte_vhost2_set_config_type type);
+
+ void *reserved[8]; /**< Reserved for future extension */
+};
+
+/**
+ * Registers a new vhost target accepting remote connections. Multiple
+ * transports are available. It is possible to create a Vhost-user
+ * Unix domain socket polling for local connections, or to connect to a
+ * physical Virtio device and install an interrupt handler.
+ *
+ * This function is thread-safe.
+ *
+ * \param trtype type of the transport used, e.g. "vhost-user",
+ * "PCI-vhost-user", "PCI-vDPA".
+ * \param trid identifier of the device. For PCI this would be the BDF address,
+ * for vhost-user the socket name.
+ * \param trflags additional options for the specified transport
+ * \param trctx additional data for the specified transport. Can be NULL.
+ * \param tgt_ops callbacks to be called upon reaching specific initialization
+ * states.
+ * \param features supported Virtio features. To be negotiated with the
+ * driver ones. rte_vhost2 will append a couple of generic feature bits
+ * which are required by the Virtio spec. TODO list these features here
+ * \return 0 on success, negative errno otherwise
+ */
+int rte_vhost2_tgt_register(const char *trtype, const char *trid,
+ uint64_t trflags, void *trctx,
+ struct rte_vhost2_tgt_ops *tgt_ops,
+ uint64_t features);
+
+/**
+ * Finish async device tgt ops callback. Unless a tgt op has been documented
+ * as 'synchronous' this function must be called at the end of the op handler.
+ * It can be called either before or after the op handler returns. rte_vhost2
+ * won't call any tgt ops callbacks while another one hasn't been finished yet.
+ *
+ * This function is thread-safe.
+ *
+ * \param vdev vhost device
+ * \param rc 0 on success, negative errno otherwise. If a non-zero value is
+ * given, the current callback will be perceived as failed. A queue that failed
+ * to start won't need to be stopped.
+ */
+void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
+
+/**
+ * Unregisters a vhost target asynchronously. All active queues will be stopped
+ * and all devices destroyed.
+ *
+ * This function is thread-safe.
+ *
+ * \param cb_fn callback to be called on finish. It'll be called from the same
+ * thread that calls \c rte_vhost2_tgt_ops.
+ * \param cb_arg argument for \c cb_fn
+ * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
+ * if a non-zero value is returned.
+ */
+int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
+ void (*cb_fn)(void *arg), void *cb_arg);
+
+/**
+ * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
+ *
+ * This function is thread-safe.
+ *
+ * \param mem vhost device memory
+ * \param gpa guest physical address
+ * \param len length of the memory to translate (in bytes). If requested
+ * memory chunk crosses memory region boundary, the *len will be set to
+ * the remaining, maximum length of virtually contiguous memory. In such
+ * case the user will be required to call another gpa_to_vva(gpa + *len).
+ * \return vhost virtual address or NULL if requested `gpa` is not mapped.
+ */
+static inline void *
+rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
+{
+ struct rte_vhost2_mem_region *r;
+ uint32_t i;
+
+ for (i = 0; i < mem->nregions; i++) {
+ r = &mem->regions[i];
+ if (gpa >= r->guest_phys_addr &&
+ gpa < r->guest_phys_addr + r->size) {
+
+ if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
+ *len = r->guest_phys_addr + r->size - gpa;
+ }
+
+ return (void *)(uintptr_t)(gpa - r->guest_phys_addr + r->host_user_addr);
+ }
+ }
+ *len = 0;
+
+ return NULL;
+}
+
+/**
+ * Translate I/O virtual address to vhost address space.
+ *
+ * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might potentially send
+ * a TLB miss and wait for the TLB update response.
+ * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
+ * address and `perm` is ignored.
+ *
+ * This function is thread-safe.
+ *
+ * \param vdev vhost device
+ * \param vq virtqueue. Must be started.
+ * \param iova I/O virtual address
+ * \param len length of the memory to translate (in bytes). If requested
+ * memory chunk crosses memory region boundary, the *len will be set to
+ * the remaining, maximum length of virtually contiguous memory. In such
+ * case the user will be required to call another iova_to_vva(iova + *len).
+ * \param perm VHOST_ACCESS_RO,VHOST_ACCESS_WO or VHOST_ACCESS_RW
+ * \return vhost virtual address or NULL if requested `iova` is not mapped
+ * or the `perm` doesn't match.
+ */
+static inline void *
+rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
+ uint64_t iova, uint64_t *len, uint8_t perm)
+{
+ void *__vhost_iova_to_vva(struct rte_vhost2_dev *dev,
+ struct rte_vhost2_vq *vq, uint64_t iova,
+ uint64_t *len, uint8_t perm);
+
+ if (!vdev->iommu) {
+ return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
+ }
+
+ return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
+}
+
+/**
+ * Notify the driver about vq change. This is an eventfd_write for vhost-user
+ * or MMIO write for PCI devices.
+ *
+ * \param vdev vhost device
+ * \param vq virtqueue. Must be started.
+ */
+void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+
+/**
+ * Notify the driver about device config change. This will result in \c
+ * rte_vhost2_tgt_ops->get_config being called.
+ *
+ * \param vdev vhost device
+ */
+void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_VHOST2_H_ */
--
2.7.4
* Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal
2018-05-18 13:01 ` [dpdk-dev] [RFC v2] " Dariusz Stojaczyk
@ 2018-05-18 13:50 ` Maxime Coquelin
2018-05-20 7:07 ` Yuanhan Liu
2018-05-22 10:19 ` Stojaczyk, DariuszX
[not found] ` <20180525100550.GD14757@stefanha-x1.localdomain>
[not found] ` <20180607151227.23660-1-darek.stojaczyk@gmail.com>
2 siblings, 2 replies; 17+ messages in thread
From: Maxime Coquelin @ 2018-05-18 13:50 UTC (permalink / raw)
To: Dariusz Stojaczyk, dev, Tiwei Bie, Tetsuya Mukawa, Thomas Monjalon
Cc: yliu, Stefan Hajnoczi, James Harris
Hi Dariusz,
On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed. Fixing rte_vhost directly would require quite a big amount
> of changes, which would completely break backwards compatibility.
>
> This rte_vhost2 library is intended to smooth out the transition.
> It exposes a low-level API for implementing new Vhost-user slaves.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> library underneath, and demanding backends could now use rte_vhost2
> directly.
I like the idea, and the proposed way to smooth the transition.
I will certainly have other comments later, but please find below
the ones I have for the moment.
> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
>
> rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
> and allows developers to implement Vhost devices with ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed in
> runtime, etc.
>
> Compared to the rte_vhost, this lib additionally allows the following:
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
> the event handling for asynchronous applications and doesn't
> make anything harder for synchronous ones.
> * this is low-level API. It doesn't have any vhost-net, nvme
> or crypto references. These backend-specific libraries will
> be later refactored to use *this* generic library underneath.
> This implies that the library doesn't do any virtqueue processing,
> it only delivers vring addresses to the user, who can then process
> virtqueues on their own.
> * abstracting away Unix domain sockets (vhost-user) and PCI
> (virtio-vhost-user).
> * The API imposes how public functions can be called and how
> internal data can change, so there's only a minimal work required
> to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
>
> This patch only introduces the API. Some additional functions
> for vDPA might be still required, but everything present here
> so far shouldn't need changing.
>
> Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
> ---
> lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 331 insertions(+)
> create mode 100644 lib/librte_vhost2/rte_vhost2.h
>
> diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
> new file mode 100644
> index 0000000..385b093
> --- /dev/null
> +++ b/lib/librte_vhost2/rte_vhost2.h
> @@ -0,0 +1,331 @@
> +/*-
> + * BSD LICENSE
> + *
> + * Copyright (c) Intel Corporation.
> + * All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + *
> + * * Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * * Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in
> + * the documentation and/or other materials provided with the
> + * distribution.
> + * * Neither the name of Intel Corporation nor the names of its
> + * contributors may be used to endorse or promote products derived
> + * from this software without specific prior written permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_VHOST2_H_
> +#define _RTE_VHOST2_H_
> +
> +/**
> + * @file
> + * This library abstracts away most Vhost-user/virtio-vhost-user specifics
> + * and allows developers to implement Vhost devices with ease.
> + * It calls user-provided callbacks once proper device initialization
> + * state has been reached. That is - memory mappings have changed,
> + * virtqueues are ready to be processed, features have changed in runtime, etc.
> + */
> +
> +#include <stdint.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/* Not C++-aware. */
> +#include <linux/vhost.h>
> +
> +#define RTE_VHOST2_MEMORY_MAX_NREGIONS 8
> +
> +#define RTE_VHOST2_CLIENT (1ULL << 0)
> +#define RTE_VHOST2_NO_RECONNECT (1ULL << 1)
> +
> +enum rte_vhost2_set_config_type {
> + /** Config changed on request by the vhost driver. */
> + VHOST_SET_CONFIG_TYPE_MASTER = 0,
> + /** Config is being restored after a successful migration. */
> + VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> +};
> +
> +#define RTE_VHOST2_MSG_VERSION_MASK (0x3)
> +#define RTE_VHOST2_MSG_REPLY_MASK (0x1 << 2)
> +#define RTE_VHOST2_MSG_NEED_REPLY (0x1 << 3)
> +
> +struct rte_vhost2_msg {
> + uint32_t id;
> + uint32_t flags;
> + uint32_t size; /**< The following payload size. */
> + void *payload;
> + int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> +};
> +
> +/** Single memory region. Both physically and virtually contiguous */
> +struct rte_vhost2_mem_region {
> + uint64_t guest_phys_addr;
> + uint64_t guest_user_addr;
> + uint64_t host_user_addr;
> + uint64_t size;
> + void *mmap_addr;
> + uint64_t mmap_size;
> + int fd;
> +};
> +
> +struct rte_vhost2_memory {
> + uint32_t nregions;
> + struct rte_vhost2_mem_region regions[];
> +};
> +
> +/**
> + * Vhost device created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
> + * vhost device data. This struct is published just for inline vdev
> + * functions to access their data directly.
> + */
> +struct rte_vhost2_dev {
> + struct rte_vhost2_memory *mem;
> + bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
> +};
> +
> +/**
> + * Virtqueue created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks.
> + */
> +struct rte_vhost2_vq {
> + struct vring_desc *desc;
> + struct vring_avail *avail;
> + struct vring_used *used;
> + /* available only if \c VHOST_F_LOG_ALL has been negotiated */
> + void *log;
> + uint16_t size;
> +};
> +
> +/**
> + * Device/queue related callbacks, all optional. Provided callback
> + * parameters are guaranteed not to be NULL unless explicitly specified.
> + */
> +struct rte_vhost2_tgt_ops {
> + /**
> + * New driver connected. If this is completed with a non-zero status,
> + * rte_vhost2 will terminate the connection.
> + */
> + void (*device_create)(struct rte_vhost2_dev *vdev);
> + /**
> + * Device is ready to operate. vdev data is now initialized. This callback
> + * may be called multiple times as e.g. memory mappings can change
> + * dynamically. All queues are guaranteed to be stopped by now.
> + */
> + void (*device_init)(struct rte_vhost2_dev *vdev);
> + /**
> + * Features have changed in runtime. This is called at least once during
> + * initialization before `device_init`. Queues might be still running
> + * at this point.
> + */
> + void (*device_features_changed)(struct rte_vhost2_dev *vdev,
> + uint64_t features);
> + /**
> + * Start processing vq. The `vq` is guaranteed not to be modified before
> + * `queue_stop` is called.
> + */
> + void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> + /**
> + * Stop processing vq. It shouldn't be accessed after this callback
> + * completes (via \c rte_vhost2_tgt_cb_complete). This can be called
> + * prior to shutdown or before actions that require changing vhost
> + * device/vq state.
> + */
> + void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> + /** Device disconnected. All queues are guaranteed to be stopped by now */
> + void (*device_destroy)(struct rte_vhost2_dev *vdev);
> + /**
> + * Custom vhost-user message handler. This is called for
> + * backend-specific messages (net/crypto/scsi) that weren't recognized
> + * by the generic message parser. `msg` is available until
> + * \c rte_vhost2_tgt_cb_complete is called.
> + */
> + void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
> +
> + /** Interrupt handler, synchronous. */
> + void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> + /**
> + * Full device config read, synchronous. Return 0 if `len` bytes of
> + * `config` have been successfully set, -1 otherwise.
> + */
> + int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> + uint32_t len);
> + /**
> + * Device config changed by the driver, synchronous. `type` indicates
> + * the reason of change.
> + */
> + int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> + uint32_t offset, uint32_t len,
> + enum rte_vhost2_set_config_type type);
> +
> + void *reserved[8]; /**< Reserved for future extension */
> +};
> +
> +/**
> + * Registers a new vhost target accepting remote connections. Multiple
> + * transports are available. It is possible to create a Vhost-user
> + * Unix domain socket polling for local connections, or to connect to a
> + * physical Virtio device and install an interrupt handler.
> + *
> + * This function is thread-safe.
> + *
> + * \param trtype type of the transport used, e.g. "vhost-user",
> + * "PCI-vhost-user", "PCI-vDPA".
> + * \param trid identifier of the device. For PCI this would be the BDF address,
> + * for vhost-user the socket name.
> + * \param trflags additional options for the specified transport
> + * \param trctx additional data for the specified transport. Can be NULL.
> + * \param tgt_ops callbacks to be called upon reaching specific initialization
> + * states.
> + * \param features supported Virtio features. To be negotiated with the
> + * driver ones. rte_vhost2 will append a couple of generic feature bits
> + * which are required by the Virtio spec. TODO list these features here
> + * \return 0 on success, negative errno otherwise
> + */
> +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> + uint64_t trflags, void *trctx,
> + struct rte_vhost2_tgt_ops *tgt_ops,
> + uint64_t features);
Couldn't the register API also pass the vdev?
Doing this, the backend could have rte_vhost2_dev in its device
struct.
> +
> +/**
> + * Finish async device tgt ops callback. Unless a tgt op has been documented
> + * as 'synchronous' this function must be called at the end of the op handler.
> + * It can be called either before or after the op handler returns. rte_vhost2
> + * won't call any tgt ops callbacks while another one hasn't been finished yet.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param rc 0 on success, negative errno otherwise. If non-zero value is
> + * given, the current callback will be perceived as failed. A queue that failed
> + * to start won't need to be stopped.
> + */
> +void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
> +
> +/**
> + * Unregisters a vhost target asynchronously. All active queues will be stopped
> + * and all devices destroyed.
> + *
> + * This function is thread-safe.
> + *
> + * \param cb_fn callback to be called on finish. It'll be called from the same
> + * thread that calls \c rte_vhost2_tgt_ops.
> + * \param cb_arg argument for \c cb_fn
> + * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
> + * if non-zero value is returned.
> + */
> +int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
> + void (*cb_fn)(void *arg), void *cb_arg);
> +
> +/**
> + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> + *
> + * This function is thread-safe.
> + *
> + * \param mem vhost device memory
> + * \param gpa guest physical address
> + * \param len length of the memory to translate (in bytes). If requested
> + * memory chunk crosses memory region boundary, the *len will be set to
> + * the remaining, maximum length of virtually contiguous memory. In such
> + * case the user will be required to call another gpa_to_vva(gpa + *len).
> + * \return vhost virtual address or NULL if requested `gpa` is not mapped.
> + */
> +static inline void *
> +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
> +{
> + struct rte_vhost2_mem_region *r;
> + uint32_t i;
> +
> + for (i = 0; i < mem->nregions; i++) {
> + r = &mem->regions[i];
> + if (gpa >= r->guest_phys_addr &&
> + gpa < r->guest_phys_addr + r->size) {
> +
> + if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> + *len = r->guest_phys_addr + r->size - gpa;
> + }
> +
> + return gpa - r->guest_phys_addr + r->host_user_addr;
> + }
> + }
> + *len = 0;
> +
> + return 0;
> +}
Maybe we could take the opportunity to only have rte_vhost2_iova_to_vva.
> +/**
> + * Translate I/O virtual address to vhost address space.
> + *
> + * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might potentially send
> + * a TLB miss and wait for the TLB update response.
> + * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
> + * address and `perm` is ignored.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + * \param iova I/O virtual address
> + * \param len length of the memory to translate (in bytes). If requested
> + * memory chunk crosses memory region boundary, the *len will be set to
> + * the remaining, maximum length of virtually contiguous memory. In such
> + * case the user will be required to call another gpa_to_vva(gpa + *len).
> + * \param perm VHOST_ACCESS_RO,VHOST_ACCESS_WO or VHOST_ACCESS_RW
> + * \return vhost virtual address or NULL if requested `iova` is not mapped
> + * or the `perm` doesn't match.
> + */
> +static inline void *
> +rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
> + uint64_t iova, uint32_t *len, uint8_t perm)
> +{
> + void *__vhost_iova_to_vva(struct virtio_net * dev, struct vhost_virtqueue * vq,
> + uint64_t iova, uint64_t size, uint8_t perm);
> +
> + if (!vdev->iommu) {
> + return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
> + }
> +
> + return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
> +}
> +
> +/**
> + * Notify the driver about vq change. This is an eventfd_write for vhost-user
> + * or MMIO write for PCI devices.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + */
> +void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +
> +/**
> + * Notify the driver about device config change. This will result in \c
> + * rte_vhost2_tgt_ops->get_config being called.
> + *
> + * \param vdev vhost device
> + */
> +void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_VHOST2_H_ */
>
* Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal
2018-05-18 13:50 ` Maxime Coquelin
@ 2018-05-20 7:07 ` Yuanhan Liu
2018-05-22 10:19 ` Stojaczyk, DariuszX
1 sibling, 0 replies; 17+ messages in thread
From: Yuanhan Liu @ 2018-05-20 7:07 UTC (permalink / raw)
To: Maxime Coquelin
Cc: Dariusz Stojaczyk, dev, Tiwei Bie, Tetsuya Mukawa,
Thomas Monjalon, Stefan Hajnoczi, James Harris
On Fri, May 18, 2018 at 03:50:45PM +0200, Maxime Coquelin wrote:
> Hi Dariusz,
>
> On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> > rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> > been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> > in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> > stage. This is perfectly fine from the Vhost-user spec perspective, but
> > doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> > to be fully initialized before it allows the entire device to be
> > processed. Fixing rte_vhost directly would require quite a big amount
> > of changes, which would completely break backwards compatibility.
> >
> > This rte_vhost2 library is intended to smooth out the transition.
> > It exposes a low-level API for implementing new Vhost-user slaves.
> > The existing rte_vhost is about to be refactored to use rte_vhost2
> > library underneath, and demanding backends could now use rte_vhost2
> > directly.
>
> I like the idea, and the proposed way to smooth the transition.
Please be aware of that I just had a quick glimpse of this patch and it's
likely I don't have too much time to follow this.
However, I also like this idea. And thank you for working on it.
--yliu
* Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal
2018-05-18 13:50 ` Maxime Coquelin
2018-05-20 7:07 ` Yuanhan Liu
@ 2018-05-22 10:19 ` Stojaczyk, DariuszX
1 sibling, 0 replies; 17+ messages in thread
From: Stojaczyk, DariuszX @ 2018-05-22 10:19 UTC (permalink / raw)
To: Maxime Coquelin, dev, Bie, Tiwei, Tetsuya Mukawa, Thomas Monjalon
Cc: yliu, Stefan Hajnoczi, Harris, James R
Hi Maxime,
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, May 18, 2018 9:51 PM
> On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> > rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> > been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> > in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> > stage. This is perfectly fine from the Vhost-user spec perspective, but
> > doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> > to be fully initialized before it allows the entire device to be
> > processed. Fixing rte_vhost directly would require quite a big amount
> > of changes, which would completely break backwards compatibility.
> >
> > This rte_vhost2 library is intended to smooth out the transition.
> > It exposes a low-level API for implementing new Vhost-user slaves.
> > The existing rte_vhost is about to be refactored to use rte_vhost2
> > library underneath, and demanding backends could now use rte_vhost2
> > directly.
>
> I like the idea, and the proposed way to smooth the transition.
>
> I will certainly have other comments later, but please find below
> the ones I have for the moment.
>
> > <snip>
> > +
> > +/**
> > + * Registers a new vhost target accepting remote connections. Multiple
> > + * transports are available. It is possible to create a Vhost-user
> > + * Unix domain socket polling for local connections, or to connect to a
> > + * physical Virtio device and install an interrupt handler.
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param trtype type of the transport used, e.g. "vhost-user",
> > + * "PCI-vhost-user", "PCI-vDPA".
> > + * \param trid identifier of the device. For PCI this would be the BDF address,
> > + * for vhost-user the socket name.
> > + * \param trflags additional options for the specified transport
> > + * \param trctx additional data for the specified transport. Can be NULL.
> > + * \param tgt_ops callbacks to be called upon reaching specific initialization
> > + * states.
> > + * \param features supported Virtio features. To be negotiated with the
> > + * driver ones. rte_vhost2 will append a couple of generic feature bits
> > + * which are required by the Virtio spec. TODO list these features here
> > + * \return 0 on success, negative errno otherwise
> > + */
> > +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> > + uint64_t trflags, void *trctx,
> > + struct rte_vhost2_tgt_ops *tgt_ops,
> > + uint64_t features);
>
> Couldn't the register API also pass the vdev?
> Doing this, the backend could have rte_vhost2_dev in its device
> struct.
Please notice the register API is for registering targets, not devices. A single Vhost-user server target can spawn multiple devices - one for each connection. I know the nomenclature is different from rte_vhost, but since each connection uses its own (virt)queues it makes sense to call things this way.
Initially I thought about adding some rte_vhost2_tgt struct declaration that register function would return, but later on came to a conclusion that it would only complicate things for the library user. A parent struct that would keep rte_vhost2_tgt* needs to contain `const char *trtype` and `const char *trid` anyway, so it's just easier to use these two strings for target identification.
> > <snip>
> > +/**
> > + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param mem vhost device memory
> > + * \param gpa guest physical address
> > + * \param len length of the memory to translate (in bytes). If requested
> > + * memory chunk crosses memory region boundary, the *len will be set to
> > + * the remaining, maximum length of virtually contiguous memory. In such
> > + * case the user will be required to call another gpa_to_vva(gpa + *len).
> > + * \return vhost virtual address or NULL if requested `gpa` is not mapped.
> > + */
> > +static inline void *
> > +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
> > +{
> > + struct rte_vhost2_mem_region *r;
> > + uint32_t i;
> > +
> > + for (i = 0; i < mem->nregions; i++) {
> > + r = &mem->regions[i];
> > + if (gpa >= r->guest_phys_addr &&
> > + gpa < r->guest_phys_addr + r->size) {
> > +
> > + if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> > + *len = r->guest_phys_addr + r->size - gpa;
> > + }
> > +
> > + return gpa - r->guest_phys_addr + r->host_user_addr;
> > + }
> > + }
> > + *len = 0;
> > +
> > + return 0;
> > +}
>
> Maybe we could take the opportunity to only have
> rte_vhost2_iova_to_vva.
Good idea; will remove it in v3.
Thanks,
D.
[parent not found: <20180525100550.GD14757@stefanha-x1.localdomain>]
* Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal
[not found] ` <20180525100550.GD14757@stefanha-x1.localdomain>
@ 2018-05-29 13:38 ` Stojaczyk, DariuszX
[not found] ` <20180530085700.GC14623@stefanha-x1.localdomain>
0 siblings, 1 reply; 17+ messages in thread
From: Stojaczyk, DariuszX @ 2018-05-29 13:38 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: dev, Maxime Coquelin, Bie, Tiwei, Tetsuya Mukawa,
Thomas Monjalon, yliu, Harris, James R
> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> Sent: Friday, May 25, 2018 12:06 PM
> On Fri, May 18, 2018 at 03:01:05PM +0200, Dariusz Stojaczyk wrote:
> > +struct rte_vhost2_msg {
> > + uint32_t id;
>
> Is this what the vhost-user specification calls the "request type"? I
> suggest following the vhost-user spec terminology.
>
> > + uint32_t flags;
> > + uint32_t size; /**< The following payload size. */
> > + void *payload;
> > + int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
>
> Is it necessary to expose file descriptor passing in the API?
> virtio-vhost-user doesn't have file descriptor passing, so it's best if this
> can be hidden inside rte_vhost2.
So it's another argument for not exposing raw message handling to the user.
If there's some backend-specific vhost-user message in the future that contains an fd, it will need a set of new abstractions to work with virtio-vhost-user anyway.
I guess I'll get back the original custom_msg idea from V1.
>
> > +};
> > +
> > +/** Single memory region. Both physically and virtually contiguous */
> > +struct rte_vhost2_mem_region {
> > + uint64_t guest_phys_addr;
> > + uint64_t guest_user_addr;
> > + uint64_t host_user_addr;
> > + uint64_t size;
> > + void *mmap_addr;
> > + uint64_t mmap_size;
> > + int fd;
>
> virtio-vhost-user doesn't have an fd. Why do API consumers need to
> know about the fd?
They don't. Ack. I'll strip this struct.
>
> > +/**
> > + * Device/queue related callbacks, all optional. Provided callback
> > + * parameters are guaranteed not to be NULL unless explicitly
> specified.
> > + */
>
> This is a good place to mention that all callbacks are asynchronous unless
> specified otherwise. Without that knowledge statements below like "If
> this is completed with a non-zero status" are confusing on a void
> function.
Ack.
>
> > +struct rte_vhost2_tgt_ops {
> > + /**
> > + * New driver connected. If this is completed with a non-zero
> status,
> > + * rte_vhost2 will terminate the connection.
> > + */
> > + void (*device_create)(struct rte_vhost2_dev *vdev);
> > + /**
> > + * Device is ready to operate. vdev data is now initialized. This
> callback
> > + * may be called multiple times as e.g. memory mappings can
> change
> > + * dynamically. All queues are guaranteed to be stopped by now.
> > + */
> > + void (*device_init)(struct rte_vhost2_dev *vdev);
> > + /**
> > + * Features have changed in runtime. This is called at least once
> > +during
>
> s/in/at/
Ack.
>
> > + /**
> > + * Custom vhost-user message handler. This is called for
> > + * backend-specific messages (net/crypto/scsi) that weren't
> recognized
> > + * by the generic message parser. `msg` is available until
> > + * \c rte_vhost2_tgt_cb_complete is called.
> > + */
> > + void (*custom_msg)(struct rte_vhost2_dev *vdev, struct
> > +rte_vhost2_msg *msg);
>
> What happens if rte_vhost2_tgt_cb_complete() is called with a negative
> rc? Does the specific errno value matter?
My current implementation only checks for rc != 0. I'm still working this out.
>
> Where is the API for sending a vhost-user reply message?
I didn't push any. Now that you've pointed out the fds in the public API, I think I'll roll back this custom_msg stuff to V1.
D.
[parent not found: <20180607151227.23660-1-darek.stojaczyk@gmail.com>]
[parent not found: <20180608100852.GA31164@stefanha-x1.localdomain>]
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
[not found] ` <20180608100852.GA31164@stefanha-x1.localdomain>
@ 2018-06-13 9:41 ` Dariusz Stojaczyk
0 siblings, 0 replies; 17+ messages in thread
From: Dariusz Stojaczyk @ 2018-06-13 9:41 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: dev, Maxime Coquelin, Tiwei Bie, Tetsuya Mukawa, Thomas Monjalon,
yliu, James Harris, Tomasz Kulasek, Wodkowski, PawelX
Hi Stefan,
I'm sorry for the late response. My email client filtered out this
mail. I fixed it just now.
pt., 8 cze 2018 o 15:29 Stefan Hajnoczi <stefanha@redhat.com> napisał(a):
>
> On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> > The proposed structure for the new library is described below.
> > * rte_vhost2.h
> > - public API
> > - registering targets with provided ops
> > - unregistering targets
> > - iova_to_vva()
> > * transport.h/.c
> > - implements rte_vhost2.h
> > - allows registering vhost transports, which are opaquely required by the
> > rte_vhost2.h API (target register function requires transport name).
> > - exposes a set of callbacks to be implemented by each transport
> > * vhost_user.c
> > - vhost-user Unix domain socket transport
>
> This file should be called transport_unix.c or similar so it's clear
> that it only handles UNIX domain socket transport aspects, not general
> vhost-user protocol aspects. If the distinction is unclear then
> layering violations are easy to make in the future (especially when
> people other than you contribute to the code).
Ack. Also, the virtio-vhost-user transport still has to be placed in
the drivers/ directory, right? We could move most of this library
somewhere into drivers/, leaving only rte_vhost2.h, transport.h and
transport.c in lib/librte_vhost2. What do you think?
>
> > - does recvmsg()
> > - uses the underlying vhost-user helper lib to process messages, but still
> > handles some transport-specific ones, e.g. SET_MEM_TABLE
> > - calls some of the rte_vhost2.h ops registered with a target
> > * fd_man.h/.c
> > - polls provided descriptors, calls user callbacks on fd events
> > - based on the original rte_vhost version
> > - additionally allows calling user-provided callbacks on the poll thread
>
> Ths is general-purpose functionality that should be a core DPDK utility.
>
> Are you sure you cannot use existing (e)poll functionality in DPDK?
We have to use poll here, and I haven't seen any DPDK APIs for poll,
only rte_epoll_*. Since received vhost-user messages may be handled
asynchronously, we have to temporarily remove an fd from the poll
group while each message is handled. We do that by setting the fd
in the polled fds array to -1; the man page for poll(2) explicitly
suggests this as a way to make poll() ignore an fd's events.
>
> > * vhost.h/.c
> > - a transport-agnostic vhost-user library
> > - calls most of the rte_vhost2.h ops registered with a target
> > - manages virtqueues state
> > - hopefully to be reused by the virtio-vhost-user
> > - exposes a set of callbacks to be implemented by each transport
> > (for e.g. sending message replies)
> >
> > This series includes only vhost-user transport. Virtio-vhost-user
> > is to be done later.
> >
> > The following items are still TBD:
> > * vhost-user slave channel
> > * get/set_config
> > * cfg_call() implementation
> > * IOTLB
> > * NUMA awareness
>
> This is important to think about while the API is still being designed.
>
> Some initial thoughts. NUMA affinity is optimal when:
>
> 1. The vring, indirect descriptor tables, and data buffers are allocated
> on the same NUMA node.
>
> 2. The virtqueue interrupts go to vcpus associated with the same NUMA
> node as the vring.
>
> 3. The guest NUMA node corresponds to the host NUMA node of the backing
> storage device (e.g. NVMe PCI adapter).
>
> This way memory is local to the NVMe PCI adapter on the way down the
> stack when submitting I/O and back up again when completing I/O.
>
> Achieving #1 & #2 is up to the guest drivers.
>
> Achieving #3 is up to virtual machine configuration (QEMU command-line).
>
> The role that DPDK plays in all of this is that each vhost-user
> virtqueue should be polled by a thread that has been placed on the same
> host NUMA node mentioned above for #1, #2, and #3.
>
> Per-virtqueue state should also be allocated on this host NUMA node.
I agree on all points.
> Device backends should be able to query this information so they, too,
> can allocate memory with optimal NUMA affinity.
Of course. Since we offer raw vq pointers, they will be able to use
get_mempolicy(2). No additional APIs required.
>
> > * Live migration
>
> Another important feature to design in from the beginning.
This is being worked on at the moment.
Thanks,
D.
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
[not found] ` <20180607151227.23660-1-darek.stojaczyk@gmail.com>
[not found] ` <20180608100852.GA31164@stefanha-x1.localdomain>
@ 2018-06-25 11:01 ` Tiwei Bie
2018-06-25 12:17 ` Stojaczyk, DariuszX
1 sibling, 1 reply; 17+ messages in thread
From: Tiwei Bie @ 2018-06-25 11:01 UTC (permalink / raw)
To: Dariusz Stojaczyk
Cc: dev, Maxime Coquelin, Tetsuya Mukawa, Stefan Hajnoczi,
Thomas Monjalon, yliu, James Harris, Tomasz Kulasek,
Pawel Wodkowski
On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed.
>
> The problem can be temporarily worked around by start-stopping entire
> device (all queues) on each vhost-user message that could change queue
> state. This would increase general VM boot time and is totally
> unacceptable though.
>
> Fixing rte_vhost properly would require quite a big amount
> of changes which would completely break backward compatibility.
> We're introducing a new rte_vhost2 library intended to smooth the transition.
> It exposes a low-level API for implementing new vhost devices.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> underneath and demanding users could now use rte_vhost2 directly.
>
> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
>
> rte_vhost2 abstracts away most vhost-user/virtio-vhost-user specifics
> and allows developers to implement vhost devices with an ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed at
> runtime, etc.
>
> Compared to the rte_vhost, this lib additionally allows the following:
>
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
> the event handling for asynchronous applications and doesn't
> make anything harder for synchronous ones.
> * this is low-level vhost API. It doesn't have any vhost-net, nvme
> or crypto references. These backend-specific libraries will
> be later refactored to use *this* generic library underneath.
> This implies that the library doesn't do any virtqueue processing,
> it only delivers vring addresses to the user, so he can process
> virtqueues by himself.
> * abstracting away Unix domain sockets (vhost-user) and PCI
> (virtio-vhost-user).
> * APIs for interrupt-driven drivers
> * replacing gpa_to_vva function with an IOMMU-aware variant
> * The API imposes how public functions can be called and how
> internal data can change, so there's only a minimal work required
> to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
>
> The proposed structure for the new library is described below.
> * rte_vhost2.h
> - public API
> - registering targets with provided ops
> - unregistering targets
> - iova_to_vva()
> * transport.h/.c
> - implements rte_vhost2.h
> - allows registering vhost transports, which are opaquely required by the
> rte_vhost2.h API (target register function requires transport name).
> - exposes a set of callbacks to be implemented by each transport
> * vhost_user.c
> - vhost-user Unix domain socket transport
> - does recvmsg()
> - uses the underlying vhost-user helper lib to process messages, but still
> handles some transport-specific ones, e.g. SET_MEM_TABLE
> - calls some of the rte_vhost2.h ops registered with a target
> * fd_man.h/.c
> - polls provided descriptors, calls user callbacks on fd events
> - based on the original rte_vhost version
> - additionally allows calling user-provided callbacks on the poll thread
> * vhost.h/.c
> - a transport-agnostic vhost-user library
> - calls most of the rte_vhost2.h ops registered with a target
> - manages virtqueues state
> - hopefully to be reused by the virtio-vhost-user
> - exposes a set of callbacks to be implemented by each transport
> (for e.g. sending message replies)
>
> This series includes only vhost-user transport. Virtio-vhost-user
> is to be done later.
>
> The following items are still TBD:
> * vhost-user slave channel
> * get/set_config
> * cfg_call() implementation
> * IOTLB
> * NUMA awareness
> * Live migration
> * vhost-net implementation (to port rte_vhost over)
> * vhost-crypto implementation (to port rte_vhost over)
> * vhost-user client mode (connecting to an external socket)
> * various error logging
>
> Dariusz Stojaczyk (7):
> vhost2: add initial implementation
> vhost2: add vhost-user helper layer
> vhost2/vhost: handle queue_stop/device_destroy failure
> vhost2: add asynchronous fd_man
> vhost2: add initial vhost-user transport
> vhost2/vhost: support protocol features
> vhost2/user: implement tgt unregister path
Hi Dariusz,
Thank you for putting efforts in making the DPDK
vhost more generic!
From my understanding, your proposal is that:
1) Introduce rte_vhost2 to provide the APIs which
allow users to implement vhost backends like
SCSI, net, crypto, ..
2) Refactor the existing rte_vhost to use rte_vhost2.
The rte_vhost will still provide below existing
sets of APIs:
1. The APIs which allow users to implement
external vhost backends (these APIs were
designed for SPDK previously)
2. The APIs provided by the net backend
3. The APIs provided by the crypto backend
And above APIs in rte_vhost won't be changed.
Is my above understanding correct? Thanks!
Best regards,
Tiwei Bie
>
> lib/Makefile | 4 +-
> lib/librte_vhost2/Makefile | 25 +
> lib/librte_vhost2/fd_man.c | 288 +++++++++++
> lib/librte_vhost2/fd_man.h | 125 +++++
> lib/librte_vhost2/rte_vhost2.h | 304 +++++++++++
> lib/librte_vhost2/rte_vhost2_version.map | 12 +
> lib/librte_vhost2/transport.c | 74 +++
> lib/librte_vhost2/transport.h | 32 ++
> lib/librte_vhost2/vhost.c | 728 ++++++++++++++++++++++++++
> lib/librte_vhost2/vhost.h | 203 ++++++++
> lib/librte_vhost2/vhost_user.c | 845 +++++++++++++++++++++++++++++++
> 11 files changed, 2638 insertions(+), 2 deletions(-)
> create mode 100644 lib/librte_vhost2/Makefile
> create mode 100644 lib/librte_vhost2/fd_man.c
> create mode 100644 lib/librte_vhost2/fd_man.h
> create mode 100644 lib/librte_vhost2/rte_vhost2.h
> create mode 100644 lib/librte_vhost2/rte_vhost2_version.map
> create mode 100644 lib/librte_vhost2/transport.c
> create mode 100644 lib/librte_vhost2/transport.h
> create mode 100644 lib/librte_vhost2/vhost.c
> create mode 100644 lib/librte_vhost2/vhost.h
> create mode 100644 lib/librte_vhost2/vhost_user.c
>
> --
> 2.11.0
>
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
2018-06-25 11:01 ` Tiwei Bie
@ 2018-06-25 12:17 ` Stojaczyk, DariuszX
2018-06-26 8:22 ` Tiwei Bie
0 siblings, 1 reply; 17+ messages in thread
From: Stojaczyk, DariuszX @ 2018-06-25 12:17 UTC (permalink / raw)
To: Bie, Tiwei, Dariusz Stojaczyk
Cc: dev, Maxime Coquelin, Tetsuya Mukawa, Stefan Hajnoczi,
Thomas Monjalon, yliu, Harris, James R, Kulasek, TomaszX,
Wodkowski, PawelX
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> Sent: Monday, June 25, 2018 1:02 PM
> <snip>
>
> Hi Dariusz,
>
Hi Tiwei,
> Thank you for putting efforts in making the DPDK
> vhost more generic!
>
> From my understanding, your proposal is that:
>
> 1) Introduce rte_vhost2 to provide the APIs which
> allow users to implement vhost backends like
> SCSI, net, crypto, ..
>
That's right.
> 2) Refactor the existing rte_vhost to use rte_vhost2.
> The rte_vhost will still provide below existing
> sets of APIs:
> 1. The APIs which allow users to implement
> external vhost backends (these APIs were
> designed for SPDK previously)
> 2. The APIs provided by the net backend
> 3. The APIs provided by the crypto backend
> And above APIs in rte_vhost won't be changed.
That's correct. rte_vhost would register its own rte_vhost2_tgt_ops
underneath and call the existing vhost_device_ops, e.g. to start the
device once all queues are started.
Regards,
D.
>
> Is my above understanding correct? Thanks!
>
> Best regards,
> Tiwei Bie
>
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
2018-06-25 12:17 ` Stojaczyk, DariuszX
@ 2018-06-26 8:22 ` Tiwei Bie
2018-06-26 8:30 ` Thomas Monjalon
2018-06-26 8:47 ` Stojaczyk, DariuszX
0 siblings, 2 replies; 17+ messages in thread
From: Tiwei Bie @ 2018-06-26 8:22 UTC (permalink / raw)
To: Stojaczyk, DariuszX
Cc: Dariusz Stojaczyk, dev, Maxime Coquelin, Tetsuya Mukawa,
Stefan Hajnoczi, Thomas Monjalon, yliu, Harris, James R, Kulasek,
TomaszX, Wodkowski, PawelX
On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> > Sent: Monday, June 25, 2018 1:02 PM
> > <snip>
> >
> > Hi Dariusz,
> >
>
> Hi Tiwei,
>
> > Thank you for putting efforts in making the DPDK
> > vhost more generic!
> >
> > From my understanding, your proposal is that:
> >
> > 1) Introduce rte_vhost2 to provide the APIs which
> > allow users to implement vhost backends like
> > SCSI, net, crypto, ..
> >
>
> That's right.
>
> > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > The rte_vhost will still provide below existing
> > sets of APIs:
> > 1. The APIs which allow users to implement
> > external vhost backends (these APIs were
> > designed for SPDK previously)
> > 2. The APIs provided by the net backend
> > 3. The APIs provided by the crypto backend
> > And above APIs in rte_vhost won't be changed.
>
> That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops underneath and will call existing vhost_device_ops for e.g. starting the device once all queues are started.
Currently I have below concerns and questions:
- The rte_vhost's problem is still there. Even though
rte_vhost2 is introduced, the net and crypto backends
in rte_vhost won't benefit from the new callbacks.
The existing rte_vhost in DPDK not only provides the APIs for
DPDK applications to implement external backends, but also
provides high-performance net and crypto backend implementations
(maybe more in the future). So it's important that, besides the
DPDK applications which implement their external backends, the
DPDK applications which use the builtin backends will also
benefit from the new callbacks.
So we should have a clear plan on how the legacy callbacks
in rte_vhost will be dealt with in the next step.
Besides, the new library's name is a bit misleading.
It makes the existing rte_vhost library sound like an
obsolete library. But actually the existing rte_vhost
isn't an obsolete library. It will still provide the
net and crypto backends. So if we want to introduce
this new library, we should give it a better name.
- It's possible to solve the rte_vhost problem you met
by refactoring the existing vhost library directly
instead of re-implementing a new vhost library from
scratch and keeping the old one's problems as is.
That way it will solve the problem you met and
also solve the problem for rte_vhost. Why not go
this way? Something like:
Below is the existing callbacks set in rte_vhost.h:
/**
* Device and vring operations.
*/
struct vhost_device_ops {
......
};
It's a legacy implementation, and doesn't really
follow the DPDK API design (e.g. no rte_ prefix).
We can design and implement a new message handling
and a new set of callbacks for rte_vhost to solve
the problem you met without changing the old one.
Something like:
struct rte_vhost_device_ops {
......
}
int
vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
{
......
if (!vdev->is_using_new_device_ops) {
// Call the existing message handler
return vhost_user_msg_handler_legacy(vdev, msg);
}
// Implement the new logic here
......
}
A vhost application is allowed to register either struct
rte_vhost_device_ops or struct vhost_device_ops (which
should be deprecated in the future). The two ops sets
cannot be registered at the same time.
Existing applications could keep using the old ops; if
an application registers struct rte_vhost_device_ops,
the new callbacks and message handler will be used.
Best regards,
Tiwei Bie
> Regards,
> D.
>
> >
> > Is my above understanding correct? Thanks!
> >
> > Best regards,
> > Tiwei Bie
> >
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
2018-06-26 8:22 ` Tiwei Bie
@ 2018-06-26 8:30 ` Thomas Monjalon
2018-06-26 8:47 ` Stojaczyk, DariuszX
1 sibling, 0 replies; 17+ messages in thread
From: Thomas Monjalon @ 2018-06-26 8:30 UTC (permalink / raw)
To: Tiwei Bie
Cc: Stojaczyk, DariuszX, Dariusz Stojaczyk, dev, Maxime Coquelin,
Tetsuya Mukawa, Stefan Hajnoczi, yliu, Harris, James R, Kulasek,
TomaszX, Wodkowski, PawelX
26/06/2018 10:22, Tiwei Bie:
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > From: Tiwei Bie
> > >
> > > Hi Dariusz,
> >
> > Hi Tiwei,
> >
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > >
> > > From my understanding, your proposal is that:
> > >
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > > allow users to implement vhost backends like
> > > SCSI, net, crypto, ..
> >
> > That's right.
> >
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > > The rte_vhost will still provide below existing
> > > sets of APIs:
> > > 1. The APIs which allow users to implement
> > > external vhost backends (these APIs were
> > > designed for SPDK previously)
> > > 2. The APIs provided by the net backend
> > > 3. The APIs provided by the crypto backend
> > > And above APIs in rte_vhost won't be changed.
> >
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops underneath and will call existing vhost_device_ops for e.g. starting the device once all queues are started.
>
> Currently I have below concerns and questions:
>
> - The rte_vhost's problem is still there. Even though
> rte_vhost2 is introduced, the net and crypto backends
> in rte_vhost won't benefit from the new callbacks.
>
> The existing rte_vhost in DPDK not only provides the
> APIs for DPDK applications to implement the external
> backends. But also provides high performance net and
> crypto backends implementation (maybe more in the
> future). So it's important that besides the DPDK
> applications which implement their external backends,
> the DPDK applications which use the builtin backends
> will also benefit from the new callbacks.
>
> So we should have a clear plan on how will the legacy
> callbacks in rte_vhost be dealt with in the next step.
>
> Besides, the new library's name is a bit misleading.
> It makes the existing rte_vhost library sound like an
> obsolete library. But actually the existing rte_vhost
> isn't an obsolete library. It will still provide the
> net and crypto backends. So if we want to introduce
> this new library, we should give it a better name.
>
> - It's possible to solve rte_vhost's problem you met
> by refactoring the existing vhost library directly
> instead of re-implementing a new vhost library from
> scratch and keeping the old one's problem as is.
+1
> In this way, it will solve the problem you met and
> also solve the problem for rte_vhost. Why not go
> this way? Something like:
>
> Below is the existing callbacks set in rte_vhost.h:
>
> /**
> * Device and vring operations.
> */
> struct vhost_device_ops {
> ......
> };
>
> It's a legacy implementation, and doesn't really
> follow the DPDK API design (e.g. no rte_ prefix).
> We can design and implement a new message handling
> and a new set of callbacks for rte_vhost to solve
> the problem you met without changing the old one.
> Something like:
>
> struct rte_vhost_device_ops {
> ......
> }
>
> int
> vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
> {
> ......
>
> if (!vdev->is_using_new_device_ops) {
> // Call the existing message handler
> return vhost_user_msg_handler_legacy(vdev, msg);
> }
>
> // Implement the new logic here
> ......
> }
>
> A vhost application is allowed to register only struct
> rte_vhost_device_ops or struct vhost_device_ops (which
> should be deprecated in the future). The two ops cannot
> be registered at the same time.
>
> The existing applications could use the old ops. And
> if an application registers struct rte_vhost_device_ops,
> the new callbacks and message handler will be used.
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
2018-06-26 8:22 ` Tiwei Bie
2018-06-26 8:30 ` Thomas Monjalon
@ 2018-06-26 8:47 ` Stojaczyk, DariuszX
2018-06-26 9:14 ` Tiwei Bie
1 sibling, 1 reply; 17+ messages in thread
From: Stojaczyk, DariuszX @ 2018-06-26 8:47 UTC (permalink / raw)
To: Bie, Tiwei
Cc: Dariusz Stojaczyk, dev, Maxime Coquelin, Tetsuya Mukawa,
Stefan Hajnoczi, Thomas Monjalon, yliu, Harris, James R, Kulasek,
TomaszX, Wodkowski, PawelX
> -----Original Message-----
> From: Bie, Tiwei
> Sent: Tuesday, June 26, 2018 10:22 AM
> To: Stojaczyk, DariuszX <dariuszx.stojaczyk@intel.com>
> Cc: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>; dev@dpdk.org; Maxime
> Coquelin <maxime.coquelin@redhat.com>; Tetsuya Mukawa
> <mtetsuyah@gmail.com>; Stefan Hajnoczi <stefanha@redhat.com>; Thomas
> Monjalon <thomas@monjalon.net>; yliu@fridaylinux.org; Harris, James R
> <james.r.harris@intel.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>;
> Wodkowski, PawelX <pawelx.wodkowski@intel.com>
> Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
>
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> > > Sent: Monday, June 25, 2018 1:02 PM
> > > <snip>
> > >
> > > Hi Dariusz,
> > >
> >
> > Hi Tiwei,
> >
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > >
> > > From my understanding, your proposal is that:
> > >
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > > allow users to implement vhost backends like
> > > SCSI, net, crypto, ..
> > >
> >
> > That's right.
> >
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > > The rte_vhost will still provide below existing
> > > sets of APIs:
> > > 1. The APIs which allow users to implement
> > > external vhost backends (these APIs were
> > > designed for SPDK previously)
> > > 2. The APIs provided by the net backend
> > > 3. The APIs provided by the crypto backend
> > > And above APIs in rte_vhost won't be changed.
> >
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> underneath and will call existing vhost_device_ops for e.g. starting the device
> once all queues are started.
>
> Currently I have below concerns and questions:
>
> - The rte_vhost's problem is still there. Even though
> rte_vhost2 is introduced, the net and crypto backends
> in rte_vhost won't benefit from the new callbacks.
>
> The existing rte_vhost in DPDK not only provides the
> APIs for DPDK applications to implement the external
> backends. But also provides high performance net and
> crypto backends implementation (maybe more in the
> future). So it's important that besides the DPDK
> applications which implement their external backends,
> the DPDK applications which use the builtin backends
> will also benefit from the new callbacks.
>
> So we should have a clear plan on how will the legacy
> callbacks in rte_vhost be dealt with in the next step.
>
> Besides, the new library's name is a bit misleading.
> It makes the existing rte_vhost library sound like an
> obsolete library. But actually the existing rte_vhost
> isn't an obsolete library. It will still provide the
> net and crypto backends. So if we want to introduce
> this new library, we should give it a better name.
>
> - It's possible to solve rte_vhost's problem you met
> by refactoring the existing vhost library directly
> instead of re-implementing a new vhost library from
> scratch and keeping the old one's problem as is.
>
> In this way, it will solve the problem you met and
> also solve the problem for rte_vhost. Why not go
> this way? Something like:
>
> Below is the existing callbacks set in rte_vhost.h:
>
> /**
> * Device and vring operations.
> */
> struct vhost_device_ops {
> ......
> };
>
> It's a legacy implementation, and doesn't really
> follow the DPDK API design (e.g. no rte_ prefix).
> We can design and implement a new message handling
> and a new set of callbacks for rte_vhost to solve
> the problem you met without changing the old one.
> Something like:
>
> struct rte_vhost_device_ops {
> ......
> }
>
> int
> vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> *msg)
> {
> ......
>
> if (!vdev->is_using_new_device_ops) {
> // Call the existing message handler
> return vhost_user_msg_handler_legacy(vdev, msg);
> }
>
> // Implement the new logic here
> ......
> }
>
> A vhost application is allowed to register only struct
> rte_vhost_device_ops or struct vhost_device_ops (which
> should be deprecated in the future). The two ops cannot
> be registered at the same time.
>
> The existing applications could use the old ops. And
> if an application registers struct rte_vhost_device_ops,
> the new callbacks and message handler will be used.
Please notice that some features like vIOMMU are not even part of the
public rte_vhost API. Only vhost-net benefits from vIOMMU right now.
Separating vhost-net from a generic vhost library (rte_vhost2) would
avoid making such design mistakes in the future. What's the point of
having a single rte_vhost library if some vhost-user features are only
implemented for vhost-net?
>
> Best regards,
> Tiwei Bie
>
>
> > Regards,
> > D.
> >
> > >
> > > Is my above understanding correct? Thanks!
> > >
> > > Best regards,
> > > Tiwei Bie
> > >
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
2018-06-26 8:47 ` Stojaczyk, DariuszX
@ 2018-06-26 9:14 ` Tiwei Bie
2018-06-26 9:38 ` Maxime Coquelin
0 siblings, 1 reply; 17+ messages in thread
From: Tiwei Bie @ 2018-06-26 9:14 UTC (permalink / raw)
To: Stojaczyk, DariuszX
Cc: Dariusz Stojaczyk, dev, Maxime Coquelin, Tetsuya Mukawa,
Stefan Hajnoczi, Thomas Monjalon, yliu, Harris, James R, Kulasek,
TomaszX, Wodkowski, PawelX
On Tue, Jun 26, 2018 at 04:47:33PM +0800, Stojaczyk, DariuszX wrote:
> > -----Original Message-----
> > From: Bie, Tiwei
> > Sent: Tuesday, June 26, 2018 10:22 AM
> > To: Stojaczyk, DariuszX <dariuszx.stojaczyk@intel.com>
> > Cc: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>; dev@dpdk.org; Maxime
> > Coquelin <maxime.coquelin@redhat.com>; Tetsuya Mukawa
> > <mtetsuyah@gmail.com>; Stefan Hajnoczi <stefanha@redhat.com>; Thomas
> > Monjalon <thomas@monjalon.net>; yliu@fridaylinux.org; Harris, James R
> > <james.r.harris@intel.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>;
> > Wodkowski, PawelX <pawelx.wodkowski@intel.com>
> > Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
> >
> > On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> > > > Sent: Monday, June 25, 2018 1:02 PM
> > > > <snip>
> > > >
> > > > Hi Dariusz,
> > > >
> > >
> > > Hi Tiwei,
> > >
> > > > Thank you for putting efforts in making the DPDK
> > > > vhost more generic!
> > > >
> > > > From my understanding, your proposal is that:
> > > >
> > > > 1) Introduce rte_vhost2 to provide the APIs which
> > > > allow users to implement vhost backends like
> > > > SCSI, net, crypto, ..
> > > >
> > >
> > > That's right.
> > >
> > > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > > > The rte_vhost will still provide below existing
> > > > sets of APIs:
> > > > 1. The APIs which allow users to implement
> > > > external vhost backends (these APIs were
> > > > designed for SPDK previously)
> > > > 2. The APIs provided by the net backend
> > > > 3. The APIs provided by the crypto backend
> > > > And above APIs in rte_vhost won't be changed.
> > >
> > > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> > underneath and will call existing vhost_device_ops for e.g. starting the device
> > once all queues are started.
> >
> > Currently I have below concerns and questions:
> >
> > - The rte_vhost's problem is still there. Even though
> > rte_vhost2 is introduced, the net and crypto backends
> > in rte_vhost won't benefit from the new callbacks.
> >
> > The existing rte_vhost in DPDK not only provides the
> > APIs for DPDK applications to implement the external
> > backends. But also provides high performance net and
> > crypto backends implementation (maybe more in the
> > future). So it's important that besides the DPDK
> > applications which implement their external backends,
> > the DPDK applications which use the builtin backends
> > will also benefit from the new callbacks.
> >
> > So we should have a clear plan on how the legacy
> > callbacks in rte_vhost will be dealt with in the next step.
> >
> > Besides, the new library's name is a bit misleading.
> > It makes the existing rte_vhost library sound like an
> > obsolete library. But actually the existing rte_vhost
> > isn't an obsolete library. It will still provide the
> > net and crypto backends. So if we want to introduce
> > this new library, we should give it a better name.
> >
> > - It's possible to solve rte_vhost's problem you met
> > by refactoring the existing vhost library directly
> > instead of re-implementing a new vhost library from
> > scratch and keeping the old one's problem as is.
> >
> > In this way, it will solve the problem you met and
> > also solve the problem for rte_vhost. Why not go
> > this way? Something like:
> >
> > Below is the existing callbacks set in rte_vhost.h:
> >
> > /**
> > * Device and vring operations.
> > */
> > struct vhost_device_ops {
> > ......
> > };
> >
> > It's a legacy implementation, and doesn't really
> > follow the DPDK API design (e.g. no rte_ prefix).
> > We can design and implement a new message handler
> > and a new set of callbacks for rte_vhost to solve
> > the problem you met without changing the old one.
> > Something like:
> >
> > struct rte_vhost_device_ops {
> > ......
> > };
> >
> > int
> > vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> > *msg)
> > {
> > ......
> >
> > if (!vdev->is_using_new_device_ops) {
> > // Call the existing message handler
> > return vhost_user_msg_handler_legacy(vdev, msg);
> > }
> >
> > // Implement the new logic here
> > ......
> > }
> >
> > A vhost application is allowed to register only struct
> > rte_vhost_device_ops or struct vhost_device_ops (which
> > should be deprecated in the future). The two ops cannot
> > be registered at the same time.
> >
> > The existing applications could use the old ops. And
> > if an application registers struct rte_vhost_device_ops,
> > the new callbacks and message handler will be used.
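The dual-ops dispatch sketched in the quoted pseudocode above could look roughly like the following. Every type body, field, and handler name here is an invented stand-in for illustration — this is not the actual rte_vhost API, only a sketch of the "register exactly one of the two ops sets, then branch in the message handler" idea:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the real rte_vhost types. */
struct vhost_user_msg { int id; };

struct vhost_device_ops { int (*new_device)(int vid); };       /* legacy ops */
struct rte_vhost_device_ops { int (*device_ready)(int vid); }; /* proposed ops */

struct vhost_dev {
	int vid;
	int is_using_new_device_ops; /* set when rte_vhost_device_ops is registered */
	const struct vhost_device_ops *ops;         /* legacy callbacks */
	const struct rte_vhost_device_ops *new_ops; /* new callbacks */
};

/* Exactly one of the two ops sets may be registered per device. */
static int
vhost_register_ops(struct vhost_dev *vdev,
		   const struct vhost_device_ops *legacy,
		   const struct rte_vhost_device_ops *ops2)
{
	if ((legacy != NULL) == (ops2 != NULL))
		return -1; /* both or neither given: reject */
	vdev->ops = legacy;
	vdev->new_ops = ops2;
	vdev->is_using_new_device_ops = (ops2 != NULL);
	return 0;
}

static int
vhost_user_msg_handler_legacy(struct vhost_dev *vdev, struct vhost_user_msg *msg)
{
	(void)msg;
	/* existing behavior, unchanged for old applications */
	return vdev->ops->new_device(vdev->vid);
}

static int
vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
{
	if (!vdev->is_using_new_device_ops)
		/* call the existing message handler */
		return vhost_user_msg_handler_legacy(vdev, msg);

	/* the new, per-queue-aware logic would live here */
	return vdev->new_ops->device_ready(vdev->vid);
}
```

An existing application would keep registering only the legacy ops and see no behavior change; registering the new ops flips the device onto the new handler path.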
>
> Please notice that some features like vIOMMU are not even a part of the public rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating vhost-net from a generic vhost library (rte_vhost2) would avoid making such design mistakes in future. What's the point of having a single rte_vhost library, if some vhost-user features are only implemented for vhost-net.
These APIs can be safely added at any time.
And we can also ask developers to add public
APIs if it matters when adding new features
in the future. I don't think it's a big
problem.
Best regards,
Tiwei Bie
>
> >
> > Best regards,
> > Tiwei Bie
> >
> >
> > > Regards,
> > > D.
> > >
> > > >
> > > > Is my above understanding correct? Thanks!
> > > >
> > > > Best regards,
> > > > Tiwei Bie
> > > >
* Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
2018-06-26 9:14 ` Tiwei Bie
@ 2018-06-26 9:38 ` Maxime Coquelin
0 siblings, 0 replies; 17+ messages in thread
From: Maxime Coquelin @ 2018-06-26 9:38 UTC (permalink / raw)
To: Tiwei Bie, Stojaczyk, DariuszX
Cc: Dariusz Stojaczyk, dev, Tetsuya Mukawa, Stefan Hajnoczi,
Thomas Monjalon, yliu, Harris, James R, Kulasek, TomaszX,
Wodkowski, PawelX
On 06/26/2018 11:14 AM, Tiwei Bie wrote:
> On Tue, Jun 26, 2018 at 04:47:33PM +0800, Stojaczyk, DariuszX wrote:
>>> -----Original Message-----
>>> From: Bie, Tiwei
>>> Sent: Tuesday, June 26, 2018 10:22 AM
>>> To: Stojaczyk, DariuszX <dariuszx.stojaczyk@intel.com>
>>> Cc: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>; dev@dpdk.org; Maxime
>>> Coquelin <maxime.coquelin@redhat.com>; Tetsuya Mukawa
>>> <mtetsuyah@gmail.com>; Stefan Hajnoczi <stefanha@redhat.com>; Thomas
>>> Monjalon <thomas@monjalon.net>; yliu@fridaylinux.org; Harris, James R
>>> <james.r.harris@intel.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>;
>>> Wodkowski, PawelX <pawelx.wodkowski@intel.com>
>>> Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
>>>
>>> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
>>>>> -----Original Message-----
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
>>>>> Sent: Monday, June 25, 2018 1:02 PM
>>>>> <snip>
>>>>>
>>>>> Hi Dariusz,
>>>>>
>>>>
>>>> Hi Tiwei,
>>>>
>>>>> Thank you for putting efforts in making the DPDK
>>>>> vhost more generic!
>>>>>
>>>>> From my understanding, your proposal is that:
>>>>>
>>>>> 1) Introduce rte_vhost2 to provide the APIs which
>>>>> allow users to implement vhost backends like
>>>>> SCSI, net, crypto, ..
>>>>>
>>>>
>>>> That's right.
>>>>
>>>>> 2) Refactor the existing rte_vhost to use rte_vhost2.
>>>>> The rte_vhost will still provide below existing
>>>>> sets of APIs:
>>>>> 1. The APIs which allow users to implement
>>>>> external vhost backends (these APIs were
>>>>> designed for SPDK previously)
>>>>> 2. The APIs provided by the net backend
>>>>> 3. The APIs provided by the crypto backend
>>>>> And above APIs in rte_vhost won't be changed.
>>>>
>>>> That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
>>> underneath and will call existing vhost_device_ops for e.g. starting the device
>>> once all queues are started.
>>>
>>> Currently I have below concerns and questions:
>>>
>>> - The rte_vhost's problem is still there. Even though
>>> rte_vhost2 is introduced, the net and crypto backends
>>> in rte_vhost won't benefit from the new callbacks.
>>>
>>> The existing rte_vhost in DPDK not only provides the
>>> APIs for DPDK applications to implement the external
>>> backends. But also provides high performance net and
>>> crypto backends implementation (maybe more in the
>>> future). So it's important that besides the DPDK
>>> applications which implement their external backends,
>>> the DPDK applications which use the builtin backends
>>> will also benefit from the new callbacks.
>>>
>>> So we should have a clear plan on how the legacy
>>> callbacks in rte_vhost will be dealt with in the next step.
>>>
>>> Besides, the new library's name is a bit misleading.
>>> It makes the existing rte_vhost library sound like an
>>> obsolete library. But actually the existing rte_vhost
>>> isn't an obsolete library. It will still provide the
>>> net and crypto backends. So if we want to introduce
>>> this new library, we should give it a better name.
>>>
>>> - It's possible to solve rte_vhost's problem you met
>>> by refactoring the existing vhost library directly
>>> instead of re-implementing a new vhost library from
>>> scratch and keeping the old one's problem as is.
>>>
>>> In this way, it will solve the problem you met and
>>> also solve the problem for rte_vhost. Why not go
>>> this way? Something like:
>>>
>>> Below is the existing callbacks set in rte_vhost.h:
>>>
>>> /**
>>> * Device and vring operations.
>>> */
>>> struct vhost_device_ops {
>>> ......
>>> };
>>>
>>> It's a legacy implementation, and doesn't really
>>> follow the DPDK API design (e.g. no rte_ prefix).
>>> We can design and implement a new message handler
>>> and a new set of callbacks for rte_vhost to solve
>>> the problem you met without changing the old one.
>>> Something like:
>>>
>>> struct rte_vhost_device_ops {
>>> ......
>>> };
>>>
>>> int
>>> vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
>>> *msg)
>>> {
>>> ......
>>>
>>> if (!vdev->is_using_new_device_ops) {
>>> // Call the existing message handler
>>> return vhost_user_msg_handler_legacy(vdev, msg);
>>> }
>>>
>>> // Implement the new logic here
>>> ......
>>> }
>>>
>>> A vhost application is allowed to register only struct
>>> rte_vhost_device_ops or struct vhost_device_ops (which
>>> should be deprecated in the future). The two ops cannot
>>> be registered at the same time.
>>>
>>> The existing applications could use the old ops. And
>>> if an application registers struct rte_vhost_device_ops,
>>> the new callbacks and message handler will be used.
>>
>> Please notice that some features like vIOMMU are not even a part of the public rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating vhost-net from a generic vhost library (rte_vhost2) would avoid making such design mistakes in future. What's the point of having a single rte_vhost library, if some vhost-user features are only implemented for vhost-net.
>
> These APIs can be safely added at any time.
> And we can also ask developers to add public
> APIs if it matters when adding new features
> in the future. I don't think it's a big
> problem.
+1, I don't think it is a problem.
It is better to have it internal only at the beginning than having to
break the API.
Thanks,
Maxime
> Best regards,
> Tiwei Bie
>
>>
>>>
>>> Best regards,
>>> Tiwei Bie
>>>
>>>
>>>> Regards,
>>>> D.
>>>>
>>>>>
>>>>> Is my above understanding correct? Thanks!
>>>>>
>>>>> Best regards,
>>>>> Tiwei Bie
>>>>>
end of thread, other threads:[~2018-06-26 9:38 UTC | newest]
Thread overview: 17+ messages
2018-05-10 13:22 [dpdk-dev] [RFC] vhost: new rte_vhost API proposal Dariusz Stojaczyk
[not found] ` <20180510163643.GD9308@stefanha-x1.localdomain>
2018-05-11 5:55 ` Stojaczyk, DariuszX
[not found] ` <20180511100531.GA19894@stefanha-x1.localdomain>
2018-05-18 7:51 ` Stojaczyk, DariuszX
2018-05-18 13:01 ` [dpdk-dev] [RFC v2] " Dariusz Stojaczyk
2018-05-18 13:50 ` Maxime Coquelin
2018-05-20 7:07 ` Yuanhan Liu
2018-05-22 10:19 ` Stojaczyk, DariuszX
[not found] ` <20180525100550.GD14757@stefanha-x1.localdomain>
2018-05-29 13:38 ` Stojaczyk, DariuszX
[not found] ` <20180530085700.GC14623@stefanha-x1.localdomain>
2018-05-30 12:24 ` Stojaczyk, DariuszX
[not found] ` <20180607151227.23660-1-darek.stojaczyk@gmail.com>
[not found] ` <20180608100852.GA31164@stefanha-x1.localdomain>
2018-06-13 9:41 ` [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal Dariusz Stojaczyk
2018-06-25 11:01 ` Tiwei Bie
2018-06-25 12:17 ` Stojaczyk, DariuszX
2018-06-26 8:22 ` Tiwei Bie
2018-06-26 8:30 ` Thomas Monjalon
2018-06-26 8:47 ` Stojaczyk, DariuszX
2018-06-26 9:14 ` Tiwei Bie
2018-06-26 9:38 ` Maxime Coquelin