DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: yuanhan.liu@linux.intel.com, huawei.xie@intel.com, dev@dpdk.org,
	vkaplans@redhat.com, stephen@networkplumber.org
Subject: Re: [dpdk-dev] [PATCH v3] vhost: Add indirect descriptors support to the TX path
Date: Fri, 23 Sep 2016 21:22:23 +0300	[thread overview]
Message-ID: <20160923212116-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <425573ad-216f-54e7-f4ee-998a4f87e189@redhat.com>

On Fri, Sep 23, 2016 at 08:16:09PM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/23/2016 08:06 PM, Michael S. Tsirkin wrote:
> > On Fri, Sep 23, 2016 at 08:02:27PM +0200, Maxime Coquelin wrote:
> > > 
> > > 
> > > On 09/23/2016 05:49 PM, Michael S. Tsirkin wrote:
> > > > On Fri, Sep 23, 2016 at 10:28:23AM +0200, Maxime Coquelin wrote:
> > > > > Indirect descriptors are usually supported by virtio-net devices,
> > > > > allowing to dispatch a larger number of requests.
> > > > > 
> > > > > When the virtio device sends a packet using indirect descriptors,
> > > > > only one slot is used in the ring, even for large packets.
> > > > > 
> > > > > The main effect is to improve the 0% packet loss benchmark.
> > > > > A PVP benchmark using Moongen (64 bytes) on the TE, and testpmd
> > > > > (fwd io for host, macswap for VM) on DUT shows a +50% gain for
> > > > > zero loss.
> > > > > 
> > > > > On the downside, micro-benchmark using testpmd txonly in VM and
> > > > > rxonly on host shows a loss between 1 and 4%.i But depending on
> > > > > the needs, feature can be disabled at VM boot time by passing
> > > > > indirect_desc=off argument to vhost-user device in Qemu.
> > > > 
> > > > Even better, change guest pmd to only use indirect
> > > > descriptors when this makes sense (e.g. sufficiently
> > > > large packets).
> > > With the micro-benchmark, the degradation is quite constant whatever
> > > the packet size.
> > > 
> > > For PVP, I could not test with larger packets than 64 bytes, as I don't
> > > have a 40G interface,
> > 
> > Don't 64 byte packets fit in a single slot anyway?
> No, indirect is used. I didn't checked in details, but I think this is
> because there is no headroom reserved in the mbuf.
> 
> This is the condition to meet to fit in a single slot:
> /* optimize ring usage */
> if (vtpci_with_feature(hw, VIRTIO_F_ANY_LAYOUT) &&
> 	rte_mbuf_refcnt_read(txm) == 1 &&
> 	RTE_MBUF_DIRECT(txm) &&
> 	txm->nb_segs == 1 &&
> 	rte_pktmbuf_headroom(txm) >= hdr_size &&
> 	rte_is_aligned(rte_pktmbuf_mtod(txm, char *),
> 		__alignof__(struct virtio_net_hdr_mrg_rxbuf)))
>     can_push = 1;
> else if (vtpci_with_feature(hw, VIRTIO_RING_F_INDIRECT_DESC) &&
> 	txm->nb_segs < VIRTIO_MAX_TX_INDIRECT)
>     use_indirect = 1;
> 
> I will check more in details next week.

Two thoughts then
1. so can some headroom be reserved?
2. how about using indirect with 3 s/g entries,
   but direct with 2 and down?


> > Why would there be an effect with that?
> > 
> > > and line rate with 10G is reached rapidly.
> > 
> > Right but focus on packet loss. you can have that at any rate.
> > 
> > > 
> > > > I would be very interested to know when does it make
> > > > sense.
> > > > 
> > > > The feature is there, it's up to guest whether to
> > > > use it.
> > > Do you mean the PMD should detect dynamically whether using indirect,
> > > or having an option at device init time to enable or not the feature?
> > 
> > guest PMD should not use indirect where it slows things down.
> > 
> > > > 
> > > > 
> > > > > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > ---
> > > > > Changes since v2:
> > > > > =================
> > > > >  - Revert back to not checking feature flag to be aligned with
> > > > > kernel implementation
> > > > >  - Ensure we don't have nested indirect descriptors
> > > > >  - Ensure the indirect desc address is valid, to protect against
> > > > > malicious guests
> > > > > 
> > > > > Changes since RFC:
> > > > > =================
> > > > >  - Enrich commit message with figures
> > > > >  - Rebased on top of dpdk-next-virtio's master
> > > > >  - Add feature check to ensure we don't receive an indirect desc
> > > > > if not supported by the virtio driver
> > > > > 
> > > > >  lib/librte_vhost/vhost.c      |  3 ++-
> > > > >  lib/librte_vhost/virtio_net.c | 41 +++++++++++++++++++++++++++++++----------
> > > > >  2 files changed, 33 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > > > > index 46095c3..30bb0ce 100644
> > > > > --- a/lib/librte_vhost/vhost.c
> > > > > +++ b/lib/librte_vhost/vhost.c
> > > > > @@ -65,7 +65,8 @@
> > > > >  				(1ULL << VIRTIO_NET_F_CSUM)    | \
> > > > >  				(1ULL << VIRTIO_NET_F_GUEST_CSUM) | \
> > > > >  				(1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
> > > > > -				(1ULL << VIRTIO_NET_F_GUEST_TSO6))
> > > > > +				(1ULL << VIRTIO_NET_F_GUEST_TSO6) | \
> > > > > +				(1ULL << VIRTIO_RING_F_INDIRECT_DESC))
> > > > > 
> > > > >  uint64_t VHOST_FEATURES = VHOST_SUPPORTED_FEATURES;
> > > > > 
> > > > > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > > > > index 8a151af..2e0a587 100644
> > > > > --- a/lib/librte_vhost/virtio_net.c
> > > > > +++ b/lib/librte_vhost/virtio_net.c
> > > > > @@ -679,8 +679,8 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
> > > > >  }
> > > > > 
> > > > >  static inline int __attribute__((always_inline))
> > > > > -copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > > -		  struct rte_mbuf *m, uint16_t desc_idx,
> > > > > +copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > > +		  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > >  		  struct rte_mempool *mbuf_pool)
> > > > >  {
> > > > >  	struct vring_desc *desc;
> > > > > @@ -693,8 +693,9 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > >  	/* A counter to avoid desc dead loop chain */
> > > > >  	uint32_t nr_desc = 1;
> > > > > 
> > > > > -	desc = &vq->desc[desc_idx];
> > > > > -	if (unlikely(desc->len < dev->vhost_hlen))
> > > > > +	desc = &descs[desc_idx];
> > > > > +	if (unlikely((desc->len < dev->vhost_hlen)) ||
> > > > > +			(desc->flags & VRING_DESC_F_INDIRECT))
> > > > >  		return -1;
> > > > > 
> > > > >  	desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > @@ -711,7 +712,9 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > >  	 */
> > > > >  	if (likely((desc->len == dev->vhost_hlen) &&
> > > > >  		   (desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > -		desc = &vq->desc[desc->next];
> > > > > +		desc = &descs[desc->next];
> > > > > +		if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > +			return -1;
> > > > > 
> > > > >  		desc_addr = gpa_to_vva(dev, desc->addr);
> > > > >  		if (unlikely(!desc_addr))
> > > > 
> > > > 
> > > > Just to make sure, does this still allow a chain of
> > > > direct descriptors ending with an indirect one?
> > > > This is legal as per spec.
> > > > 
> > > > > @@ -747,10 +750,12 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > >  			if ((desc->flags & VRING_DESC_F_NEXT) == 0)
> > > > >  				break;
> > > > > 
> > > > > -			if (unlikely(desc->next >= vq->size ||
> > > > > -				     ++nr_desc > vq->size))
> > > > > +			if (unlikely(desc->next >= max_desc ||
> > > > > +				     ++nr_desc > max_desc))
> > > > > +				return -1;
> > > > > +			desc = &descs[desc->next];
> > > > > +			if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > >  				return -1;
> > > > > -			desc = &vq->desc[desc->next];
> > > > > 
> > > > >  			desc_addr = gpa_to_vva(dev, desc->addr);
> > > > >  			if (unlikely(!desc_addr))
> > > > > @@ -878,19 +883,35 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> > > > >  	/* Prefetch descriptor index. */
> > > > >  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> > > > >  	for (i = 0; i < count; i++) {
> > > > > +		struct vring_desc *desc;
> > > > > +		uint16_t sz, idx;
> > > > >  		int err;
> > > > > 
> > > > >  		if (likely(i + 1 < count))
> > > > >  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
> > > > > 
> > > > > +		if (vq->desc[desc_indexes[i]].flags & VRING_DESC_F_INDIRECT) {
> > > > > +			desc = (struct vring_desc *)gpa_to_vva(dev,
> > > > > +					vq->desc[desc_indexes[i]].addr);
> > > > > +			if (unlikely(!desc))
> > > > > +				break;
> > > > > +
> > > > > +			rte_prefetch0(desc);
> > > > > +			sz = vq->desc[desc_indexes[i]].len / sizeof(*desc);
> > > > > +			idx = 0;
> > > > > +		} else {
> > > > > +			desc = vq->desc;
> > > > > +			sz = vq->size;
> > > > > +			idx = desc_indexes[i];
> > > > > +		}
> > > > > +
> > > > >  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
> > > > >  		if (unlikely(pkts[i] == NULL)) {
> > > > >  			RTE_LOG(ERR, VHOST_DATA,
> > > > >  				"Failed to allocate memory for mbuf.\n");
> > > > >  			break;
> > > > >  		}
> > > > > -		err = copy_desc_to_mbuf(dev, vq, pkts[i], desc_indexes[i],
> > > > > -					mbuf_pool);
> > > > > +		err = copy_desc_to_mbuf(dev, desc, sz, pkts[i], idx, mbuf_pool);
> > > > >  		if (unlikely(err)) {
> > > > >  			rte_pktmbuf_free(pkts[i]);
> > > > >  			break;
> > > > > --
> > > > > 2.7.4
> > > > 
> > > > Something that I'm missing here: it's legal for guest
> > > > to add indirect descriptors for RX.
> > > > I don't see the handling of RX here though.
> > > > I think it's required for spec compliance.
> > > > 

  reply	other threads:[~2016-09-23 18:22 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-23  8:28 Maxime Coquelin
2016-09-23 15:49 ` Michael S. Tsirkin
2016-09-23 18:02   ` Maxime Coquelin
2016-09-23 18:06     ` Michael S. Tsirkin
2016-09-23 18:16       ` Maxime Coquelin
2016-09-23 18:22         ` Michael S. Tsirkin [this message]
2016-09-23 20:24           ` Stephen Hemminger
2016-09-26  3:03             ` Yuanhan Liu
2016-09-26 12:25               ` Michael S. Tsirkin
2016-09-26 13:04                 ` Yuanhan Liu
2016-09-27  4:15 ` Yuanhan Liu
2016-09-27  7:25   ` Maxime Coquelin
2016-09-27  8:42 ` [dpdk-dev] [PATCH v4] " Maxime Coquelin
2016-09-27 12:18   ` Yuanhan Liu
2016-10-14  7:24   ` Wang, Zhihong
2016-10-14  7:34     ` Wang, Zhihong
2016-10-14 15:50     ` Maxime Coquelin
2016-10-17 11:23       ` Maxime Coquelin
2016-10-17 13:21         ` Yuanhan Liu
2016-10-17 14:14           ` Maxime Coquelin
2016-10-27  9:00             ` Wang, Zhihong
2016-10-27  9:10               ` Maxime Coquelin
2016-10-27  9:55                 ` Maxime Coquelin
2016-10-27 10:19                   ` Wang, Zhihong
2016-10-28  7:32                     ` Pierre Pfister (ppfister)
2016-10-28  7:58                       ` Maxime Coquelin
2016-11-01  8:15                         ` Yuanhan Liu
2016-11-01  9:39                           ` Thomas Monjalon
2016-11-02  2:44                             ` Yuanhan Liu
2016-10-27 10:33                 ` Yuanhan Liu
2016-10-27 10:35                   ` Maxime Coquelin
2016-10-27 10:46                     ` Yuanhan Liu
2016-10-28  0:49                       ` Wang, Zhihong
2016-10-28  7:42                         ` Maxime Coquelin
2016-10-31 10:01                           ` Wang, Zhihong
2016-11-02 10:51                             ` Maxime Coquelin
2016-11-03  8:11                               ` Maxime Coquelin
2016-11-04  6:18                                 ` Xu, Qian Q
2016-11-04  7:41                                   ` Maxime Coquelin
2016-11-04  7:20                                 ` Wang, Zhihong
2016-11-04  7:57                                   ` Maxime Coquelin
2016-11-04  7:59                                     ` Maxime Coquelin
2016-11-04 10:43                                       ` Wang, Zhihong
2016-11-04 11:22                                         ` Maxime Coquelin
2016-11-04 11:36                                           ` Yuanhan Liu
2016-11-04 11:39                                             ` Maxime Coquelin
2016-11-04 12:30                                           ` Wang, Zhihong
2016-11-04 12:54                                             ` Maxime Coquelin
2016-11-04 13:09                                               ` Wang, Zhihong
2016-11-08 10:51                                                 ` Wang, Zhihong
2016-10-27 10:53                   ` Maxime Coquelin
2016-10-28  6:05                     ` Xu, Qian Q

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160923212116-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=dev@dpdk.org \
    --cc=huawei.xie@intel.com \
    --cc=maxime.coquelin@redhat.com \
    --cc=stephen@networkplumber.org \
    --cc=vkaplans@redhat.com \
    --cc=yuanhan.liu@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).