From: Maxime Coquelin
To: "Michael S. Tsirkin"
Cc: yuanhan.liu@linux.intel.com, huawei.xie@intel.com, dev@dpdk.org,
 vkaplans@redhat.com, stephen@networkplumber.org
Subject: Re: [dpdk-dev] [PATCH v3] vhost: Add indirect descriptors support to the TX path
Date: Fri, 23 Sep 2016 20:16:09 +0200
Message-ID: <425573ad-216f-54e7-f4ee-998a4f87e189@redhat.com>
In-Reply-To: <20160923210259-mutt-send-email-mst@kernel.org>
References: <1474619303-16709-1-git-send-email-maxime.coquelin@redhat.com>
 <20160923184310-mutt-send-email-mst@kernel.org>
 <20160923210259-mutt-send-email-mst@kernel.org>

On 09/23/2016 08:06 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 23, 2016 at 08:02:27PM +0200, Maxime Coquelin wrote:
>>
>>
>> On 09/23/2016 05:49 PM, Michael S. Tsirkin wrote:
>>> On Fri, Sep 23, 2016 at 10:28:23AM +0200, Maxime Coquelin wrote:
>>>> Indirect descriptors are usually supported by virtio-net devices,
>>>> allowing to dispatch a larger number of requests.
>>>>
>>>> When the virtio device sends a packet using indirect descriptors,
>>>> only one slot is used in the ring, even for large packets.
>>>>
>>>> The main effect is to improve the 0% packet loss benchmark.
>>>> A PVP benchmark using Moongen (64 bytes) on the TE, and testpmd
>>>> (fwd io for host, macswap for VM) on DUT shows a +50% gain for
>>>> zero loss.
>>>>
>>>> On the downside, micro-benchmark using testpmd txonly in VM and
>>>> rxonly on host shows a loss between 1 and 4%. But depending on
>>>> the needs, feature can be disabled at VM boot time by passing
>>>> indirect_desc=off argument to vhost-user device in Qemu.
>>>
>>> Even better, change guest pmd to only use indirect
>>> descriptors when this makes sense (e.g. sufficiently
>>> large packets).
>> With the micro-benchmark, the degradation is quite constant whatever
>> the packet size.
>>
>> For PVP, I could not test with larger packets than 64 bytes, as I don't
>> have a 40G interface,
>
> Don't 64 byte packets fit in a single slot anyway?
No, indirect is used. I didn't check in detail, but I think this is
because there is no headroom reserved in the mbuf.

This is the condition to meet to fit in a single slot:
	/* optimize ring usage */
	if (vtpci_with_feature(hw, VIRTIO_F_ANY_LAYOUT) &&
	    rte_mbuf_refcnt_read(txm) == 1 &&
	    RTE_MBUF_DIRECT(txm) &&
	    txm->nb_segs == 1 &&
	    rte_pktmbuf_headroom(txm) >= hdr_size &&
	    rte_is_aligned(rte_pktmbuf_mtod(txm, char *),
			   __alignof__(struct virtio_net_hdr_mrg_rxbuf)))
		can_push = 1;
	else if (vtpci_with_feature(hw, VIRTIO_RING_F_INDIRECT_DESC) &&
		 txm->nb_segs < VIRTIO_MAX_TX_INDIRECT)
		use_indirect = 1;

I will check in more detail next week.
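To make the discussion easier to follow, the check above can be restated
as a standalone predicate over an mbuf. The sketch below is illustrative
only and is not part of the virtio PMD or of this patch: since
vtpci_with_feature() and the hw handle are internal to the driver, the
negotiated feature bits, the virtio-net header size/alignment and the
PMD's VIRTIO_MAX_TX_INDIRECT limit are taken as plain parameters, and
the helper name is made up.

	#include <stdbool.h>
	#include <stdint.h>
	#include <rte_common.h>
	#include <rte_mbuf.h>

	enum tx_path { TX_PUSH, TX_INDIRECT, TX_CHAINED };

	/* Illustrative helper mirroring the quoted decision: which TX layout
	 * would be picked for a given mbuf, assuming 'any_layout' and
	 * 'indirect' reflect the negotiated VIRTIO_F_ANY_LAYOUT and
	 * VIRTIO_RING_F_INDIRECT_DESC bits.
	 */
	static enum tx_path
	classify_tx_mbuf(struct rte_mbuf *txm, bool any_layout, bool indirect,
			 size_t hdr_size, size_t hdr_align,
			 uint16_t max_indirect_segs)
	{
		if (any_layout &&
		    rte_mbuf_refcnt_read(txm) == 1 &&
		    RTE_MBUF_DIRECT(txm) &&
		    txm->nb_segs == 1 &&
		    rte_pktmbuf_headroom(txm) >= hdr_size &&
		    rte_is_aligned(rte_pktmbuf_mtod(txm, char *), hdr_align))
			return TX_PUSH;     /* net header prepended in headroom: one slot */

		if (indirect && txm->nb_segs < max_indirect_segs)
			return TX_INDIRECT; /* one slot pointing to an indirect table */

		return TX_CHAINED;          /* header + segments as a direct chain */
	}

If the txonly mbufs really have no headroom in front of the packet data,
the first branch can never be taken, which would explain why even
64-byte packets end up on the indirect path.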
> Why would there be an effect with that?
>
>> and line rate with 10G is reached rapidly.
> Right but focus on packet loss. you can have that at any rate.
>>
>>> I would be very interested to know when does it make
>>> sense.
>>>
>>> The feature is there, it's up to guest whether to
>>> use it.
>> Do you mean the PMD should detect dynamically whether using indirect,
>> or having an option at device init time to enable or not the feature?
> guest PMD should not use indirect where it slows things down.
>>>
>>>
>>>> Signed-off-by: Maxime Coquelin
>>>> ---
>>>> Changes since v2:
>>>> =================
>>>>  - Revert back to not checking feature flag to be aligned with
>>>>    kernel implementation
>>>>  - Ensure we don't have nested indirect descriptors
>>>>  - Ensure the indirect desc address is valid, to protect against
>>>>    malicious guests
>>>>
>>>> Changes since RFC:
>>>> =================
>>>>  - Enrich commit message with figures
>>>>  - Rebased on top of dpdk-next-virtio's master
>>>>  - Add feature check to ensure we don't receive an indirect desc
>>>>    if not supported by the virtio driver
>>>>
>>>>  lib/librte_vhost/vhost.c      |  3 ++-
>>>>  lib/librte_vhost/virtio_net.c | 41 +++++++++++++++++++++++++++++++----------
>>>>  2 files changed, 33 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
>>>> index 46095c3..30bb0ce 100644
>>>> --- a/lib/librte_vhost/vhost.c
>>>> +++ b/lib/librte_vhost/vhost.c
>>>> @@ -65,7 +65,8 @@
>>>>  				(1ULL << VIRTIO_NET_F_CSUM) | \
>>>>  				(1ULL << VIRTIO_NET_F_GUEST_CSUM) | \
>>>>  				(1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
>>>> -				(1ULL << VIRTIO_NET_F_GUEST_TSO6))
>>>> +				(1ULL << VIRTIO_NET_F_GUEST_TSO6) | \
>>>> +				(1ULL << VIRTIO_RING_F_INDIRECT_DESC))
>>>>
>>>>  uint64_t VHOST_FEATURES = VHOST_SUPPORTED_FEATURES;
>>>>
>>>> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
>>>> index 8a151af..2e0a587 100644
>>>> --- a/lib/librte_vhost/virtio_net.c
>>>> +++ b/lib/librte_vhost/virtio_net.c
>>>> @@ -679,8 +679,8 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
>>>>  }
>>>>
>>>>  static inline int __attribute__((always_inline))
>>>> -copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>>> -		  struct rte_mbuf *m, uint16_t desc_idx,
>>>> +copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
>>>> +		  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
>>>>  		  struct rte_mempool *mbuf_pool)
>>>>  {
>>>>  	struct vring_desc *desc;
>>>> @@ -693,8 +693,9 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>>>  	/* A counter to avoid desc dead loop chain */
>>>>  	uint32_t nr_desc = 1;
>>>>
>>>> -	desc = &vq->desc[desc_idx];
>>>> -	if (unlikely(desc->len < dev->vhost_hlen))
>>>> +	desc = &descs[desc_idx];
>>>> +	if (unlikely((desc->len < dev->vhost_hlen)) ||
>>>> +			(desc->flags & VRING_DESC_F_INDIRECT))
>>>>  		return -1;
>>>>
>>>>  	desc_addr = gpa_to_vva(dev, desc->addr);
>>>> @@ -711,7 +712,9 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>>>  	 */
>>>>  	if (likely((desc->len == dev->vhost_hlen) &&
>>>>  		   (desc->flags & VRING_DESC_F_NEXT) != 0)) {
>>>> -		desc = &vq->desc[desc->next];
>>>> +		desc = &descs[desc->next];
>>>> +		if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
>>>> +			return -1;
>>>>
>>>>  		desc_addr = gpa_to_vva(dev, desc->addr);
>>>>  		if (unlikely(!desc_addr))
>>>
>>>
>>> Just to make sure, does this still allow a chain of
>>> direct descriptors ending with an indirect one?
>>> This is legal as per spec.
>>>
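To picture the case being asked about: below is a hypothetical two-entry
chain whose first descriptor is direct and whose last descriptor carries
VRING_DESC_F_INDIRECT, pointing at a separate table. Addresses and
lengths are invented for illustration; only the vring structure and flag
names come from the standard headers used by the code above.

	#include <linux/virtio_ring.h>

	/* Indirect table living in guest memory (invented guest-physical
	 * addresses), holding the actual data buffers. */
	static struct vring_desc indir_table[2] = {
		{ .addr = 0x20000, .len = 1024, .flags = VRING_DESC_F_NEXT, .next = 1 },
		{ .addr = 0x21000, .len = 512,  .flags = 0,                 .next = 0 },
	};

	/* Ring entries: a direct descriptor (e.g. the virtio-net header)
	 * chained to a final descriptor that is indirect, i.e. whose buffer
	 * is the table above and whose len is the table size in bytes. */
	static struct vring_desc ring[2] = {
		{ .addr = 0x10000, .len = 12,
		  .flags = VRING_DESC_F_NEXT, .next = 1 },
		{ .addr = 0x30000 /* gpa of indir_table */,
		  .len = sizeof(indir_table),
		  .flags = VRING_DESC_F_INDIRECT, .next = 0 },
	};

With the checks this patch adds, such a chain looks like it would be
rejected: the walk returns -1 as soon as it reaches a descriptor with
VRING_DESC_F_INDIRECT set, which seems to be what the question is
probing.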
>>>> @@ -747,10 +750,12 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>>>  		if ((desc->flags & VRING_DESC_F_NEXT) == 0)
>>>>  			break;
>>>>
>>>> -		if (unlikely(desc->next >= vq->size ||
>>>> -				++nr_desc > vq->size))
>>>> +		if (unlikely(desc->next >= max_desc ||
>>>> +				++nr_desc > max_desc))
>>>> +			return -1;
>>>> +		desc = &descs[desc->next];
>>>> +		if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
>>>>  			return -1;
>>>> -		desc = &vq->desc[desc->next];
>>>>
>>>>  		desc_addr = gpa_to_vva(dev, desc->addr);
>>>>  		if (unlikely(!desc_addr))
>>>> @@ -878,19 +883,35 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>>>>  	/* Prefetch descriptor index. */
>>>>  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
>>>>  	for (i = 0; i < count; i++) {
>>>> +		struct vring_desc *desc;
>>>> +		uint16_t sz, idx;
>>>>  		int err;
>>>>
>>>>  		if (likely(i + 1 < count))
>>>>  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
>>>>
>>>> +		if (vq->desc[desc_indexes[i]].flags & VRING_DESC_F_INDIRECT) {
>>>> +			desc = (struct vring_desc *)gpa_to_vva(dev,
>>>> +					vq->desc[desc_indexes[i]].addr);
>>>> +			if (unlikely(!desc))
>>>> +				break;
>>>> +
>>>> +			rte_prefetch0(desc);
>>>> +			sz = vq->desc[desc_indexes[i]].len / sizeof(*desc);
>>>> +			idx = 0;
>>>> +		} else {
>>>> +			desc = vq->desc;
>>>> +			sz = vq->size;
>>>> +			idx = desc_indexes[i];
>>>> +		}
>>>> +
>>>>  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
>>>>  		if (unlikely(pkts[i] == NULL)) {
>>>>  			RTE_LOG(ERR, VHOST_DATA,
>>>>  				"Failed to allocate memory for mbuf.\n");
>>>>  			break;
>>>>  		}
>>>> -		err = copy_desc_to_mbuf(dev, vq, pkts[i], desc_indexes[i],
>>>> -					mbuf_pool);
>>>> +		err = copy_desc_to_mbuf(dev, desc, sz, pkts[i], idx, mbuf_pool);
>>>>  		if (unlikely(err)) {
>>>>  			rte_pktmbuf_free(pkts[i]);
>>>>  			break;
>>>> --
>>>> 2.7.4
>>>
>>> Something that I'm missing here: it's legal for guest
>>> to add indirect descriptors for RX.
>>> I don't see the handling of RX here though.
>>> I think it's required for spec compliance.
>>>
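As for the RX side, the resolution step added to the dequeue path above
could in principle be mirrored on the enqueue path. The following is
only a sketch of that idea, not code from the patch: the helper name is
invented, and it reuses the vhost library's internal types and
gpa_to_vva() exactly as the quoted dequeue hunk does, so it is a
fragment meant to sit next to that code rather than a standalone
program.

	/* Sketch: resolve a possibly-indirect head descriptor before the
	 * enqueue (RX) copy loop, mirroring the dequeue-side hunk above.
	 * 'dev', 'vq', struct vring_desc and gpa_to_vva() come from the
	 * vhost library; the helper itself is hypothetical. */
	static int
	resolve_desc_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
			   uint16_t head_idx, struct vring_desc **descs,
			   uint16_t *size, uint16_t *idx)
	{
		if (vq->desc[head_idx].flags & VRING_DESC_F_INDIRECT) {
			*descs = (struct vring_desc *)gpa_to_vva(dev,
					vq->desc[head_idx].addr);
			if (unlikely(!*descs))
				return -1;

			*size = vq->desc[head_idx].len / sizeof(struct vring_desc);
			*idx = 0;	/* chain starts at the top of the indirect table */
		} else {
			*descs = vq->desc;
			*size = vq->size;
			*idx = head_idx;
		}

		return 0;
	}

Whether the enqueue path actually needs this for spec compliance is
exactly the open question above; the sketch only shows that the same
resolution logic could be factored out and shared between the two
directions.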