From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <maxime.coquelin@redhat.com>
Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28])
 by dpdk.org (Postfix) with ESMTP id E4443239;
 Wed,  6 Dec 2017 19:02:05 +0100 (CET)
Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com
 [10.5.11.15])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id 3A1A861D1A;
 Wed,  6 Dec 2017 18:02:05 +0000 (UTC)
Received: from [10.36.112.28] (ovpn-112-28.ams2.redhat.com [10.36.112.28])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id 629A06269B;
 Wed,  6 Dec 2017 18:01:55 +0000 (UTC)
To: Victor Kaplansky <vkaplans@redhat.com>, dev@dpdk.org,
 yliu@fridaylinux.org, tiwei.bie@intel.com, jianfeng.tan@intel.com
Cc: stable@dpdk.org, jfreiman@redhat.com, Victor Kaplansky <victork@redhat.com>
References: <20171206152731.2242-1-vkaplans@redhat.com>
From: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-ID: <268e153e-041a-42f8-0ca6-af4feee729e1@redhat.com>
Date: Wed, 6 Dec 2017 19:01:51 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.4.0
MIME-Version: 1.0
In-Reply-To: <20171206152731.2242-1-vkaplans@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
 (mx1.redhat.com [10.5.110.39]); Wed, 06 Dec 2017 18:02:05 +0000 (UTC)
Subject: Re: [dpdk-stable] [RFC PATCH v2] vhost_user: protect active rings
 from async ring changes
X-BeenThere: stable@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches for DPDK stable branches <stable.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/stable>,
 <mailto:stable-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/stable/>
List-Post: <mailto:stable@dpdk.org>
List-Help: <mailto:stable-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/stable>,
 <mailto:stable-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 06 Dec 2017 18:02:06 -0000


On 12/06/2017 04:28 PM, Victor Kaplansky wrote:
> 
> v2:
>     o Fixed checkpatch complains
>     o Added Signed-off-by
>     o Refined placement of guard to exclude IOMMU messages
>     o TODO: performance degradation measurement
> 
> See https://bugzilla.redhat.com/show_bug.cgi?id=1450680
> 
> 
> When performing live migration or memory hot-plugging,
> the changes to the device and vrings made by message handler
> done independently from vring usage by PMD threads.
> 
> This causes for example segfauls during live-migration
s/segfauls/segfault/

> with MQ enable, but in general virtually any request
> sent by qemu changing the state of device can cause
> problems.
> 
> These patches fixes all above issues by adding a spinlock
> to every vring and requiring message handler to start operation
> only after ensuring that all PMD threads related to the divece
> are out of critical section accessing the vring data.
> 
> Each vring has its own lock in order to not create contention
> between PMD threads of different vrings and to prevent
> performance degradation by scaling queue pair number.

I will run some performance benchmarks to see the impact of the lock.
But I agree this is necessary, because this is not robust enough
currently.

> Signed-off-by: Victor Kaplansky <victork@redhat.com>
> ---
>   lib/librte_vhost/vhost.h      |  1 +
>   lib/librte_vhost/vhost_user.c | 46 +++++++++++++++++++++++++++++++++++++++++++
>   lib/librte_vhost/virtio_net.c |  8 ++++++++
>   3 files changed, 55 insertions(+)
> 
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 1cc81c17..2ebe147f 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -137,6 +137,7 @@ struct vhost_virtqueue {
>   	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
>   	int				iotlb_cache_nr;
>   	TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
> +	rte_spinlock_t active_lock;
>   } __rte_cache_aligned;
>   
>   /* Old kernels have no such macros defined */
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index f4c7ce46..60d82e01 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -1190,6 +1190,46 @@ vhost_user_check_and_alloc_queue_pair(struct virtio_net *dev, VhostUserMsg *msg)
>   	return alloc_vring_queue(dev, vring_idx);
>   }
>   
> +static void
> +vhost_user_lock_all_queue_pairs(struct virtio_net *dev)
> +{
> +	unsigned int i = 0;
> +	unsigned int vq_num = 0;
> +
> +	while (vq_num < dev->nr_vring) {
> +		struct vhost_virtqueue *vq = dev->virtqueue[i];
> +
> +		if (!vq) {
> +			i++;
> +			continue;
> +		}
> +
> +		rte_spinlock_lock(&vq->active_lock);
> +		vq_num++;
> +		i++;
> +	}
> +}
> +
> +static void
> +vhost_user_unlock_all_queue_pairs(struct virtio_net *dev)
> +{
> +	unsigned int i = 0;
> +	unsigned int vq_num = 0;
> +
> +	while (vq_num < dev->nr_vring) {
> +		struct vhost_virtqueue *vq = dev->virtqueue[i];
> +
> +		if (!vq) {
> +			i++;
> +			continue;
> +		}
> +
> +		rte_spinlock_unlock(&vq->active_lock);
> +		vq_num++;
> +		i++;
> +	}
> +}
> +
>   int
>   vhost_user_msg_handler(int vid, int fd)
>   {
> @@ -1241,6 +1281,9 @@ vhost_user_msg_handler(int vid, int fd)
>   		return -1;
>   	}
>   
> +	if (msg.request.master != VHOST_USER_IOTLB_MSG)
> +		vhost_user_lock_all_queue_pairs(dev);
> +

In the case of per-vring requests, like SET_VRING_CALL,
SET_VRING_ENABLE, ... I think it would be better to only lock the
targeted vq.

Also, I think you could have a deadlock here in case of GET_VRING_BASE
request. When receiving it, the .destroy() callback is called with the
new lock held. The problem is that the application is likely to wait for
the rings processing to be completed in the callback.

This is the case for example with Vhost PMD (see update_queuing_status()
in drivers/net/vhost/rte_eth_vhost.c). If the PMD thread is waiting on 
the active_lock, I think we have a deadlock.

Using rte_spinlock_trylock() in the PMD threads as suggested below might
prevent this to happen.

>   	switch (msg.request.master) {
>   	case VHOST_USER_GET_FEATURES:
>   		msg.payload.u64 = vhost_user_get_features(dev);
> @@ -1342,6 +1385,9 @@ vhost_user_msg_handler(int vid, int fd)
>   
>   	}
>   
> +	if (msg.request.master != VHOST_USER_IOTLB_MSG)
> +		vhost_user_unlock_all_queue_pairs(dev);
> +
>   	if (msg.flags & VHOST_USER_NEED_REPLY) {
>   		msg.payload.u64 = !!ret;
>   		msg.size = sizeof(msg.payload.u64);
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 6fee16e5..0337dc28 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -44,6 +44,7 @@
>   #include <rte_udp.h>
>   #include <rte_sctp.h>
>   #include <rte_arp.h>
> +#include <rte_spinlock.h>
>   
>   #include "iotlb.h"
>   #include "vhost.h"
> @@ -1180,9 +1181,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>   	}
>   
>   	vq = dev->virtqueue[queue_id];
> +
>   	if (unlikely(vq->enabled == 0))
>   		return 0;
>   
> +	rte_spinlock_lock(&vq->active_lock);
> +

Using rte_spinlock_trylock might be better, not to block other devices
to be processed on the same CPU.

>   	vq->batch_copy_nb_elems = 0;
>   
>   	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> @@ -1240,6 +1244,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>   		if (rarp_mbuf == NULL) {
>   			RTE_LOG(ERR, VHOST_DATA,
>   				"Failed to allocate memory for mbuf.\n");
> +			rte_spinlock_unlock(&vq->active_lock);
>   			return 0;
>   		}
>   
> @@ -1356,6 +1361,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>   	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
>   		vhost_user_iotlb_rd_unlock(vq);
>   
> +	rte_spinlock_unlock(&vq->active_lock);
> +
>   	if (unlikely(rarp_mbuf != NULL)) {
>   		/*
>   		 * Inject it to the head of "pkts" array, so that switch's mac
> @@ -1366,5 +1373,6 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>   		i += 1;
>   	}
>   
> +
>   	return i;
>   }
> 

It seems the enqueue path part is missing?

Thanks,
Maxime