From: "Xie, Huawei"
To: "Michael S. Tsirkin"
Cc: "dev@dpdk.org", "virtualization@lists.linux-foundation.org"
Date: Mon, 14 Sep 2015 03:08:50 +0000
Subject: Re: [dpdk-dev] virtio optimization idea

On 9/10/2015 3:20 PM, Michael S. Tsirkin wrote:
> On Thu, Sep 10, 2015 at 06:32:35AM +0000, Xie, Huawei wrote:
>> On 9/9/2015 3:34 PM, Michael S. Tsirkin wrote:
>>> On Fri, Sep 04, 2015 at 08:25:05AM +0000, Xie, Huawei wrote:
>>>> Hi:
>>>>
>>>> Recently I have done a virtio optimization proof of concept. The
>>>> optimization includes two parts:
>>>> 1) avail ring set with fixed descriptors
>>>> 2) RX vectorization
>>>> With these optimizations, we could get several times the performance
>>>> boost for pure vhost-virtio throughput.
>>> Thanks!
>>> I'm very happy to see people work on the virtio ring format
>>> optimizations.
>>>
>>> I think it's best to analyze each optimization separately,
>>> unless you see a reason why they would only give benefit when applied
>>> together.
>>
>> Agree. Will split the patch to see each change's benefit. Of course it
>> is also very common for the two together to give much more gain than
>> the sum of the individual changes.
>>
>>> Also ideally, we'd need a unit test to show the performance impact.
>>> We've been using the tests in tools/virtio/ under linux,
>>> feel free to enhance these to simulate more workloads, or
>>> to suggest something else entirely.
>> If possible, we will provide a perf test case under tools/virtio.
>> I am interested in whether the optimization could help kernel
>> virtio-net driver performance, if something else isn't the bottleneck
>> there.
>>> The avail ring is initialized with fixed descriptors and is never
>>> changed, i.e., the index value of the nth avail ring entry is always
>>> n, which means the virtio PMD is actually refilling the desc ring
>>> only, without having to change the avail ring.
>>> When vhost fetches the avail ring, if not evicted, it is always in
>>> its first-level cache.
>>>
>>> When RX receives packets from the used ring, we use used->idx as the
>>> desc idx. This requires that vhost processes and returns descs from
>>> the avail ring to the used ring in order, which is true for both the
>>> current DPDK vhost and the kernel vhost implementation. In my
>>> understanding, there is no necessity for vhost-net to process
>>> descriptors out of order. One case could be zero copy: for example,
>>> if one descriptor doesn't meet the zero-copy requirement, we could
>>> return it to the used ring directly, earlier than the descriptors in
>>> front of it.
>>> To enforce this, I want to use a reserved bit to indicate in-order
>>> processing of descriptors.
>>> So what's the point in changing the idx for the used ring?
>>> You need to communicate the length to the guest anyway, don't you?
>> For the guest virtio driver, we only use the length field in the entry
>> of the used ring and don't use the index in the entry. Instead, we use
>> used->idx & 255 as the desc idx for RX, and used->idx & 127 as the
>> desc idx for TX.
> OK but length and index are in the same cache line.
> As long as we read the length, does it make sense
> to skip reading the index?
I don't understand "skipping reading the index". Currently virtio RX
needs the length field, and the CPU will automatically fetch the
adjacent index field as part of the same cache line, though the
optimized driver doesn't use the index field.
/huawei
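To illustrate the scheme described above, here is a minimal sketch of the
fixed avail ring and the in-order RX path. The structure and function
names are invented for the example and the layout is simplified; this is
not the actual PMD code, and it assumes a 256-entry ring:

    #include <stdint.h>

    #define RING_SIZE 256                 /* rx ring size, power of two */

    struct vring_desc_s {                 /* simplified descriptor */
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
    };

    struct rx_ring {
        struct vring_desc_s desc[RING_SIZE];
        uint16_t avail_ring[RING_SIZE];   /* stands in for avail->ring[] */
        uint16_t avail_idx;               /* stands in for avail->idx */
        volatile uint16_t used_idx;       /* used->idx, written by vhost */
        uint16_t last_used_idx;           /* driver-side shadow */
    };

    /* One-time setup: avail entry n always names descriptor n. The
     * driver never writes avail->ring[] again, so that cache line can
     * stay resident in the host's first-level cache. */
    static void fixed_avail_init(struct rx_ring *r)
    {
        for (uint16_t i = 0; i < RING_SIZE; i++)
            r->avail_ring[i] = i;
        r->avail_idx = RING_SIZE;         /* offer every buffer up front */
    }

    /* RX: because vhost consumes and completes descriptors in order,
     * the running used index masked by the ring size directly names the
     * next descriptor; the id field of the used element is never read. */
    static int rx_one(struct rx_ring *r, uint16_t *desc_idx)
    {
        if (r->last_used_idx == r->used_idx)
            return 0;                     /* nothing new from the host */
        *desc_idx = r->last_used_idx++ & (RING_SIZE - 1);
        return 1;
    }

After rx_one() returns a desc idx, the driver would read only the length
from the used element, refill desc[desc_idx] with a fresh buffer, and
bump avail_idx; the avail ring itself is never touched again.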
>
>> For the vhost driver, as it can't assume a fixed ring layout (to
>> support legacy virtio), it needs to update the idx in the used ring
>> entry.
>>>> For the tx ring, the arrangement is like below. Each transmitted
>>>> mbuf needs a desc for the virtio_net_hdr, so actually we have only
>>>> 128 free slots.
>>> Just fix this one. Support ANY_LAYOUT and then you can put data
>>> linearly. And/or support INDIRECT_DESC and then you can
>>> use an indirect descriptor.
>> Will check those two features.
>>>>
>>>>
>>> This one came out corrupted.
>> Actually I already replied to the original mail and fixed it. Copying
>> it here again.
>>
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>> |  0  |  1  | ... | 127 || 128 | 129 | ... | 255 |   avail ring
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>>    |     |           |      |     |           |
>>    v     v           v      v     v           v
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>> | 128 | 129 | ... | 255 || 128 | 129 | ... | 255 |   desc ring for virtio_net_hdr
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>>    |     |           |      |     |           |
>>    v     v           v      v     v           v
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>> |  0  |  1  | ... | 127 ||  0  |  1  | ... | 127 |   desc ring for tx data
>> +-----+-----+-----+-----++-----+-----+-----+-----+
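A rough sketch of that tx layout, matching the diagram (again with
invented names and simplified structures, not the PMD source; assumes a
256-entry ring): avail entries i and i+128 both point at header
descriptor i+128, which is statically chained to data descriptor i, so
per packet only the data descriptor and the avail index change:

    #include <stdint.h>

    #define TX_RING_SIZE 256
    #define TX_SLOTS     (TX_RING_SIZE / 2)   /* 128 usable tx slots */
    #define F_NEXT       1                    /* VRING_DESC_F_NEXT */

    struct vring_desc_s {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
    };

    /* One-time setup of the static tx layout: descriptors 128..255 hold
     * the per-slot virtio_net_hdr and chain to data descriptors 0..127;
     * the avail ring repeats the same 128 header descs twice, matching
     * "used->idx & 127" as the tx slot index. */
    static void tx_ring_init(struct vring_desc_s desc[TX_RING_SIZE],
                             uint16_t avail_ring[TX_RING_SIZE],
                             uint64_t hdr_mem, uint32_t hdr_size)
    {
        for (uint16_t i = 0; i < TX_SLOTS; i++) {
            avail_ring[i]            = i + TX_SLOTS;
            avail_ring[i + TX_SLOTS] = i + TX_SLOTS;

            desc[i + TX_SLOTS].addr  = hdr_mem + (uint64_t)i * hdr_size;
            desc[i + TX_SLOTS].len   = hdr_size;
            desc[i + TX_SLOTS].flags = F_NEXT;  /* chain hdr -> data */
            desc[i + TX_SLOTS].next  = i;

            desc[i].flags = 0;                  /* data desc ends the chain */
        }
    }

Transmit then only fills in desc[slot].addr/len for slot =
avail_idx & (TX_SLOTS - 1) and bumps the avail index.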
>>
>>>>
>>>> /huawei
>>> Please Cc virtio related discussion more widely.
>>> I added the virtualization mailing list.
>>>
>>>
>>> So what you want to do is avoid changing the avail
>>> ring. Isn't it enough to pre-format it and cache
>>> the values in the guest?
>>>
>>> The host can then keep using the avail ring without changes; it will
>>> stay in cache. Something like the below for the guest should do the
>>> trick (untested):
>> Good optimization to tackle the cache line transfer issue.
>> Your change could avoid writing the avail ring entry if it points to
>> the same head index. It could improve the kernel virtio-net driver's
>> performance if the avail ring doesn't change while running.
>> We will also investigate applying this idea to the DPDK virtio PMD
>> slow path.
>> For the optimization I did, which I call the fast path, the driver
>> "knows" that in theory the avail ring will never change, so it doesn't
>> a) allocate and free descriptors or b) touch the avail ring at all.
>> Based on the fact that the DPDK PMD is actually using the simple desc
>> ring below, we could directly map avail->idx and used->idx to a desc
>> idx, and use vector instructions to do the processing in parallel.
>>
>> +-----+-----+-----+---------+-----+-----+
>> |  0  |  1  |  2  |   ...   | 254 | 255 |   rx desc ring
>> +-----+-----+-----+---------+-----+-----+
> Yes, using these instructions in kernel is generally problematic,
> but can work in userspace. Waiting for a description of that.
About vectorization? Will do that.
/huawei
>
>
>>
>>> Signed-off-by: Michael S. Tsirkin
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 096b857..9363b50 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -91,6 +91,7 @@ struct vring_virtqueue {
>>>  	bool last_add_time_valid;
>>>  	ktime_t last_add_time;
>>>  #endif
>>> +	u16 *avail;
>>>  
>>>  	/* Tokens for callbacks. */
>>>  	void *data[];
>>> @@ -236,7 +237,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>>  	/* Put entry in available array (but don't update avail->idx until they
>>>  	 * do sync). */
>>>  	avail = virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) & (vq->vring.num - 1);
>>> -	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>> +	if (vq->avail[avail] != head) {
>>> +		vq->avail[avail] = head;
>>> +		vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>> +	}
>>>  
>>>  	/* Descriptors and available array need to be set before we expose the
>>>  	 * new available array entries. */
>>> @@ -724,6 +728,11 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
>>>  	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
>>>  	if (!vq)
>>>  		return NULL;
>>> +	vq->avail = kzalloc(sizeof(*vq->avail) * num, GFP_KERNEL);
>>> +	if (!vq->avail) {
>>> +		kfree(vq);
>>> +		return NULL;
>>> +	}
>>>  
>>>  	vring_init(&vq->vring, num, pages, vring_align);
>>>  	vq->vq.callback = callback;
>>>
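One gap in the patch as posted: vq->avail is allocated in
vring_new_virtqueue() but never freed, so queue teardown would leak it.
A matching kfree() in vring_del_virtqueue() would presumably be needed;
roughly, against the same tree (untested, hunk position elided):

    diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
    --- a/drivers/virtio/virtio_ring.c
    +++ b/drivers/virtio/virtio_ring.c
    @@ ... @@ void vring_del_virtqueue(struct virtqueue *vq)
     void vring_del_virtqueue(struct virtqueue *vq)
     {
     	list_del(&vq->list);
    +	kfree(to_vvq(vq)->avail);	/* free the shadow avail array */
     	kfree(to_vvq(vq));
     }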