From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <huawei.xie@intel.com>
Received: from mga09.intel.com (mga09.intel.com [134.134.136.24])
 by dpdk.org (Postfix) with ESMTP id 9D7138DB3
 for <dev@dpdk.org>; Tue,  8 Sep 2015 11:42:33 +0200 (CEST)
Received: from orsmga002.jf.intel.com ([10.7.209.21])
 by orsmga102.jf.intel.com with ESMTP; 08 Sep 2015 02:42:32 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.17,489,1437462000"; d="scan'208";a="799923304"
Received: from pgsmsx105.gar.corp.intel.com ([10.221.44.96])
 by orsmga002.jf.intel.com with ESMTP; 08 Sep 2015 02:42:30 -0700
Received: from shsmsx103.ccr.corp.intel.com (10.239.4.69) by
 PGSMSX105.gar.corp.intel.com (10.221.44.96) with Microsoft SMTP Server (TLS)
 id 14.3.224.2; Tue, 8 Sep 2015 17:42:29 +0800
Received: from shsmsx101.ccr.corp.intel.com ([169.254.1.171]) by
 SHSMSX103.ccr.corp.intel.com ([169.254.4.248]) with mapi id 14.03.0224.002;
 Tue, 8 Sep 2015 17:42:28 +0800
From: "Xie, Huawei" <huawei.xie@intel.com>
To: Tetsuya Mukawa <mukawa@igel.co.jp>, "dev@dpdk.org" <dev@dpdk.org>, Thomas
 Monjalon <thomas.monjalon@6wind.com>, Linhaifeng <haifeng.lin@huawei.com>
Thread-Topic: virtio optimization idea
Thread-Index: AdDm6zPdM5XrIXmIQz2JKkVwrZfYFQ==
Date: Tue, 8 Sep 2015 09:42:27 +0000
Message-ID: <C37D651A908B024F974696C65296B57B2BDBFEC2@SHSMSX101.ccr.corp.intel.com>
References: <C37D651A908B024F974696C65296B57B2BDB8C06@SHSMSX101.ccr.corp.intel.com>
 <C37D651A908B024F974696C65296B57B2BDB922F@SHSMSX101.ccr.corp.intel.com>
 <55EE9A75.7020306@igel.co.jp>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Cc: "ms >> Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [dpdk-dev] virtio optimization idea
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 08 Sep 2015 09:42:34 -0000

On 9/8/2015 4:21 PM, Tetsuya Mukawa wrote:
> On 2015/09/05 1:50, Xie, Huawei wrote:
>> There was a formatting issue with the ASCII chart of the tx ring. The
>> chart is updated below.
>> Sorry for the trouble.
> Hi Xie,
>
> Thanks for sharing a way to optimize virtio.
> I have a few questions.
>
>> On 9/4/2015 4:25 PM, Xie, Huawei wrote:
>>> Hi:
>>>
>>> Recently I have done a proof of concept of a virtio optimization. The
>>> optimization includes two parts:
>>> 1) avail ring set with fixed descriptors
>>> 2) RX vectorization
>>> With the optimizations, we could get a several-fold performance boost
>>> for pure vhost-virtio throughput.
> When you measured performance, had you optimized only the virtio-net
> driver? If so, can we also optimize the vhost backend (librte_vhost)
> using the same approach?

We could optimize vhost based on the same vring layout, but since vhost
also has to support legacy virtio drivers, it cannot assume this layout.
>>> Here I will only cover the first part, which is the prerequisite for the
>>> second part.
>>> Let us first take RX as an example. Currently, when we fill the avail
>>> ring with guest mbufs, we need to:
>>> a) allocate one descriptor (for a non-sg mbuf) from the free descriptors
>>> b) set the idx of the desc into the entry of the avail ring
>>> c) set the addr/len fields of the descriptor to point to the data area
>>> of a blank guest mbuf
>>>
>>> Those operations take time, and step b in particular leaves the cache
>>> line of the avail ring in the modified (M) state in the virtio
>>> processing core. When vhost processes the avail ring, the cache line
>>> transfer from the virtio processing core to the vhost processing core
>>> takes a lot of CPU cycles.
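
For reference, steps a) to c) can be sketched roughly as below. This is only
an illustration, not the actual virtio PMD code: struct rxq, rxq_init() and
rxq_refill_one() are made-up names, the free list chained through
desc[].next is an assumption, and memory barriers and the host notification
are left out.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 256            /* assumed ring size, as in the chart */
#define VRING_DESC_F_WRITE 2      /* buffer is written by the device */

struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[QUEUE_SIZE];
};

/* Hypothetical driver state: free descriptors chained through desc[].next. */
struct rxq {
	struct vring_desc desc[QUEUE_SIZE];
	struct vring_avail avail;
	uint16_t free_head;
	uint16_t free_cnt;
};

static void rxq_init(struct rxq *q)
{
	assert(q);
	for (uint16_t i = 0; i < QUEUE_SIZE; i++)
		q->desc[i].next = (uint16_t)(i + 1);
	q->free_head = 0;
	q->free_cnt = QUEUE_SIZE;
	q->avail.flags = 0;
	q->avail.idx = 0;
}

/* Conventional refill of one buffer. Returns 0 on success, -1 if no
 * descriptor is free. */
static int rxq_refill_one(struct rxq *q, uint64_t buf_addr, uint32_t buf_len)
{
	if (q->free_cnt == 0)
		return -1;

	/* a) allocate one descriptor from the free list */
	uint16_t d = q->free_head;
	q->free_head = q->desc[d].next;
	q->free_cnt--;

	/* b) publish the descriptor index: this store dirties the avail
	 *    ring cache line on the virtio core */
	q->avail.ring[q->avail.idx % QUEUE_SIZE] = d;

	/* c) point the descriptor at the blank buffer */
	q->desc[d].addr = buf_addr;
	q->desc[d].len = buf_len;
	q->desc[d].flags = VRING_DESC_F_WRITE;

	/* a real driver needs a write barrier before this update */
	q->avail.idx++;
	return 0;
}
```

The store in step b) is the one the proposal removes.
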
>>> To solve this problem, this is the arrangement of the RX ring for the
>>> DPDK PMD (for the non-mergeable case).
>>>
>>>                     avail
>>>                     idx
>>>                     +
>>>                     |
>>> +----+----+---+-------------+------+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  avail ring
>>> +-+--+-+--+-+-+---------+---+--+---+
>>>   |    |    |       |   |      |
>>>   |    |    |       |   |      |
>>>   v    v    v       |   v      v
>>> +-+--+-+--+-+-+---------+---+--+---+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  desc ring
>>> +----+----+---+-------------+------+
>>>                     |
>>>                     |
>>> +----+----+---+-------------+------+
>>> | 0  | 1  | 2 |     |  254  | 255  |  used ring
>>> +----+----+---+-------------+------+
>>>                     |
>>>                     +
>>> The avail ring is initialized with fixed descriptors and is never
>>> changed, i.e., the index value of the nth avail ring entry is always n,
>>> which means the virtio PMD is actually refilling the desc ring only,
>>> without having to change the avail ring.
> For example, the avail ring is like below.
> struct vring_avail {
>         uint16_t flags;
>         uint16_t idx;
>         uint16_t ring[QUEUE_SIZE];
> };
>
> My understanding is that the virtio-net driver still needs to change
> avail_ring.idx, but doesn't need to change avail_ring.ring[].
> Is this correct?

Yes, avail_ring.ring[] is initialized once and never gets updated; only
avail_ring.idx still changes. It is as if the virtio frontend were using
only the descriptor ring.
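
A minimal sketch of this scheme, assuming a 256-entry ring as in the chart
above. The names struct fixed_rxq, fixed_rxq_init() and
fixed_rxq_refill_one() are made up for illustration and are not the PMD's
API; barriers, host notification and the check that a slot has already come
back through the used ring are left out.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 256            /* assumed ring size */
#define VRING_DESC_F_WRITE 2      /* buffer is written by the device */

struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[QUEUE_SIZE];
};

struct fixed_rxq {
	struct vring_desc desc[QUEUE_SIZE];
	struct vring_avail avail;
};

/* One-time setup: entry n of the avail ring always names descriptor n. */
static void fixed_rxq_init(struct fixed_rxq *q)
{
	assert(q);
	for (uint16_t i = 0; i < QUEUE_SIZE; i++) {
		q->avail.ring[i] = i;
		q->desc[i].addr = 0;
		q->desc[i].len = 0;
		q->desc[i].flags = VRING_DESC_F_WRITE;
		q->desc[i].next = 0;
	}
	q->avail.flags = 0;
	q->avail.idx = 0;
}

/* Refill one buffer: only the descriptor and avail.idx are written.
 * The caller must make sure the slot has been returned by the host. */
static void fixed_rxq_refill_one(struct fixed_rxq *q, uint64_t buf_addr,
				 uint32_t buf_len)
{
	uint16_t slot = q->avail.idx % QUEUE_SIZE;

	q->desc[slot].addr = buf_addr;
	q->desc[slot].len = buf_len;

	/* a real driver needs a write barrier before this update */
	q->avail.idx++;
}
```

After fixed_rxq_init(), avail.ring[] is read-only for the frontend, so its
cache lines are no longer dirtied on every refill.
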
>
> Tetsuya
>
>>> When vhost fetches the avail ring, unless it has been evicted, it is
>>> always in vhost's first-level cache.
>>>
>>> When RX receives packets from the used ring, we use used->idx as the
>>> desc idx. This requires that vhost processes and returns descs from the
>>> avail ring to the used ring in order, which is true for both the current
>>> DPDK vhost and the kernel vhost implementations. In my understanding,
>>> there is no need for vhost-net to process descriptors out of order
>>> (OOO). One possible exception is zero copy: for example, if one
>>> descriptor doesn't meet the zero copy requirement, we could return it
>>> directly to the used ring, earlier than the descriptors in front of it.
>>> To enforce this, I want to use a reserved bit to indicate in-order
>>> processing of descriptors.
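
The in-order assumption can be sketched as below: the position in the used
ring is taken directly as the descriptor index, so the id field of the used
element is never read. rx_harvest_in_order() and its parameters are
hypothetical names for illustration only, the ring size of 256 is assumed,
and the read barrier a real driver needs after loading used->idx is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 256            /* assumed ring size */

struct vring_used_elem {
	uint32_t id;
	uint32_t len;
};

struct vring_used {
	uint16_t flags;
	uint16_t idx;
	struct vring_used_elem ring[QUEUE_SIZE];
};

/* Collect up to max completed buffers. *last_used is the driver's private
 * copy of the used index. Returns the number of buffers collected. */
static uint16_t rx_harvest_in_order(const struct vring_used *used,
				    uint16_t *last_used, uint16_t *desc_idx,
				    uint32_t *pkt_len, uint16_t max)
{
	assert(used && last_used && desc_idx && pkt_len);

	/* free-running 16-bit counters: the difference wraps correctly */
	uint16_t n = (uint16_t)(used->idx - *last_used);
	if (n > max)
		n = max;

	for (uint16_t i = 0; i < n; i++) {
		uint16_t slot =
			(uint16_t)((uint16_t)(*last_used + i) % QUEUE_SIZE);

		/* in-order return: the ring position is the descriptor
		 * index, used->ring[slot].id is not needed */
		desc_idx[i] = slot;
		pkt_len[i] = used->ring[slot].len;
	}

	*last_used = (uint16_t)(*last_used + n);
	return n;
}
```

If vhost ever returned descriptors out of order, slot and
used->ring[slot].id would differ, which is why the in-order guarantee is
needed.
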
>>>
>>> For the tx ring, the arrangement is like below. Each transmitted mbuf
>>> needs a desc for virtio_net_hdr, so we actually have only 128 free slots.
>>>
>>>                             ++
>>>                             ||
>>>                             ||
>>>    +-----+-----+-----+--------------+------+------+------+
>>>    |  0  |  1  | ... |  127 || 128  | 129  | ...  | 255  |   avail ring
>>>    +--+--+--+--+-----+---+------+---+--+---+------+--+---+
>>>       |     |            |  ||  |      |             |
>>>       v     v            v  ||  v      v             v
>>>    +--+--+--+--+-----+---+------+---+--+---+------+--+---+
>>>    | 128 | 129 | ... |  255 || 128  | 129  | ...  | 255  |   desc ring for virtio_net_hdr
>>>    +--+--+--+--+-----+---+------+---+--+---+------+--+---+
>>>       |     |            |  ||  |      |             |
>>>       v     v            v  ||  v      v             v
>>>    +--+--+--+--+-----+---+------+---+--+---+------+--+---+
>>>    |  0  |  1  | ... |  127 ||  0   |  1   | ...  | 127  |   desc ring for tx data
>>>
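
One possible reading of the tx arrangement quoted above, as a sketch: the
upper half of the desc ring (128..255) holds the virtio_net_hdr descriptors,
each chained to a data descriptor in the lower half (0..127), and both
halves of the avail ring point at the header descriptors. struct tx_ring,
tx_ring_init(), tx_enqueue_one(), the hdr_base parameter and the 10-byte
header length are assumptions for illustration, not the PMD code.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 256                /* assumed ring size */
#define HALF (QUEUE_SIZE / 2)         /* 128 usable tx slots */
#define VRING_DESC_F_NEXT 1           /* descriptor chains to desc.next */
#define HDR_LEN 10                    /* assumed virtio_net_hdr size */

struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[QUEUE_SIZE];
};

struct tx_ring {
	struct vring_desc desc[QUEUE_SIZE];
	struct vring_avail avail;
};

/* One-time setup. hdr_base is the (assumed) address of an array of HALF
 * preallocated virtio_net_hdr structures. */
static void tx_ring_init(struct tx_ring *q, uint64_t hdr_base)
{
	assert(q);
	for (uint16_t i = 0; i < HALF; i++) {
		uint16_t hdr = (uint16_t)(HALF + i);

		/* header descriptor, permanently chained to data desc i */
		q->desc[hdr].addr = hdr_base + (uint64_t)i * HDR_LEN;
		q->desc[hdr].len = HDR_LEN;
		q->desc[hdr].flags = VRING_DESC_F_NEXT;
		q->desc[hdr].next = i;

		/* data descriptor, filled in per packet */
		q->desc[i].addr = 0;
		q->desc[i].len = 0;
		q->desc[i].flags = 0;
		q->desc[i].next = 0;

		/* both avail entries i and HALF + i name the same header */
		q->avail.ring[i] = hdr;
		q->avail.ring[HALF + i] = hdr;
	}
	q->avail.flags = 0;
	q->avail.idx = 0;
}

/* Queue one packet: only the data descriptor and avail.idx are written.
 * The caller must make sure the slot has been returned by the host. */
static void tx_enqueue_one(struct tx_ring *q, uint64_t data_addr,
			   uint32_t data_len)
{
	uint16_t data = q->avail.idx % HALF;

	q->desc[data].addr = data_addr;
	q->desc[data].len = data_len;

	/* a real driver needs a write barrier before this update */
	q->avail.idx++;
}
```

As with RX, the avail ring is written only once at setup time.
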
>>> /huawei
>>>
>