From: "Xie, Huawei" <huawei.xie@intel.com>
To: Stephen Hemminger <stephen@networkplumber.org>
Date: Thu, 17 Sep 2015 15:41:36 +0000
Message-ID: <C37D651A908B024F974696C65296B57B40F1D8C1@SHSMSX101.ccr.corp.intel.com>
References: <C37D651A908B024F974696C65296B57B2BDB8C06@SHSMSX101.ccr.corp.intel.com>
 <20150908083926.3f2f409f@urahara>
 <C37D651A908B024F974696C65296B57B2BDC0872@SHSMSX101.ccr.corp.intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, "virtualization@lists.linux-foundation.org"
 <virtualization@lists.linux-foundation.org>,
 "ms >> Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [dpdk-dev] virtio optimization idea

On 9/8/2015 11:54 PM, Xie, Huawei wrote:
> On 9/8/2015 11:39 PM, Stephen Hemminger wrote:
>> On Fri, 4 Sep 2015 08:25:05 +0000
>> "Xie, Huawei" <huawei.xie@intel.com> wrote:
>>
>>> Hi:
>>>
>>> Recently I did a proof of concept for a virtio optimization. The
>>> optimization includes two parts:
>>> 1) avail ring set with fixed descriptors
>>> 2) RX vectorization
>>> With these optimizations, we see a several-fold performance boost for
>>> pure vhost-virtio throughput.
>>>
>>> Here I will only cover the first part, which is the prerequisite for
>>> the second part.
>>> Let us take RX as an example first. Currently, when we fill the avail
>>> ring with guest mbufs, we need to:
>>> a) allocate one descriptor (for a non-sg mbuf) from the free
>>> descriptors
>>> b) set the idx of the desc into the entry of the avail ring
>>> c) set the addr/len fields of the descriptor to point to the blank
>>> guest mbuf data area
>>>
>>> These operations take time, and step b in particular leaves the cache
>>> line for the avail ring in Modified (M) state on the virtio processing
>>> core. When vhost processes the avail ring, transferring that cache
>>> line from the virtio processing core to the vhost processing core
>>> costs quite a few CPU cycles.
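For reference, here is a minimal sketch of that conventional refill path
(steps a-c), using the standard vring layout from linux/virtio_ring.h.
The helper name is illustrative, free-list handling is reduced to a
caller-supplied index, and barriers are omitted; this is not the actual
PMD code.

    #include <stdint.h>
    #include <linux/virtio_ring.h>

    /* Sketch of the conventional RX refill. free_idx is a descriptor
     * taken from the free list (step a); ring size is a power of two. */
    static void
    rx_refill_conventional(struct vring *vr, uint16_t free_idx,
                           uint64_t buf_addr, uint32_t buf_len)
    {
        /* (c) point the descriptor at the blank mbuf data area */
        vr->desc[free_idx].addr  = buf_addr;
        vr->desc[free_idx].len   = buf_len;
        vr->desc[free_idx].flags = VRING_DESC_F_WRITE; /* device writes */

        /* (b) publish the index in the avail ring: this store is what
         * puts the avail-ring cache line into Modified state */
        vr->avail->ring[vr->avail->idx & (vr->num - 1)] = free_idx;
        vr->avail->idx++;  /* write barrier before this omitted */
    }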
>>> To solve this problem, this is the arrangement of the RX ring for the
>>> DPDK PMD (for the non-mergeable case):
>>>                     avail
>>>                     idx
>>>                     +
>>>                     |
>>> +----+----+---+-------------+------+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  avail ring
>>> +-+--+-+--+-+-+---------+---+--+---+
>>>   |    |    |       |   |      |
>>>   |    |    |       |   |      |
>>>   v    v    v       |   v      v
>>> +-+--+-+--+-+-+---------+---+--+---+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  desc ring
>>> +----+----+---+-------------+------+
>>>                     |
>>>                     |
>>> +----+----+---+-------------+------+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  used ring
>>> +----+----+---+-------------+------+
>>>                     |
>>>                     +
>>> The avail ring is initialized with fixed descriptors and is never
>>> changed, i.e., the index value of the nth avail ring entry is always
>>> n, which means the virtio PMD is actually refilling only the desc
>>> ring, without having to change the avail ring.
>>> When vhost fetches the avail ring, unless evicted, it is always in
>>> its first-level cache.
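A minimal sketch of that fixed arrangement, with illustrative helper
names: the avail ring entries are written exactly once at setup, and the
per-packet refill touches only the descriptor plus the avail index.

    #include <stdint.h>
    #include <linux/virtio_ring.h>

    /* One-time setup: entry n of the avail ring always holds index n. */
    static void
    rx_avail_ring_fixed_init(struct vring *vr)
    {
        uint16_t i;

        for (i = 0; i < vr->num; i++)
            vr->avail->ring[i] = i;
    }

    /* Per-packet refill: only the desc ring is rewritten; the avail
     * ring entries stay unmodified, so vhost's cached copy stays valid. */
    static void
    rx_refill_fixed(struct vring *vr, uint64_t buf_addr, uint32_t buf_len)
    {
        uint16_t idx = vr->avail->idx & (vr->num - 1);

        vr->desc[idx].addr  = buf_addr;
        vr->desc[idx].len   = buf_len;
        vr->desc[idx].flags = VRING_DESC_F_WRITE;
        vr->avail->idx++;  /* write barrier before this omitted */
    }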
>>>
>>> When RX receives packets from the used ring, we use used->idx as the
>>> desc idx. This requires that vhost processes and returns descs from
>>> the avail ring to the used ring in order, which is true for both the
>>> current DPDK vhost and kernel vhost implementations. In my
>>> understanding, there is no necessity for vhost-net to process
>>> descriptors out of order. One case could be zero copy: for example,
>>> if one descriptor doesn't meet the zero-copy requirement, we could
>>> return it to the used ring directly, earlier than the descriptors in
>>> front of it.
>>> To enforce this, I want to use a reserved bit to indicate in-order
>>> processing of descriptors.
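Under that in-order guarantee, the receive path can recover completed
descriptor indexes from used->idx alone, never dereferencing the used
ring entries. A sketch: desc_to_mbuf() is a hypothetical lookup from a
descriptor index to the mbuf posted there, and packet length handling
is omitted.

    #include <stdint.h>
    #include <linux/virtio_ring.h>

    void *desc_to_mbuf(uint16_t desc_idx);  /* hypothetical helper */

    /* Harvest up to max completed RX buffers, assuming in-order return.
     * last_used is PMD-local state; uint16_t arithmetic wraps safely,
     * and the ring size is a power of two. */
    static uint16_t
    rx_harvest_in_order(struct vring *vr, uint16_t *last_used,
                        void **pkts, uint16_t max)
    {
        uint16_t done = vr->used->idx - *last_used;
        uint16_t n = done < max ? done : max;
        uint16_t i;

        for (i = 0; i < n; i++) {
            uint16_t idx = (uint16_t)(*last_used + i) & (vr->num - 1);
            pkts[i] = desc_to_mbuf(idx);  /* no used->ring[] read */
        }
        *last_used += n;
        return n;
    }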
>>>
>>> For the TX ring, the arrangement is like below. Each transmitted mbuf
>>> needs a desc for the virtio_net_hdr, so actually we have only 128
>>> free slots.

                            ++
                            ||
                            ||
   +-----+-----+-----+--------------+------+------+------+
   |  0  |  1  | ... |  127 || 128  | 129  | ...  | 255  |   avail ring
   +--+--+--+--+-----+---+------+---+--+---+------+--+---+
      |     |            |  ||  |      |             |
      v     v            v  ||  v      v             v
   +--+--+--+--+-----+---+------+---+--+---+------+--+---+
   | 127 | 128 | ... |  255 || 127  | 128  | ...  | 255  |   desc ring for virtio_net_hdr
   +--+--+--+--+-----+---+------+---+--+---+------+--+---+
      |     |            |  ||  |      |             |
      v     v            v  ||  v      v             v
   +--+--+--+--+-----+---+------+---+--+---+------+--+---+
   |  0  |  1  | ... |  127 ||  0   |  1   | ...  | 127  |   desc ring for tx data

>>>
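For illustration, a sketch of a fixed TX setup consistent with the
diagram: each avail entry permanently points at a header descriptor,
which is permanently chained to its data descriptor, so transmit only
rewrites the data descriptor and header contents. The exact index
pairing and the helper name here are illustrative assumptions, not the
actual PMD code.

    #include <stdint.h>
    #include <linux/virtio_ring.h>

    /* One-time TX setup: descriptors [128, 255] each hold a
     * virtio_net_hdr located at hdr_base + n * hdr_len and chain to
     * data descriptor n in [0, 127]; the avail ring permanently
     * publishes the header descriptors. */
    static void
    tx_ring_fixed_init(struct vring *vr, uint64_t hdr_base,
                       uint32_t hdr_len)
    {
        uint16_t i;

        for (i = 0; i < 128; i++) {
            vr->avail->ring[i]       = 128 + i;
            vr->avail->ring[128 + i] = 128 + i;  /* second half repeats */

            vr->desc[128 + i].addr  = hdr_base + (uint64_t)i * hdr_len;
            vr->desc[128 + i].len   = hdr_len;
            vr->desc[128 + i].flags = VRING_DESC_F_NEXT; /* hdr -> data */
            vr->desc[128 + i].next  = i;
        }
        /* Per-packet TX then only sets desc[n].addr/len for the data
         * descriptor and bumps avail->idx. */
    }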
>> Does this still work with a Linux (or BSD) guest/host?
>> If you are assuming both virtio and vhost are DPDK, this is never
>> going to be usable.
> It works with both the DPDK vhost and kernel vhost implementations.
> But to enforce this, we had better add a new feature bit.
Hi Stephen, some update about compatibility:
This optimization is in theory compliant with the current kernel vhost,
qemu, and DPDK vhost implementations.
Today I ran the DPDK virtio PMD with qemu and kernel vhost, and it
works fine.

>> On a related note, have you looked at getting virtio to support the
>> new standard (not legacy) mode?
> Yes, we have added virtio 1.0 support to our plan.