From: "Xie, Huawei"
To: "dev@dpdk.org", Thomas Monjalon, Linhaifeng, "Tetsuya Mukawa"
Cc: "Michael S. Tsirkin"
Date: Fri, 4 Sep 2015 08:25:05 +0000
Subject: [dpdk-dev] virtio optimization idea

Hi,

Recently I did a proof of concept of a virtio optimization.
The optimization includes two parts:
1) avail ring set with fixed descriptors
2) RX vectorization
With these optimizations, we could get a performance boost of several
times for pure vhost-virtio throughput.

Here I will only cover the first part, which is the prerequisite for the
second part.
Let us take RX as the first example. Currently, when we fill the avail
ring with guest mbufs, we need to:
a) allocate one descriptor (for a non-sg mbuf) from the free descriptors
b) set the idx of the desc into the entry of the avail ring
c) set the addr/len fields of the descriptor to point to the guest's
blank mbuf data area

Those operations take time, and step b in particular leaves the cache
line for the avail ring in the modified (M) state on the virtio
processing core. When vhost processes the avail ring, transferring that
cache line from the virtio processing core to the vhost processing core
costs quite a lot of CPU cycles.
To solve this problem, this is the arrangement of the RX ring for the
DPDK PMD (for the non-mergeable case).

                   avail
                    idx
                     +
                     |
+-----+-----+-----+-----+-----+-----+
|  0  |  1  |  2  | ... | 254 | 255 |  avail ring
+--+--+--+--+--+--+--+--+--+--+--+--+
   |     |     |     |     |     |
   |     |     |     |     |     |
   v     v     v     |     v     v
+--+--+--+--+--+--+--+--+--+--+--+--+
|  0  |  1  |  2  | ... | 254 | 255 |  desc ring
+-----+-----+-----+--+--+-----+-----+
                     |
                     |
+-----+-----+-----+--+--+-----+-----+
|  0  |  1  |  2  |     | 254 | 255 |  used ring
+-----+-----+-----+--+--+-----+-----+
                     |
                     +

The avail ring is initialized with fixed descriptors and is never
changed, i.e., the index value of the nth avail ring entry is always n,
which means the virtio PMD is actually refilling the desc ring only,
without having to change the avail ring.
When vhost fetches the avail ring, if not evicted, it is always in its
first level cache.

When RX receives packets from the used ring, we use used->idx as the
desc idx. This requires that vhost processes and returns descs from the
avail ring to the used ring in order, which is true for both the current
DPDK vhost and the kernel vhost implementations. In my understanding,
there is no necessity for vhost net to process descriptors out of order
(OOO). One case where it could happen is zero copy: for example, if one
descriptor doesn't meet the zero copy requirement, we could directly
return it to the used ring, earlier than the descriptors in front of it.
To enforce this, I want to use a reserved bit to indicate in-order
processing of descriptors.

For the TX ring, the arrangement is like below. Each transmitted mbuf
needs a desc for virtio_net_hdr, so actually we have only 128 free slots.

                        ++
                        ||
                        ||
+-----+-----+-----+-----++-----+-----+-----+-----+
|  0  |  1  | ... | 127 || 128 | 129 | ... | 255 |  avail ring with fixed descriptor
+--+--+--+--+-----+--+--++--+--+--+--+-----+--+--+
   |     |           |  ||  |     |           |
   v     v           v  ||  v     v           v
+--+--+--+--+-----+--+--++--+--+--+--+-----+--+--+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 |  desc ring for virtio_net_hdr
+--+--+--+--+-----+--+--++--+--+--+--+-----+--+--+
   |     |           |  ||  |     |           |
   v     v           v  ||  v     v           v
+--+--+--+--+-----+--+--++--+--+--+--+-----+--+--+
|  0  |  1  | ... | 127 ||  0  |  1  | ... | 127 |  desc ring for tx data
+-----+-----+-----+-----++-----+-----+-----+-----+

/huawei
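
For illustration only, the two arrangements described above could be
set up roughly as in the C sketch below. This is not the DPDK virtio
PMD code: struct desc_sketch, struct ring_sketch and all four functions
are hypothetical names, the ring size is fixed at 256 as in the
diagrams, and the sketch assumes the virtio_net_hdr descriptors occupy
the upper half of the desc ring (128..255), each chained to a data
descriptor in the lower half (0..127). Memory barriers and the
publishing of the avail idx to shared memory are left out.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE   256u             /* ring size used in the diagrams */
#define TX_SLOTS    (RING_SIZE / 2)  /* 128 usable TX slots */
#define DESC_F_NEXT 1u               /* "chained" flag, value assumed here */

/* Simplified descriptor, modelled on the vring descriptor fields. */
struct desc_sketch {
    uint64_t addr;   /* buffer address */
    uint32_t len;    /* buffer length */
    uint16_t flags;
    uint16_t next;   /* index of the next descriptor in a chain */
};

/* Hypothetical ring state, not the DPDK virtqueue structure. */
struct ring_sketch {
    struct desc_sketch desc[RING_SIZE];
    uint16_t avail_ring[RING_SIZE];
    uint16_t avail_idx;
};

/* RX setup: avail entry n always names descriptor n. Written once. */
static void rx_ring_init(struct ring_sketch *r)
{
    assert((RING_SIZE & (RING_SIZE - 1)) == 0);  /* power of two */
    for (uint16_t i = 0; i < RING_SIZE; i++) {
        r->avail_ring[i] = i;
        r->desc[i].flags = 0;
    }
    r->avail_idx = 0;
}

/* RX refill: only the descriptor and the avail idx are written;
 * the avail ring entries themselves are never touched again. */
static void rx_refill_one(struct ring_sketch *r, uint64_t buf_addr,
                          uint32_t buf_len)
{
    uint16_t slot = (uint16_t)(r->avail_idx & (RING_SIZE - 1));

    r->desc[slot].addr = buf_addr;
    r->desc[slot].len = buf_len;
    r->avail_idx++;
}

/* With in-order completion, the used idx alone identifies the desc. */
static uint16_t rx_used_to_desc(uint16_t used_idx)
{
    return (uint16_t)(used_idx & (RING_SIZE - 1));
}

/* TX setup: both halves of the avail ring point at the header
 * descriptors 128..255, and header desc (i + 128) chains to data
 * desc i. hdr_base/hdr_len describe a caller-provided header array. */
static void tx_ring_init(struct ring_sketch *r, uint64_t hdr_base,
                         uint32_t hdr_len)
{
    for (uint16_t i = 0; i < TX_SLOTS; i++) {
        uint16_t hdr = (uint16_t)(i + TX_SLOTS);

        r->avail_ring[i] = hdr;
        r->avail_ring[hdr] = hdr;
        r->desc[hdr].addr = hdr_base + (uint64_t)i * hdr_len;
        r->desc[hdr].len = hdr_len;
        r->desc[hdr].flags = DESC_F_NEXT;
        r->desc[hdr].next = i;
        r->desc[i].flags = 0;
    }
    r->avail_idx = 0;
}
```

In this sketch the per-packet TX work would be limited to filling the
addr/len of data descriptor i, mirroring what rx_refill_one does on the
RX side.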