From: "Xie, Huawei"
To: "Michael S. Tsirkin"
Cc: "dev@dpdk.org", "virtualization@lists.linux-foundation.org"
Date: Mon, 14 Sep 2015 03:08:50 +0000
Subject: Re: [dpdk-dev] virtio optimization idea

On 9/10/2015 3:20 PM, Michael S. Tsirkin wrote:
> On Thu, Sep 10, 2015 at 06:32:35AM +0000, Xie, Huawei wrote:
>> On 9/9/2015 3:34 PM, Michael S. Tsirkin wrote:
>>> On Fri, Sep 04, 2015 at 08:25:05AM +0000, Xie, Huawei wrote:
>>>> Hi:
>>>>
>>>> Recently I have done a virtio optimization proof of concept. The
>>>> optimization includes two parts:
>>>> 1) avail ring set with fixed descriptors
>>>> 2) RX vectorization
>>>> With these optimizations, we could get several times the performance
>>>> boost for pure vhost-virtio throughput.
>>> Thanks!
>>> I'm very happy to see people work on the virtio ring format
>>> optimizations.
>>>
>>> I think it's best to analyze each optimization separately,
>>> unless you see a reason why they would only give benefit when applied
>>> together.
>>
>> Agree. Will split the patch to see each change's benefit. Of course it
>> is also very common for the two together to give much more gain than
>> the sum of the individual changes.
>>
>>> Also ideally, we'd need a unit test to show the performance impact.
>>> We've been using the tests in tools/virtio/ under linux,
>>> feel free to enhance these to simulate more workloads, or
>>> to suggest something else entirely.
>> If possible, we will provide a perf test case under tools/virtio.
>> I am interested in whether the optimization could help kernel
>> virtio-net driver performance, if something else isn't the bottleneck
>> there.
>>> The avail ring is initialized with fixed descriptors and is never
>>> changed, i.e., the index value of the nth avail ring entry is always
>>> n, which means the virtio PMD is actually refilling the desc ring
>>> only, without having to change the avail ring.
>>> When vhost fetches the avail ring, if not evicted, it is always in
>>> its first-level cache.
>>>
>>> When RX receives packets from the used ring, we use used->idx as the
>>> desc idx. This requires that vhost processes and returns descs from
>>> the avail ring to the used ring in order, which is true for both the
>>> current DPDK vhost and the kernel vhost implementation. In my
>>> understanding, there is no necessity for vhost-net to process
>>> descriptors out of order. One case could be zero copy: for example,
>>> if one descriptor doesn't meet the zero-copy requirement, we could
>>> return it to the used ring directly, earlier than the descriptors in
>>> front of it.
>>> To enforce this, I want to use a reserved bit to indicate in-order
>>> processing of descriptors.
>>> So what's the point in changing the idx for the used ring?
>>> You need to communicate the length to the guest anyway, don't you?
>> For the guest virtio driver, we only use the length field in the entry
>> of the used ring and don't use the index in the entry. Instead, we use
>> used->idx & 255 as the desc idx for RX, and used->idx & 127 as the
>> desc idx for TX.
> OK but length and index are in the same cache line.
> As long as we read the length, does it make sense
> to skip reading the index?
I don't understand "skipping reading the index". Currently virtio RX
needs the length field, and the CPU will automatically fetch the
adjacent index field as part of the same cache line, though the
optimized driver doesn't use the index field.
/huawei
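To illustrate the scheme described above, here is a minimal sketch of the
fixed avail ring and the in-order RX path. The structure and function
names are invented for the example and the layout is simplified; this is
not the actual PMD code, and it assumes a 256-entry ring:

    #include <stdint.h>

    #define RING_SIZE 256                 /* rx ring size, power of two */

    struct vring_desc_s {                 /* simplified descriptor */
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
    };

    struct rx_ring {
        struct vring_desc_s desc[RING_SIZE];
        uint16_t avail_ring[RING_SIZE];   /* stands in for avail->ring[] */
        uint16_t avail_idx;               /* stands in for avail->idx */
        volatile uint16_t used_idx;       /* used->idx, written by vhost */
        uint16_t last_used_idx;           /* driver-side shadow */
    };

    /* One-time setup: avail entry n always names descriptor n. The
     * driver never writes avail->ring[] again, so that cache line can
     * stay resident in the host's first-level cache. */
    static void fixed_avail_init(struct rx_ring *r)
    {
        for (uint16_t i = 0; i < RING_SIZE; i++)
            r->avail_ring[i] = i;
        r->avail_idx = RING_SIZE;         /* offer every buffer up front */
    }

    /* RX: because vhost consumes and completes descriptors in order,
     * the running used index masked by the ring size directly names the
     * next descriptor; the id field of the used element is never read. */
    static int rx_one(struct rx_ring *r, uint16_t *desc_idx)
    {
        if (r->last_used_idx == r->used_idx)
            return 0;                     /* nothing new from the host */
        *desc_idx = r->last_used_idx++ & (RING_SIZE - 1);
        return 1;
    }

After rx_one() returns a desc idx, the driver would read only the length
from the used element, refill desc[desc_idx] with a fresh buffer, and
bump avail_idx; the avail ring itself is never touched again.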
>
>> For the vhost driver, as it can't assume a fixed ring layout (to
>> support legacy virtio), it needs to update the idx in the used ring
>> entry.
>>>> For the tx ring, the arrangement is like below. Each transmitted
>>>> mbuf needs a desc for the virtio_net_hdr, so actually we have only
>>>> 128 free slots.
>>> Just fix this one. Support ANY_LAYOUT and then you can put data
>>> linearly. And/or support INDIRECT_DESC and then you can
>>> use an indirect descriptor.
>> Will check those two features.
>>>>
>>>>
>>> This one came out corrupted.
>> Actually I already replied to the original mail and fixed it. Copying
>> it here again.
>>
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>> |  0  |  1  | ... | 127 || 128 | 129 | ... | 255 |   avail ring
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>>    |     |           |      |     |           |
>>    v     v           v      v     v           v
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>> | 128 | 129 | ... | 255 || 128 | 129 | ... | 255 |   desc ring for virtio_net_hdr
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>>    |     |           |      |     |           |
>>    v     v           v      v     v           v
>> +-----+-----+-----+-----++-----+-----+-----+-----+
>> |  0  |  1  | ... | 127 ||  0  |  1  | ... | 127 |   desc ring for tx data
>> +-----+-----+-----+-----++-----+-----+-----+-----+
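A rough sketch of that tx layout, matching the diagram (again with
invented names and simplified structures, not the PMD source; assumes a
256-entry ring): avail entries i and i+128 both point at header
descriptor i+128, which is statically chained to data descriptor i, so
per packet only the data descriptor and the avail index change:

    #include <stdint.h>

    #define TX_RING_SIZE 256
    #define TX_SLOTS     (TX_RING_SIZE / 2)   /* 128 usable tx slots */
    #define F_NEXT       1                    /* VRING_DESC_F_NEXT */

    struct vring_desc_s {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
    };

    /* One-time setup of the static tx layout: descriptors 128..255 hold
     * the per-slot virtio_net_hdr and chain to data descriptors 0..127;
     * the avail ring repeats the same 128 header descs twice, matching
     * "used->idx & 127" as the tx slot index. */
    static void tx_ring_init(struct vring_desc_s desc[TX_RING_SIZE],
                             uint16_t avail_ring[TX_RING_SIZE],
                             uint64_t hdr_mem, uint32_t hdr_size)
    {
        for (uint16_t i = 0; i < TX_SLOTS; i++) {
            avail_ring[i]            = i + TX_SLOTS;
            avail_ring[i + TX_SLOTS] = i + TX_SLOTS;

            desc[i + TX_SLOTS].addr  = hdr_mem + (uint64_t)i * hdr_size;
            desc[i + TX_SLOTS].len   = hdr_size;
            desc[i + TX_SLOTS].flags = F_NEXT;  /* chain hdr -> data */
            desc[i + TX_SLOTS].next  = i;

            desc[i].flags = 0;                  /* data desc ends the chain */
        }
    }

Transmit then only fills in desc[slot].addr/len for slot =
avail_idx & (TX_SLOTS - 1) and bumps the avail index.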
>>
>>>>
>>>> /huawei
>>> Please Cc virtio related discussion more widely.
>>> I added the virtualization mailing list.
>>>
>>>
>>> So what you want to do is avoid changing the avail
>>> ring. Isn't it enough to pre-format it and cache
>>> the values in the guest?
>>>
>>> The host can then keep using the avail ring without changes; it will
>>> stay in cache. Something like the below for the guest should do the
>>> trick (untested):
>> Good optimization to tackle the cache line transfer issue.
>> Your change could avoid writing the avail ring entry if it points to
>> the same head index. It could improve the kernel virtio-net driver's
>> performance if the avail ring doesn't change while running.
>> We will also investigate applying this idea to the DPDK virtio PMD
>> slow path.
>> For the optimization I did, which I call the fast path, the driver
>> "knows" that in theory the avail ring will never change, so it doesn't
>> a) allocate and free descriptors or b) touch the avail ring at all.
>> Based on the fact that the DPDK PMD is actually using the simple desc
>> ring below, we could directly map avail->idx and used->idx to a desc
>> idx, and use vector instructions to do the processing in parallel.
>>
>> +-----+-----+-----+---------+-----+-----+
>> |  0  |  1  |  2  |   ...   | 254 | 255 |   rx desc ring
>> +-----+-----+-----+---------+-----+-----+
> Yes, using these instructions in kernel is generally problematic,
> but can work in userspace. Waiting for a description of that.
About vectorization? Will do that.
/huawei
>
>
>>
>>> Signed-off-by: Michael S. Tsirkin
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 096b857..9363b50 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -91,6 +91,7 @@ struct vring_virtqueue {
>>>  	bool last_add_time_valid;
>>>  	ktime_t last_add_time;
>>>  #endif
>>> +	u16 *avail;
>>>  
>>>  	/* Tokens for callbacks. */
>>>  	void *data[];
>>> @@ -236,7 +237,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>>  	/* Put entry in available array (but don't update avail->idx until they
>>>  	 * do sync). */
>>>  	avail = virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) & (vq->vring.num - 1);
>>> -	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>> +	if (vq->avail[avail] != head) {
>>> +		vq->avail[avail] = head;
>>> +		vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>> +	}
>>>  
>>>  	/* Descriptors and available array need to be set before we expose the
>>>  	 * new available array entries. */
>>> @@ -724,6 +728,11 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
>>>  	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
>>>  	if (!vq)
>>>  		return NULL;
>>> +	vq->avail = kzalloc(sizeof(*vq->avail) * num, GFP_KERNEL);
>>> +	if (!vq->avail) {
>>> +		kfree(vq);
>>> +		return NULL;
>>> +	}
>>>  
>>>  	vring_init(&vq->vring, num, pages, vring_align);
>>>  	vq->vq.callback = callback;
>>>
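One gap in the patch as posted: vq->avail is allocated in
vring_new_virtqueue() but never freed, so queue teardown would leak it.
A matching kfree() in vring_del_virtqueue() would presumably be needed;
roughly, against the same tree (untested, hunk position elided):

    diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
    --- a/drivers/virtio/virtio_ring.c
    +++ b/drivers/virtio/virtio_ring.c
    @@ ... @@ void vring_del_virtqueue(struct virtqueue *vq)
     void vring_del_virtqueue(struct virtqueue *vq)
     {
     	list_del(&vq->list);
    +	kfree(to_vvq(vq)->avail);	/* free the shadow avail array */
     	kfree(to_vvq(vq));
     }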